google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Problem with custom sentencepiece model #634

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi guys,

I trained a new sentencepiece model from scratch on my pretraining dataset, but I still get unk tokens. Do you know why? I remember it was working smoothly last summer. Specifically, I see: ⁇ extra_id_0> ⁇ @ ⁇ extra_id_1> Furthermore, the same ⁇ appears even for the curly brace '{'.

Thanks in advance

adarob commented 3 years ago

Did you use byte_fallback? Are you installing t5 from github head or from pip? The extra_id change hasn't been pushed into the pip package yet.


ghost commented 3 years ago

@adarob Hi Rob, thanks a lot for the reply. I installed T5 from pip. I'm adapting your Jupyter notebook, and no, I didn't use byte_fallback; I just created a new sentencepiece model as usual with the standard parameters.