kjappelbaum opened this issue 4 weeks ago
perhaps not for batch inference
For the Llama runs we do not use the MatText tokenizers, though.
Ah, I see now. There was the issue of the Llama tokenizer not including a pad token, so we set `tokenizer.pad_token = tokenizer.eos_token` (ref):
```python
tokenizer.pad_token = tokenizer.eos_token
```
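For context, a minimal sketch of what that looks like end to end (assuming a HuggingFace `AutoTokenizer`; the checkpoint name is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint for illustration; Llama tokenizers ship without a pad token.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Reuse the EOS token as the pad token so batched encoding works
# without growing the vocabulary.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Batch encoding now pads with the EOS id instead of raising an error.
batch = tokenizer(
    ["first sequence", "a second, longer sequence"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```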
We also tried adding a dedicated pad token, but that resizes the vocabulary and creates a set of problems.
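For reference, the alternative we tried looks roughly like this (a sketch assuming a standard `transformers` setup; model and token names are placeholders), which is where the vocab-resizing problems come from:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Adding a dedicated pad token grows the vocabulary...
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# ...so the embedding matrix has to be resized to match, and the new row
# is randomly initialised, which is where the downstream problems start.
model.resize_token_embeddings(len(tokenizer))
```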