lamalab-org / MatText

Text-based modeling of materials.
https://lamalab-org.github.io/MatText/
MIT License

are our tokenizers initialized correctly? #99

Open kjappelbaum opened 3 months ago

kjappelbaum commented 3 months ago

perhaps not for batch inference

n0w0f commented 3 months ago

For the Llama runs we do not use MatText tokenizers, though.

n0w0f commented 3 months ago

Ah, I see now. There was this issue of the Llama tokenizer not shipping with a pad token, so we set `tokenizer.pad_token = tokenizer.eos_token` ref.

We also tried adding a dedicated pad token, but that resized the vocabulary and created a new set of problems.
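A minimal sketch of the workaround described above, using `gpt2` as a small stand-in checkpoint (the Llama tokenizer similarly ships without a pad token, but its weights are gated); the exact MatText call site may differ:

```python
from transformers import AutoTokenizer

# gpt2 is a stand-in here: like the Llama tokenizer, it has no pad token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
assert tokenizer.pad_token is None  # no pad token out of the box

# Reuse the eos token as the pad token instead of adding a new token,
# which would grow the vocabulary and force an embedding resize.
tokenizer.pad_token = tokenizer.eos_token

# Padding now works for batched inputs of unequal length; padded
# positions are zeroed out in the attention mask.
batch = tokenizer(["short", "a somewhat longer input"], padding=True)
assert len(batch["input_ids"][0]) == len(batch["input_ids"][1])
```

Reusing eos as pad keeps the embedding matrix unchanged, which is why it avoids the resize problems mentioned above; the trade-off is that eos and pad become indistinguishable at the token-id level, which matters for batched generation but not for the serial interface.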

kjappelbaum commented 2 months ago

This is not an issue for the serial interface that is in the code at the moment. For batched inference it might become important in the future.