lamalab-org / xtal2txt

MIT License

Implement "voxelized" tokenizer #49

Open kjappelbaum opened 5 months ago

kjappelbaum commented 5 months ago
[3 screenshots, 2024-04-19]

in https://arxiv.org/pdf/2305.05708.pdf

kjappelbaum commented 5 months ago

in Meta's paper (/cc @smiret-intel)

[Screenshot, 2024-04-19]

kjappelbaum commented 5 months ago

For building the tokenizer, we can take one of two routes:

The second approach will limit generalizability; the first will give a very large vocab. Is there anything else that comes to mind that we should consider, @smiret-intel, @n0w0f?
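
To make the vocabulary-size concern concrete, here is a minimal sketch of binning fractional coordinates into voxels and emitting tokens in two ways. It is only an illustration under my own assumptions, not the scheme from either paper and not necessarily either route above: the function names, the `<vox_…>` / `<x_…>` token formats, and the choice of a uniform grid over fractional coordinates are all hypothetical.

```python
from typing import List, Tuple


def voxel_index(frac: Tuple[float, float, float], n_bins: int) -> Tuple[int, int, int]:
    """Map fractional coordinates to integer bin indices in [0, n_bins)."""
    # Wrap into [0, 1) to respect periodic boundaries, then bin uniformly.
    # The min() guards against a wrapped value rounding up to exactly 1.0.
    return tuple(min(int((x % 1.0) * n_bins), n_bins - 1) for x in frac)


def joint_token(idx: Tuple[int, int, int], n_bins: int) -> str:
    """One token per voxel: the vocabulary grows as n_bins ** 3."""
    i, j, k = idx
    return f"<vox_{(i * n_bins + j) * n_bins + k}>"


def per_axis_tokens(idx: Tuple[int, int, int]) -> List[str]:
    """One token per axis: the vocabulary grows as 3 * n_bins, sequences are 3x longer."""
    i, j, k = idx
    return [f"<x_{i}>", f"<y_{j}>", f"<z_{k}>"]


def vocab_sizes(n_bins: int) -> Tuple[int, int]:
    """Return (joint vocabulary size, per-axis vocabulary size) for the position tokens."""
    return n_bins ** 3, 3 * n_bins


if __name__ == "__main__":
    idx = voxel_index((0.5, 0.25, 0.75), n_bins=4)
    print(idx, joint_token(idx, 4), per_axis_tokens(idx))
    for n in (10, 50, 100):
        print(n, vocab_sizes(n))
```

With a joint token per voxel, even a modest grid of 100 bins per axis needs a million position tokens, while per-axis tokens need only 300 at the cost of longer sequences. Element tokens would come on top of this in either case.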

n0w0f commented 5 months ago

I am looking at the Regression Transformer tokenizer implementation in this branch.

Pros:

Cons: