kjappelbaum opened this issue 5 months ago
in Meta's paper (/cc @smiret-intel)
For building the tokenizer we can do two routes:
The second approach will limit generalizability, while the first will give a very large vocabulary. Are there any other considerations that come to mind, @smiret-intel, @n0w0f?
I am looking at the Regression Transformer tokenizer implementation in this branch.
Pros:
Cons:
in https://arxiv.org/pdf/2305.05708.pdf
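For context, the Regression Transformer handles numbers by splitting them into digit-level tokens that also encode each digit's decimal place, which keeps the numeric vocabulary small and fixed. A minimal sketch of that idea is below; the exact token format (`_digit_place_`) is an assumption for illustration, not necessarily what the branch or the paper uses verbatim.

```python
# Hedged sketch of digit-level numeric tokenization in the style of the
# Regression Transformer. The token format "_d_p_" (digit d at decimal
# place p) is an assumption for illustration.

def tokenize_number(value: str) -> list[str]:
    """Split a numeric string into digit tokens annotated with their
    decimal place, e.g. "10.5" -> ["_1_1_", "_0_0_", "_._", "_5_-1_"]."""
    if "." in value:
        integer, fraction = value.split(".")
    else:
        integer, fraction = value, ""
    tokens = []
    for i, digit in enumerate(integer):
        place = len(integer) - 1 - i  # decimal place of this digit
        tokens.append(f"_{digit}_{place}_")
    if fraction:
        tokens.append("_._")
        for i, digit in enumerate(fraction):
            # fractional digits get negative decimal places
            tokens.append(f"_{digit}_{-(i + 1)}_")
    return tokens
```

With this scheme, the numeric vocabulary is bounded (10 digits times the range of decimal places, plus a decimal-point token), which sidesteps the very-large-vocab problem of treating each full number as its own token.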