How do you tokenize the SMILES representation?

jrwnter / cddd

Implementation of the Paper "Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations" by Robin Winter, Floriane Montanari, Frank Noe and Djork-Arne Clevert.

MIT License

220 stars 72 forks

How do you tokenize the SMILES representation? #5

Closed Dinxin closed 5 years ago

jrwnter commented 5 years ago

Tokenization is done here: https://github.com/jrwnter/cddd/blob/8c1f90d1d927962658be6b9086f2cb5fa3403ae9/cddd/input_pipeline.py#L139-L150 The SMILES string is split into individual tokens such as C, N, Cl, 1, [, etc. Each of these tokens is then one-hot encoded.
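
For readers who want to see the idea without opening the repository, here is a minimal sketch of that kind of tokenization followed by one-hot encoding. It is not the code from `input_pipeline.py`: the regular expression, the function names `tokenize` and `one_hot`, and the way the vocabulary is built from a single example are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern (an assumption, not the pattern used in cddd):
# match the two-letter halogens first, then fall back to any single character.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def one_hot(tokens, vocab):
    """Return one row per token, with a single 1 at the token's vocabulary index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[index[tok]] = 1
        rows.append(row)
    return rows


tokens = tokenize("ClCC[NH3+]")
# Illustrative vocabulary built from this one string; a real model would
# use a fixed vocabulary covering every token seen in training.
vocab = sorted(set(tokens))
matrix = one_hot(tokens, vocab)

print(tokens)
print(len(matrix), "tokens x", len(vocab), "vocabulary entries")
```

Listing `Cl` and `Br` before the catch-all `.` is what keeps chlorine from being read as a carbon followed by a stray `l`. Any other multi-character tokens a model needs would have to be added to the pattern in the same way.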