seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.
MIT License

Tokenization incorrect when getting pretrained features #60

Open Chris-Tang6 opened 9 months ago

Chris-Tang6 commented 9 months ago

Hi ChemBERTa team 🤗, I ran into a problem when tokenizing a SMILES sequence. I noticed that the SMILES in your example also contains a Cl atom, so I would like to know whether you have seen the same problem.

The input sequence is COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl. However, the tokenizer output incorrectly labels each Cl as C, and I don't know how to fix it. I also found that the ChemBERTa token table does contain the 'Cl' token. [image]

The output of tokenize() is as follows: ['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']
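
For comparison, here is a minimal sketch of a regex-based SMILES tokenizer that keeps two-letter atoms such as Cl and Br, and bracket atoms, as single tokens. It is not the ChemBERTa tokenizer and does not explain the behavior above. The pattern and the name `tokenize_smiles` are assumptions for illustration, and the pattern covers only a common subset of SMILES syntax. It may be useful for checking what the token sequence should look like.

```python
import re

# Hypothetical pattern (not taken from ChemBERTa). Alternatives are tried in order,
# so bracket atoms and the two-letter atoms Br/Cl match before single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"            # bracket atoms, e.g. [Zn], [NH4+]
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|[BCNOSPFIbcnosp]"      # one-letter aliphatic and aromatic atoms
    r"|%[0-9]{2}|[0-9]"       # ring-closure labels
    r"|[()=#\-+\\/:~@?>*$.])"  # branches, bonds, and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize the whole string: {smiles!r}")
    return tokens


smiles = "COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"
tokens = tokenize_smiles(smiles)
print(tokens[-5:])  # ['Cl', ')', 'Cl', ')', 'Cl']
```

With this sketch, the three chlorine atoms in the input come out as three 'Cl' tokens instead of 'C', which is the output one would expect from a tokenizer whose vocabulary contains 'Cl'.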

Originally posted by @Chris-Tang6 in https://github.com/seyonechithrananda/bert-loves-chemistry/issues/58#issuecomment-1784738131

MrsW6 commented 8 months ago

Yes, same problem here.

Lawwwwo commented 1 month ago

Same problem for me. It is not only 'Cl' becoming 'C'; for example, '[Zn]' becomes 'n', etc.