Hi ChemBERTa team🤗, I ran into a problem when tokenizing a SMILES sequence. I noticed that the SMILES in your example also contains a `Cl` atom, so I would like to know whether you have seen the same issue.
The input sequence is `COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl`
However, the tokenizer output incorrectly represents each `Cl` as `C`, and I don't know how to fix it. I also checked the ChemBERTa vocabulary, and it does contain a 'Cl' token.
The output of `tokenize()` is as follows:
['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']
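For comparison, here is the split I expected, produced by a minimal regex-based sketch. The pattern and the `tokenize_smiles` helper are my own illustration and are not ChemBERTa's tokenizer; the only assumption is that two-letter halogens (`Cl`, `Br`) should be matched before single-letter atoms.

```python
import re

# Hypothetical illustration, not the ChemBERTa tokenizer.
# Alternatives are tried left to right, so "Cl" and "Br" are matched
# before the single-letter fallback can split them into "C" + "l".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+]
    r"|Br|Cl"                # two-letter halogens
    r"|%\d{2}"               # two-digit ring closures such as %12
    r"|[A-Za-z]"             # single-letter atoms (aliphatic or aromatic)
    r"|\d"                   # single-digit ring closures
    r"|[=#\-+()\\/.:~@*$]"   # bonds, branches and other symbols
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, keeping Cl and Br intact."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


smiles = "COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"
tokens = tokenize_smiles(smiles)
print(tokens)
print(tokens.count("Cl"))
```

With this split the three chlorine atoms stay as `Cl` tokens, whereas in the output above each of them appears as `C` and the `l` is lost.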
Originally posted by @Chris-Tang6 in https://github.com/seyonechithrananda/bert-loves-chemistry/issues/58#issuecomment-1784738131