I collected a large SMILES data set and wanted to train the model from scratch. I counted the unique characters across all of my SMILES strings and got the following:
#%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy
Then I read the SMILES data file you provided, `chembl_22_clean_1576904_sorted_std_final.smi`, and extracted its unique characters. However, the resulting tokens are not equal to the tokens in `JAK2_min_max_demo.ipynb`:
import numpy as np
# read_smi_file is the helper used in the ReLeaSE notebooks
chem_smiles = read_smi_file("ReLeaSE/data/chembl_22_clean_1576904_sorted_std_final.smi")
ch_smiles = [i.split("\t")[0] for i in chem_smiles[0]]
tokens2 = list(set(''.join(ch_smiles)))
tokens2 = list(np.sort(tokens2))
tokens2 = ''.join(tokens2)
The token2 result is:
#%()+-./0123456789=BCFHINOPS[\\]clnoprs
Apart from `<` and `>`, which denote the beginning and end of a sequence, token1 and token2 are not equal. Why is that? What did you do with `chembl_22_clean_1576904_sorted_std_final.smi`?
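To make the mismatch concrete, here is a minimal sketch that diffs the two character sets. The strings are copied from the results above; the variable names `my_tokens` and `chembl_tokens` are only illustrative and do not come from the notebook.

```python
# Character sets copied from the results above.
# The variable names are illustrative, not taken from the ReLeaSE code.
my_tokens = r"#%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy"
chembl_tokens = r"#%()+-./0123456789=BCFHINOPS[\\]clnoprs"

# Characters that appear in one set but not in the other.
only_in_mine = sorted(set(my_tokens) - set(chembl_tokens))
only_in_chembl = sorted(set(chembl_tokens) - set(my_tokens))

print("only in my data set:", "".join(only_in_mine))
print("only in the ChEMBL file:", "".join(only_in_chembl))
```

Every character from the ChEMBL file also appears in my data set, so the difference goes in one direction only: my data contains extra characters such as `*`, `:` and `@`.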
Can you give me more guidance? Thank you very much.