How to determine the token of a new data set

I collected a large SMILES data set and want to build the model from scratch. I extracted the unique characters across all SMILES strings, which gives: #%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy
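A minimal sketch of how such a character inventory could be computed, assuming the SMILES strings are already loaded into a Python list. The helper name `unique_characters` and the toy input `example_smiles` are made up for illustration and are not part of ReLeaSE. Counting occurrences as well as collecting the unique characters makes it easier to spot rare characters that may come from malformed entries.

```python
from collections import Counter


def unique_characters(smiles_list):
    """Return the sorted unique characters and per-character counts.

    Hypothetical helper for illustration; not part of ReLeaSE.
    """
    counts = Counter()
    for smi in smiles_list:
        # Counter.update on a string counts each character separately.
        counts.update(smi)
    return sorted(counts), counts


# Toy input (an assumption); a real run would pass the full data set.
example_smiles = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O"]
chars, counts = unique_characters(example_smiles)
print("".join(chars))
print(counts.most_common(3))
```

Note that this works at the character level, so two-letter atoms such as `Cl` or `Br` appear as two separate characters, just as in the lists in this issue.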

But in your `JAK2_min_max_demo.ipynb` I see:

tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']','\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

Then I read the SMILES data file you provided, chembl_22_clean_1576904_sorted_std_final.smi, and extracted its unique characters. But the resulting token set is not equal to the one in `JAK2_min_max_demo.ipynb`:

chem_smiles = read_smi_file("ReLeaSE/data/chembl_22_clean_1576904_sorted_std_final.smi")
ch_smiles = [i.split("\t")[0] for i in chem_smiles[0]]

tokens2 = list(set(''.join(ch_smiles)))
tokens2 = list(np.sort(tokens2))
tokens2 = ''.join(tokens2)

The tokens2 result is: #%()+-./0123456789=BCFHINOPS[\\]clnoprs

Apart from `<` and `>`, which denote the beginning and end of a sequence, token1 (the notebook list) and tokens2 are not equal. Why is that? What processing did you apply to chembl_22_clean_1576904_sorted_std_final.smi?
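To make the mismatch concrete, here is a small sketch that compares the two sets directly. The list is copied from the notebook and the string is the result reported above, on the assumption that `\\` stands for a single backslash character. The helper name `compare_token_sets` is made up for illustration.

```python
# Token list as it appears in JAK2_min_max_demo.ipynb (copied from above).
notebook_tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0',
                   '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C',
                   'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']', '\\', 'c',
                   'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

# Unique characters found in the .smi file (the result reported above).
# Assumption: "\\" in the printed result is a single backslash.
file_chars = "#%()+-./0123456789=BCFHINOPS[\\]clnoprs"


def compare_token_sets(a, b):
    """Return (items only in a, items only in b), each sorted.

    Hypothetical helper for illustration.
    """
    set_a, set_b = set(a), set(b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


only_in_notebook, only_in_file = compare_token_sets(notebook_tokens, file_chars)
print(only_in_notebook)  # ['\n', '<', '>', '@', 'A', 'e', 'i']
print(only_in_file)      # []
```

So besides the start and end markers, the notebook list additionally contains `@`, `A`, `e`, `i` and the newline character, while every character found in the file is already in the notebook list.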

Can you give me more guidance? Thank you very much.

isayev / ReLeaSE

How to determine the token of a new data set #34