robinlingwood / BIMODAL

Tokens for double-letter atoms (Cl and Br) #6

Open albertma-evotec opened 4 years ago

albertma-evotec commented 4 years ago

Hi,

The model is currently working fine for me, but I am curious how double-letter atoms (such as Cl and Br) are handled during encoding and decoding. I have looked at the one_hot_encoder module, and it seems they are treated as two tokens (e.g. "C" and "l" for a chlorine atom). Please correct me if I am wrong: I could not find them being handled the way I expected, i.e. by replacing each double-letter atom with a single dummy character before one-hot encoding. If chlorine is indeed treated as two tokens, wouldn't that confuse the network, since the "C" token would then be shared with aliphatic carbon?

Albert

robinlingwood commented 4 years ago

Hi Albert,

Yes indeed, that is certainly something one could try to improve the performance.
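
For anyone who wants to try this, here is a minimal sketch of the substitution described above. It is not taken from the BIMODAL code: the function names and the placeholder characters "L" and "R" are hypothetical, and it assumes those two characters do not otherwise occur in the SMILES strings of the data set (they would clash with bracket atoms such as [Li] or [Ru]).

```python
# Hypothetical mapping from double-letter atoms to single placeholder characters.
# Assumption: "L" and "R" never appear in the SMILES strings being encoded.
REPLACEMENTS = {"Cl": "L", "Br": "R"}


def to_single_char(smiles):
    """Replace each double-letter atom with its one-character placeholder."""
    for atom, placeholder in REPLACEMENTS.items():
        smiles = smiles.replace(atom, placeholder)
    return smiles


def from_single_char(encoded):
    """Undo to_single_char, restoring the original atom symbols."""
    for atom, placeholder in REPLACEMENTS.items():
        encoded = encoded.replace(placeholder, atom)
    return encoded


def one_hot(encoded, vocabulary):
    """Return one row per character, with a single 1 at that character's index."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for token in encoded:
        row = [0] * len(vocabulary)
        row[index[token]] = 1
        rows.append(row)
    return rows


smiles = "Clc1ccc(Br)cc1"
encoded = to_single_char(smiles)       # chlorine and bromine are now one token each
vocabulary = sorted(set(encoded))      # toy vocabulary built from this one string
matrix = one_hot(encoded, vocabulary)
restored = from_single_char(encoded)   # apply after decoding generated strings
```

With this preprocessing, chlorine no longer shares the "C" token with aliphatic carbon, and each sequence is shorter by one position per Cl or Br atom. The reverse mapping has to be applied to sampled strings before they are parsed as SMILES.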