maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.
MIT License
519 stars, 146 forks

Training and canonicalization #39

Open dakoner opened 7 years ago

dakoner commented 7 years ago

My training has been on the ZINC database, but when I decode generated vectors into SMILES strings, I pass them through RDKit's canonicalizer. It looks to me like ZINC used a different canonicalization mechanism.

It's not at all clear to me whether this makes any difference. Obviously, if you train on a set generated by one canonicalizer, the system is going to learn to produce outputs that are consistent with that canonicalizer. In a sense, this is a bias (the canonicalization rules are based mostly on graph theory rather than chemistry).

hsiaoyi0504 commented 7 years ago

I agree that canonicalization is important, and it is probably simple to do the canonicalization with RDKit.
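
A rough sketch of what that could look like, assuming the goal is to rewrite the training SMILES with the same canonicalizer that is applied to decoded outputs. The helper names `rdkit_canonical` and `recanonicalize` are made up for illustration and are not part of this repo; only `Chem.MolFromSmiles` and `Chem.MolToSmiles` are RDKit calls.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    # Imported lazily so the counting logic below can be used without RDKit.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # MolToSmiles produces canonical SMILES by default.
    return Chem.MolToSmiles(mol)


def recanonicalize(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> Tuple[List[str], int, int]:
    """Rewrite each SMILES with a single canonicalizer (hypothetical helper).

    Returns (rewritten strings, number that changed, number that failed to parse).
    """
    rewritten: List[str] = []
    changed = 0
    failed = 0
    for smi in smiles_list:
        canon = canonicalize(smi)
        if canon is None:
            # Unparseable input is dropped and counted, not silently kept.
            failed += 1
            continue
        if canon != smi:
            changed += 1
        rewritten.append(canon)
    return rewritten, changed, failed
```

Running `recanonicalize` over the ZINC training strings before training would also show, through the `changed` count, how often ZINC's form actually differs from RDKit's, which bears on whether the mismatch matters in practice.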