Training and canonicalization

I've been training on the ZINC database, but when I decode generated vectors into SMILES strings, I pass them through RDKit's canonicalizer. It looks to me like ZINC used a different canonicalization mechanism.

It's not at all clear to me whether this makes any difference. Obviously, if you train on a set generated by one canonicalizer, the system will learn to produce outputs that are consistent with that canonicalizer. In a sense, this is a bias (the canonicalization rules are based mostly on graph theory rather than chemistry).
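
One way to check how far apart the two conventions are would be to count how many training strings RDKit rewrites when asked to canonicalize them. Below is a rough sketch, not code from this repo: `fraction_changed` and `rdkit_canonicalize` are hypothetical helper names, and it assumes the ZINC SMILES are already loaded as a list of strings.

```python
from typing import Callable, Iterable, Optional


def fraction_changed(
    smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> float:
    """Return the fraction of parseable strings that the canonicalizer rewrites."""
    changed = 0
    total = 0
    for s in smiles:
        canon = canonicalize(s)
        if canon is None:
            # The canonicalizer could not parse this string; leave it out.
            continue
        total += 1
        if canon != s:
            changed += 1
    return changed / total if total else 0.0


def rdkit_canonicalize(s: str) -> Optional[str]:
    """Canonicalize with RDKit; returns None if the SMILES does not parse."""
    # Imported lazily so fraction_changed itself has no RDKit dependency.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(s)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

Calling `fraction_changed(zinc_smiles, rdkit_canonicalize)` on a sample of the training set (where `zinc_smiles` is a placeholder for that list) would give a number near 0 if the ZINC strings already match RDKit's canonical form, and a larger number if the two mechanisms really do disagree.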

