The unnatural encoding of current implementation

maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.

MIT License

519 stars 146 forks source link

The unnatural encoding of current implementation #54

Open hsiaoyi0504 opened 7 years ago

hsiaoyi0504 commented 7 years ago

After testing, I found that the procedure of building a list of the unique characters used in the dataset (The "charset") is wired. Current encoding will make the resulting output much fragile, because we didn't avoid the situation of Cl interpreted as "C", "l". For example, we should treat 'Cl' as independent character rather than 'C' and 'l' directly. It chemically unreasonable to see 'l' along.

grayfall commented 7 years ago

Have this problem ever been addressed? Apart from this the charsets are not stable between different training datasets, yielding incompatible models.

pechersky commented 7 years ago

I suggest checking out the paper and repo I cite in #62. It also has pretrained models if you need that.

grayfall commented 7 years ago

@pechersky do you accept pull requests? I've made some improvements to your preprocessing routine and the CLI. Most importantly, I changed the parsing scheme to address the issues mentioned here.

pechersky commented 7 years ago

Yeah, go ahead and make a PR.

On Thu, Mar 30, 2017 at 8:48 AM, Eli notifications@github.com wrote:

@pechersky https://github.com/pechersky do you accept pull requests? I've made some improvements to your preprocessing routine and the CLI. Most importantly, I changed the parsing scheme to address the issues mentioned here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/maxhodak/keras-molecules/issues/54#issuecomment-290400244, or mute the thread https://github.com/notifications/unsubscribe-auth/AFGDhiUCRPs3JDSigy6wG3O-EXw4DTFSks5rq6SYgaJpZM4LapUj .