maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.
MIT License

training accuracy @ 99.99%, validation never goes above 97% #38

Closed dakoner closed 7 years ago

dakoner commented 7 years ago

This is probably more of a general machine-learning question, but when training keras-molecules on a small input set (~50K strings), training accuracy eventually reaches 99.99% (loss ~0.01), while validation accuracy never exceeds 97%.

My feeling is that this is a sign of overfitting to the training data. The model has become effectively perfect at reconstructing the input SMILES strings but can't generalize to the odd cases in the test set that have no analogs in the training set.

I'm not sure how to debug this, or whether it even represents a bug.
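For what it's worth, a minimal sketch of how one might watch that train/validation gap with stock Keras callbacks; `model`, `x_train`, and `x_val` are placeholders here, not names from this repo:

```python
# Hypothetical sketch: stop on the validation metric instead of chasing
# 99.99% training accuracy, and keep the best-validation weights.
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor='val_acc', patience=10),
    ModelCheckpoint('best_weights.h5', monitor='val_acc', save_best_only=True),
]
model.fit(x_train, x_train,                  # autoencoder: target == input
          validation_data=(x_val, x_val),
          nb_epoch=500, callbacks=callbacks)
```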

pechersky commented 7 years ago

I think we should have a conversation about what "accuracy" really means here. Should we be counting the spaces that pad the string? How many of those spaces should count?

I have been training a particular model on ~32 million strings, and I am still unable to get good results on SMILES that start with N, for example. This is not a bug; it is an artifact of not feeding in enough such data, or of not weighting the sampling enough to prioritize fitting it.
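One way to make the metric reflect only the real characters: a hedged sketch of a padding-masked accuracy, assuming one-hot targets of shape (batch, 120, charset) and a pad character at index `PAD_IDX` (both assumptions, not code from this repo):

```python
from keras import backend as K

PAD_IDX = 0  # assumed index of the padding "space" in the charset

def non_pad_accuracy(y_true, y_pred):
    # compare predicted vs. true character ids at each of the 120 positions
    true_ids = K.argmax(y_true, axis=-1)
    pred_ids = K.argmax(y_pred, axis=-1)
    # mask out positions whose true character is padding
    mask = K.cast(K.not_equal(true_ids, PAD_IDX), K.floatx())
    correct = K.cast(K.equal(true_ids, pred_ids), K.floatx()) * mask
    # average only over the non-padding characters
    return K.sum(correct) / K.maximum(K.sum(mask), 1.0)
```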

dakoner commented 7 years ago

I believe the "accuracy" reported by the training system is evaluated over the entire string (120 characters). To me this feels correct, both because of the constraint of the encoding and because poorly trained networks will actually produce strings with non-space characters where the input string had spaces, or vice versa. For example, by the time the model is at about 50-60% accuracy, it is already predicting strings whose length matches the average length of the SMILES strings in the training set.

I haven't looked at the models to see whether particular patterns are poorly predicted, as in your case. I checked ZINC, and not a single SMILES string starts with N (this may actually be expected, depending on the canonicalization?).

Also, I think the short answer to my question is "use a hyperparameter search": run on progressively larger training sets until you see test accuracy keeping up with training accuracy.
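A hedged sketch of that check, assuming the model is compiled with an accuracy metric; `build_model`, `x_all`, and `x_val` are placeholders:

```python
# Learning-curve check: retrain on growing subsets and watch whether
# the train/validation gap closes as more data is added.
for n in (10000, 25000, 50000):
    model = build_model()
    h = model.fit(x_all[:n], x_all[:n],
                  validation_data=(x_val, x_val), nb_epoch=50)
    gap = h.history['acc'][-1] - h.history['val_acc'][-1]
    print(n, 'train/val accuracy gap:', gap)
```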


pechersky commented 7 years ago

Allowing the model to compute loss over all of the padded spaces leads it to prefer getting the length and spacing right over the actual string, in my experience. Since a SMILES string is built up left-to-right, the leftmost symbols contribute the most to whether it will be valid. Because of that, I don't think spaces beyond a certain level of padding should be used to calculate accuracy or loss, since one can always clip the output at the first space, or at the first two adjacent spaces, and so on.
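A minimal sketch of that clipping rule, as plain Python over the decoded string (`decoded` is a placeholder):

```python
def clip_at_padding(decoded, pad=' '):
    # cut at the first run of two adjacent pad characters,
    # otherwise just strip trailing padding
    i = decoded.find(pad * 2)
    return decoded[:i] if i != -1 else decoded.rstrip(pad)
```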

Here is a simple example of a canonical SMILES that begins with N: Nc1cncc(N)n1.
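(The thread doesn't say which toolkit produced that form; assuming RDKit is available, one can round-trip it to check:)

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('Nc1cncc(N)n1')
# prints RDKit's canonical form; whether it starts with 'N' depends
# on the toolkit and its canonicalization settings
print(Chem.MolToSmiles(mol))
```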

dakoner commented 7 years ago

Hmm, I agree that measuring accuracy the way we do forces the network to learn the length bias first.

From my analysis of autoencoding errors, I haven't seen the errors distributed preferentially in any particular location; the problems look more like violations of valence rules or of parenthesis nesting in the middle of the molecule.

BTW, I was thinking that "center padding", rather than left justification, would work better because it places the zero-point of each string in the middle of the padding.
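A quick sketch of the two layouts for a fixed 120-character window (plain Python, not repo code):

```python
def pad_left_justified(s, width=120):
    return s.ljust(width)    # 'CCO' -> 'CCO         ...'

def pad_centered(s, width=120):
    return s.center(width)   # 'CCO' -> '   ...CCO...   '
```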


pechersky commented 7 years ago

I like the "center padding" idea. In fact, if we switch to a generator-based system, we can expand the data by generating all possible left paddings. In general, SMILES strings are valid with any amount of whitespace surrounding them -- perhaps having the network learn all possible positions of the symbols will remove "position-dependent" effects, like expecting a particular substring only at positions x through y.
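A minimal sketch of that generator idea, yielding every left-padding of a string inside the fixed window (placeholder code, not from this repo):

```python
def padded_variants(smiles, width=120, pad=' '):
    # one variant per possible left offset; padding fills the rest
    for left in range(width - len(smiles) + 1):
        yield pad * left + smiles + pad * (width - len(smiles) - left)

# e.g. list(padded_variants('CCO', width=5))
# -> ['CCO  ', ' CCO ', '  CCO']
```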

dakoner commented 7 years ago

As a follow-up, I've tested center padding. The model trains up to about the same level of accuracy with no other real differences, so I don't think it represents an improvement. Unfortunately, it takes my GTX 1080 over 500 epochs and 3+ days to reach 97%+ accuracy.


dakoner commented 7 years ago

I think this is just a training-set-size issue, nothing specific to this implementation.

michaelosthege commented 7 years ago

@dakoner how big was the training set that took so long to train? I'm currently at around 66% accuracy after 380,000 training examples.

dakoner commented 7 years ago

I can't recall whether it was the subset of GDB-17 or, more likely, the "drug-like clean" subset of ZINC.
