maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.
MIT License

Preprocess uses too much RAM #35

Open dakoner opened 7 years ago

dakoner commented 7 years ago

preprocess.py loads the entire un-preprocessed dataset into RAM and then does transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it just about fills up my 64GB machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take work to make all the steps incremental.
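
For illustration, a minimal sketch of the kind of incremental approach being asked for, assuming a hypothetical one_hot_encode helper and a toy charset (neither is from the repo); it streams SMILES from the gzipped file and appends encoded chunks to a resizable HDF5 dataset instead of holding everything in memory:

```python
import gzip

import h5py
import numpy as np

# Hypothetical settings; the real charset and length come from the dataset itself.
CHARSET = [' ', 'C', 'N', 'O', 'c', 'n', 'o', '(', ')', '=', '#', '1', '2', '3']
MAX_LEN = 120
CHUNK = 10000  # SMILES strings encoded per write

def one_hot_encode(smiles):
    # Hypothetical helper: pad with spaces and one-hot encode each character.
    # Characters outside the toy charset fall back to the padding slot,
    # purely to keep the sketch short.
    vec = np.zeros((MAX_LEN, len(CHARSET)), dtype=np.float32)
    for i, ch in enumerate(smiles[:MAX_LEN].ljust(MAX_LEN)):
        vec[i, CHARSET.index(ch) if ch in CHARSET else 0] = 1.0
    return vec

with gzip.open('GDB17.50000000.smi.gz', 'rt') as fin, \
        h5py.File('processed.h5', 'w') as fout:
    dset = fout.create_dataset('data', shape=(0, MAX_LEN, len(CHARSET)),
                               maxshape=(None, MAX_LEN, len(CHARSET)),
                               dtype=np.float32, chunks=True)
    buf = []
    for line in fin:
        smiles = line.strip().split()[0] if line.strip() else None
        if smiles:
            buf.append(one_hot_encode(smiles))
        if len(buf) == CHUNK:
            n = dset.shape[0]
            dset.resize(n + len(buf), axis=0)
            dset[n:] = np.stack(buf)
            buf = []
    if buf:  # flush the final partial chunk
        n = dset.shape[0]
        dset.resize(n + len(buf), axis=0)
        dset[n:] = np.stack(buf)
```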

pechersky commented 7 years ago

Would we be alright with switching to generator-based training? That gets rid of the need to preprocess, and the need to compress as well.

maxhodak commented 7 years ago

sure! send a pull request!

pechersky commented 7 years ago

@dakoner can you provide a link to the 50M GDB-17 dataset you're using?

dakoner commented 7 years ago

http://gdb.unibe.ch/downloads/, specifically http://gdb.unibe.ch/cdn/GDB17.50000000.smi.gz

pechersky commented 7 years ago

The following branch should be able to train using a stream-based approach, requiring way less RAM. It also provides a solution for issue #39.

Please test it out -- you'll need to edit train_gen.py to pass in some iterable of SMILES strings. Right now it expects an h5 file with a structure column, but it could just as well be a plain gzipped text file instead.

https://github.com/pechersky/keras-molecules/tree/stream-process
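
For concreteness, a sketch of the kind of iterable this expects; reading SMILES from a gzipped text file (rather than the h5 structure column) is an assumption here, and the names are not from the branch:

```python
import gzip

def smiles_iter(path):
    """Yield one SMILES string per line of a gzipped text file."""
    with gzip.open(path, 'rt') as f:
        for line in f:
            line = line.strip()
            if line:
                yield line.split()[0]  # GDB lines may carry extra columns

# train_gen.py could then be handed something like:
# structures = list(smiles_iter('GDB17.50000000.smi.gz'))
```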

dakoner commented 7 years ago

Yakov,

Do I understand correctly that there is now no need to preprocess the SMILES input at all? I.e., this code buffers the SMILES input in RAM (in pandas) and then samples from the data at runtime?

I am curious about the memory overhead of storing HDF5/pandas data vs. an array of strings.

dakoner commented 7 years ago

I had to make a few changes to train_gen.py to get things to work (see my comments on your commit).

I can confirm it starts training w/o any preprocessing steps.

dakoner commented 7 years ago

This code doesn't seem to train up as quickly as the old train.py does. It gets stuck at an accuracy of about 0.07, while the old train.py jumps to 50% within a few batches.

I am seeing a warning at the end of the epoch:

/home/dek/keras-molecules/env/src/keras/keras/engine/training.py:1480: UserWarning: Epoch comprised more than samples_per_epoch samples, which might affect learning results. Set samples_per_epoch correctly to avoid this warning. warnings.warn('Epoch comprised more than '

when I set --epoch_size to 50000

pechersky commented 7 years ago

You might have gotten the epoch warning if your batch_size doesn't cleanly divide epoch_size.

Thanks for your comments here and on the commit. Could you share the command you were using to run the old train.py? The only thing I can think of that is different in this case is that sampling is with replacement -- the code as I currently have it ignores the weights during training.
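
As a quick illustration of the divisibility point (the numbers are only examples, not the script's defaults):

```python
batch_size = 300
epoch_size = 50000

# Keras 1.x fit_generator warns when samples_per_epoch is not a multiple of
# batch_size, because the generator always delivers whole batches.
if epoch_size % batch_size != 0:
    epoch_size = (epoch_size // batch_size) * batch_size

print(epoch_size)  # 49800 -- the nearest multiple of 300 below 50000
```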

pechersky commented 7 years ago

@dakoner There was a bug in encoding: it wasn't properly encoding padded words. I've also fixed the bugs you pointed out. Now train_gen quickly reaches >60% accuracy within the first epoch.

dakoner commented 7 years ago

Thanks. I changed batch_size and the error went away.

Also, I pulled your newer code and can confirm the training is working as expected.

dakoner commented 7 years ago

Yakov, I've got a question about how the new generator-based training works.

IIUC it loads the entire dataset into RAM, defines an index point in the dataset, and then returns samples from before or after the index point depending on whether a training or test sample is requested.

If you set an epoch size larger than the dataset size, doesn't that mean no test samples get returned?
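
A minimal sketch of the scheme being described here, with sampling done with replacement as explained in the reply below (the names are illustrative, not the branch's actual code):

```python
import numpy as np

def make_generators(smiles_list, batch_size=300, test_split=0.20):
    """Split at an index point; train batches come from before it, test batches after."""
    split = int(len(smiles_list) * (1.0 - test_split))

    def sample(lo, hi):
        while True:
            idx = np.random.randint(lo, hi, size=batch_size)  # with replacement
            # (a real training generator would also one-hot encode before yielding)
            yield [smiles_list[i] for i in idx]

    return sample(0, split), sample(split, len(smiles_list))

# train_gen, test_gen = make_generators(structures)
```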

pechersky commented 7 years ago

The sampling is with replacement, so any epoch size can be used. I chose sampling with replacement so that the generator has as little state as possible. For some of the datasets I am trying, the number of samples is so much larger than any reasonable epoch size that I rely on random sampling for each epoch, with no epoch guaranteed to contain the same samples as a previous one.

I am working on extending the CanonicalSmilesDataGenerator to take a SMILES string and create non-canonical representations of the same compound, similar to how image data generators add noise through rotation and so on. This will create an even larger dataset (finite, yet difficult to enumerate). In that case, sampling with replacement is important, since you'd be fine with having two different representations of the same compound in a single epoch.
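
One way to produce such non-canonical variants, assuming RDKit is available (this is a sketch of the idea, not the generator's actual implementation):

```python
import random
from rdkit import Chem

def random_smiles(smiles, n=5):
    """Return up to n distinct non-canonical SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    atom_ids = list(range(mol.GetNumAtoms()))
    for _ in range(n):
        random.shuffle(atom_ids)
        shuffled = Chem.RenumberAtoms(mol, atom_ids)  # random atom ordering
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)

print(random_smiles('CC(=O)Oc1ccccc1C(=O)O'))  # aspirin, several equivalent strings
```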

dakoner commented 7 years ago

Great. I am currently training on a 35K SMILES-string dataset, and I set --epoch_size to 35K. TensorFlow says it's training on 35K samples; does that mean it's really sampling from a training set of 35K/2 items and a test set of 35K/2 items?

pechersky commented 7 years ago

In molecules.vectorizer, SmilesDataGenerator takes an optional test_split parameter that creates the "index point" you mentioned. By default it is 0.20, so 4/5 of the data is used for training and 1/5 for testing. The version you pulled has a hard-coded epoch size for testing -- I just pushed a new version that scales the test-epoch size based on this test_split param.

I'm also making a PR with that branch. Thank you for testing this, @dakoner. @maxhodak, if you could also test it and validate it for your purposes, especially the new sampling code, that would be great. I'd only like to merge it in if it works better than our current approaches (for processing, training, testing, sampling) in terms of memory load and ease of use.

pechersky commented 7 years ago

I should add that if you are training on 35K, and you assume the default test_split=0.20, then your "true" effective training set size is 28K. That's the epoch size you'll want to use, which will also give you a test-epoch of 7K.
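
In concrete numbers (just restating the paragraph above):

```python
dataset_size = 35000
test_split = 0.20

train_epoch_size = int(dataset_size * (1 - test_split))  # 28000
test_epoch_size = dataset_size - train_epoch_size        # 7000
```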

dakoner commented 7 years ago

BTW, you should be able to use the smilesparser to generate non-canonical SMILES strings from canonical ones. The operation set includes permuting the order of branches, starting the string from various terminals, etc.

dakoner commented 7 years ago

In this case, I'm training on a file with 1.3M compounds. I believe your comment only applies if I'm providing a small input file and setting --epoch_size to a value close to the input file size... right? Anyway, it seems like I should just pass --test_split as a parameter and let the code figure out everything else.

pechersky commented 7 years ago

I've just pushed a change that allows you to pass test_split as an optional command-line argument. The test-epoch size will be equal to epoch_size / int((1 - test_split) / test_split). The comment I made previously isn't so important with large dataset sizes, I guess. Still, with a test_split of 0.20 the test-epoch size will be a quarter of the train-epoch size, not a fifth.
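
Plugging the default split into that formula (just restating the line above):

```python
epoch_size = 28000
test_split = 0.20

# int((1 - 0.20) / 0.20) == 4, so the test epoch works out to 7000 here:
# a quarter of the train epoch, not a fifth.
test_epoch_size = epoch_size / int((1 - test_split) / test_split)
```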

dakoner commented 7 years ago

FWIW, I spent about 3 days training a model with what I pulled from your fork (as of 4 days ago). It got up to about 95%; I didn't go any further, but I believe that with another day or two of training it would reach the same level the older code did (99.5% accuracy for my dataset).

So I don't think there are any functional regressions associated with the generator-based training, and I think it has the big advantage of not needing two large .h5 files.

I would prefer that this, or something similar, gets merged.

pechersky commented 7 years ago

#43 got merged 3 days ago, it seems. After 50 epochs of 600,000 strings (batch size 300) using the generator method, I got 97% accuracy. After 150 epochs, I've plateaued at ~98.5% accuracy. Does this also fix #38, do you think? I don't really have a validation set, so I can't say -- but I do notice that molecules starting with N still do not get autoencoded to results starting with N.