songlab-cal / tape-neurips2019

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)
https://arxiv.org/abs/1906.08230
MIT License

Use h5py for output data writing and consolidation to reduce memory footprint #10

Closed thomas-a-neil closed 4 years ago

thomas-a-neil commented 5 years ago

This builds on https://github.com/CannyLab/rinokeras/pull/12. Currently, the data consolidation step reads the entire output dataset into memory, which crashes even for relatively small datasets if we include all encoder outputs, especially for the LSTM.

HDF5 lets us write iteratively and avoid the memory overhead of pickle.
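
For illustration, here is a minimal sketch of the iterative-write pattern with h5py. It is not the code from this PR: the function name `write_outputs_incrementally`, the dataset name `encoder_output`, and the assumption that every batch shares the same trailing shape are all hypothetical. Variable-length sequence outputs would need padding or one dataset per sequence.

```python
import os
import tempfile

import h5py
import numpy as np


def write_outputs_incrementally(path, batches, dataset_name="encoder_output"):
    """Append batches to a resizable HDF5 dataset one at a time.

    Hypothetical helper: assumes each batch is array-like with shape
    (n_i, ...) and that all batches share the same trailing dimensions.
    Returns the total number of rows written.
    """
    total = 0
    with h5py.File(path, "w") as f:
        dset = None
        for batch in batches:
            batch = np.asarray(batch, dtype=np.float32)
            if dset is None:
                # Start empty along axis 0 and allow unlimited growth there.
                dset = f.create_dataset(
                    dataset_name,
                    shape=(0,) + batch.shape[1:],
                    maxshape=(None,) + batch.shape[1:],
                    dtype=batch.dtype,
                    chunks=True,
                )
            start = dset.shape[0]
            dset.resize(start + batch.shape[0], axis=0)
            dset[start:] = batch  # only this batch is held in memory
            total += batch.shape[0]
    return total


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "outputs.h5")
    # A generator stands in for model outputs produced batch by batch.
    fake_batches = (np.full((4, 8), i) for i in range(3))
    n_rows = write_outputs_incrementally(out_path, fake_batches)
    with h5py.File(out_path, "r") as f:
        print(n_rows, f["encoder_output"].shape)
```

Because the dataset is resized and written per batch, peak memory is bounded by the size of one batch rather than the whole output set, which is the property the pickle-based consolidation lacks.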

Upon reflection, the documentation should probably be updated as well, because I think we reference pickle in a few places.

thomas-a-neil commented 5 years ago

This should also help with https://github.com/songlab-cal/tape/issues/8

rmrao commented 5 years ago

Should we merge this? I don't think the rinokeras changes have been merged to master?

thomas-a-neil commented 5 years ago

It depends on rinokeras changes, so I don't think we can merge it yet.

rmrao commented 4 years ago

Closing since both this and rinokeras are in basic maintenance mode now, so no major changes will be made.