harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Memory requirements in preprocessing #72

Closed oraveczcsaba closed 7 years ago

oraveczcsaba commented 7 years ago

The preprocess.py script initializes several matrices for data storage. For large training datasets (we are currently trying to train on some 12M segments) this seems to require a huge amount of memory, especially if we want to use guided alignment. I might be wrong, but I would roughly estimate it at hundreds of GB: alignments = np.zeros((num_sents, newseqlength, newseqlength), dtype=np.uint8) with 12M segments and a maximum length of about 80 tokens per segment.
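
As a rough back-of-the-envelope check of that estimate (a sketch, assuming the 12M segments and ~80-token maximum length above), the dense uint8 alignment tensor alone already comes to roughly 70 GiB, before the other matrices are counted:

```python
import numpy as np

num_sents = 12_000_000  # ~12M training segments
max_len = 80            # assumed maximum segment length (newseqlength)

# alignments = np.zeros((num_sents, max_len, max_len), dtype=np.uint8)
# uint8 means one byte per cell, so the dense alignment tensor alone needs:
alignment_bytes = num_sents * max_len * max_len * np.dtype(np.uint8).itemsize
print(alignment_bytes / 1024**3)  # ~71.5 GiB, before the source/target matrices
```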

Is there a quick and easy way to avoid the MemoryError we get here and run such a training with only about 64 GB of memory?

yoonkim commented 7 years ago

The current preprocessing code is not very efficient. Here are some ideas on tweaking it:

Otherwise, you can break the data up into shards.
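
A minimal sketch of what sharding the input can look like, with hypothetical file names (the repository's preprocess-shards.py does the actual slicing): split the parallel files into fixed-size, line-aligned chunks and preprocess each chunk separately.

```python
import itertools

def write_shards(paths, shard_size=500_000):
    # Split parallel files (source, target, and optionally the alignment
    # file for guided alignment) into line-aligned shards of shard_size
    # segments each, so every shard can be preprocessed on its own.
    files = [open(p) for p in paths]
    try:
        for i in itertools.count():
            chunks = [list(itertools.islice(f, shard_size)) for f in files]
            if not chunks[0]:
                break
            for path, chunk in zip(paths, chunks):
                with open(f"{path}.shard{i}", "w") as out:
                    out.writelines(chunk)
    finally:
        for f in files:
            f.close()

# Hypothetical file names; per-run memory is then bounded by the shard size
# rather than the full corpus when each shard is preprocessed separately.
write_shards(["train.src", "train.tgt", "train.align"])
```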

oraveczcsaba commented 7 years ago

So in the end I went with shards: I hacked the slicing logic from preprocess-shards.py into preprocess.py, and now a 500k-segment slice takes up about 12-14 GB with alignment, which is manageable.
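
Plugging a 500k-segment shard into the same back-of-the-envelope estimate shows why this becomes manageable (a sketch; the reported 12-14 GB also covers the source/target matrices and whatever else preprocess.py allocates while running):

```python
shard_sents = 500_000
max_len = 80

# Dense uint8 alignment tensor for a single shard:
alignment_gib = shard_sents * max_len * max_len / 1024**3
print(f"{alignment_gib:.1f} GiB")  # ~3.0 GiB, versus ~71.5 GiB for the full 12M corpus
```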