Closed: oraveczcsaba closed this issue 7 years ago.
The current preprocessing is not very efficient. Here are some ideas on tweaking it:
Otherwise, you can break the data up into shards.
So in the end I went with shards: I hacked the slicing logic from preprocess-shards.py into preprocess.py, and now a 500k-segment slice takes about 12-14 GB with alignment, which is manageable.
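For illustration, here is a minimal sketch of the slicing idea (this is not the actual preprocess-shards.py code; the file names and the write_shards helper are invented for the example): the corpus and alignment files are streamed once and written out as fixed-size shards, so each shard can then be preprocessed on its own within a bounded memory budget.

```python
# Hypothetical sketch only; not taken from preprocess-shards.py.
from itertools import islice, count

SHARD_SIZE = 500_000  # segments per shard, matching the 500k slices mentioned above

def write_shards(src_path, tgt_path, align_path, out_prefix, shard_size=SHARD_SIZE):
    """Stream the parallel corpus plus alignments and write fixed-size shards."""
    with open(src_path) as src, open(tgt_path) as tgt, open(align_path) as aln:
        lines = zip(src, tgt, aln)  # lazy: consumed shard by shard
        for shard_id in count():
            chunk = list(islice(lines, shard_size))
            if not chunk:
                break
            with open(f"{out_prefix}.src.{shard_id}", "w") as fs, \
                 open(f"{out_prefix}.tgt.{shard_id}", "w") as ft, \
                 open(f"{out_prefix}.aln.{shard_id}", "w") as fa:
                for s, t, a in chunk:
                    fs.write(s)
                    ft.write(t)
                    fa.write(a)
```

Each shard can then be fed to preprocess.py separately, which is essentially what the hack above does in one pass.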
The preprocess.py script initializes a couple of matrices for data storage. For big training datasets (we are now trying to train on some 12M segments) this seems to need a large amount of memory, especially if we want to use guided alignment. I might be wrong, but I would roughly estimate it at hundreds of GB: alignments = np.zeros((num_sents, newseqlength, newseqlength), dtype=np.uint8) with 12M segments and a maximum length of about 80 tokens per segment.
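A quick back-of-the-envelope check of that estimate, assuming exactly the shapes quoted above (12M segments, maximum length 80, uint8 cells):

```python
import numpy as np

num_sents = 12_000_000   # ~12M segments
newseqlength = 80        # max ~80 tokens per segment

# dense alignment tensor alone, one byte (uint8) per cell
align_bytes = num_sents * newseqlength * newseqlength * np.dtype(np.uint8).itemsize
print(align_bytes / 1e9)  # ~76.8 GB, before any of the other matrices are counted
```

So the alignment tensor alone is already well beyond 64 GB, and the remaining matrices only push the total higher.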
Would there be some quick and easy way of avoiding the MemoryError we get here and running such a training with only about 64 GB of memory?