Closed: oraveczcsaba closed this issue 7 years ago.
The current preprocessing is not very efficient. Here are some ideas on tweaking it:
Otherwise, you can break the data up into shards.
So in the end I went with shards: I hacked the slicing logic from preprocess-shards.py into preprocess.py, and now a 500k-segment slice takes about 12-14 GB with alignment, which is manageable.
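For illustration, here is a minimal sketch of the slicing idea (this is not the actual preprocess-shards.py code; the file names and the write_shards helper are invented for the example): the corpus and alignment files are streamed once and written out as fixed-size shards, so each shard can then be preprocessed on its own within a bounded memory budget.

```python
# Hypothetical sketch only; not taken from preprocess-shards.py.
from itertools import islice, count

SHARD_SIZE = 500_000  # segments per shard, matching the 500k slices mentioned above

def write_shards(src_path, tgt_path, align_path, out_prefix, shard_size=SHARD_SIZE):
    """Stream the parallel corpus plus alignments and write fixed-size shards."""
    with open(src_path) as src, open(tgt_path) as tgt, open(align_path) as aln:
        lines = zip(src, tgt, aln)  # lazy: consumed shard by shard
        for shard_id in count():
            chunk = list(islice(lines, shard_size))
            if not chunk:
                break
            with open(f"{out_prefix}.src.{shard_id}", "w") as fs, \
                 open(f"{out_prefix}.tgt.{shard_id}", "w") as ft, \
                 open(f"{out_prefix}.aln.{shard_id}", "w") as fa:
                for s, t, a in chunk:
                    fs.write(s)
                    ft.write(t)
                    fa.write(a)
```

Each shard can then be fed to preprocess.py separately, which is essentially what the hack above does in one pass.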
The preprocess.py script initializes a couple of matrices for data storage. For big training datasets (we are now trying to train on some 12M segments) this seems to need a large amount of memory, especially if we want to use guided alignment. I might be wrong, but I would roughly estimate it at hundreds of GB: alignments = np.zeros((num_sents, newseqlength, newseqlength), dtype=np.uint8) with 12M segments and a maximum length of about 80 tokens per segment.
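A quick back-of-the-envelope check of that estimate, assuming exactly the shapes quoted above (12M segments, maximum length 80, uint8 cells):

```python
import numpy as np

num_sents = 12_000_000   # ~12M segments
newseqlength = 80        # max ~80 tokens per segment

# dense alignment tensor alone, one byte (uint8) per cell
align_bytes = num_sents * newseqlength * newseqlength * np.dtype(np.uint8).itemsize
print(align_bytes / 1e9)  # ~76.8 GB, before any of the other matrices are counted
```

So the alignment tensor alone is already well beyond 64 GB, and the remaining matrices only push the total higher.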
Would there be some quick and easy way of avoiding the MemoryError we get here and running such a training with only about 64 GB of memory?