Open nikhilrayaprolu opened 3 years ago
@jxhe @muggin
Hi @nikhilrayaprolu,
I faced the same problem as you. It happened because running the preprocessing script on the whole cnndm training dataset took more than 32GB of RAM. I would suggest splitting the train set into several parts, then merging them after preprocessing of those parts has finished.
thanks for the reply @geeraay
@geeraay can you provide some more explanation of how the splitting and merging is done? Any accompanying code would be really helpful.
I don't remember the exact steps I took back then, but the idea is this.
I did something like
split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source.
With GNU split, the -d flag gives numeric suffixes, so it will create train.source.00, train.source.01, and so on, one file per part. Numbering starts at 00, so the last suffix is nsplit - 1.
Then I renamed the generated files to
train_1.source, train_2.source, ..., train_${nsplit}.source.
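In case it helps, here is a minimal Python sketch of the split and rename steps above done in one go. It is not part of the repo: the function name `split_by_lines` and its `stem`/`suffix` parameters are made up for illustration. Unlike `split -n l/N`, which as far as I know balances the chunks by size in bytes, this splits by line count, so if you have a parallel file (for example a train.target, which is an assumption on my side) and split it with the same `nsplit`, the chunks stay aligned line by line.

```python
from pathlib import Path
from typing import List


def split_by_lines(src: Path, nsplit: int, stem: str = "train",
                   suffix: str = ".source") -> List[Path]:
    """Split src into nsplit consecutive chunks named <stem>_<i><suffix>.

    Hypothetical helper, not part of the repo. Chunks are written next to
    src and numbered from 1, matching train_1.source, train_2.source, ...
    """
    if nsplit < 1:
        raise ValueError("nsplit must be at least 1")

    # First pass: count lines without loading the file into memory.
    with src.open("rb") as f:
        total = sum(1 for _ in f)

    # Spread the lines as evenly as possible: the first `extra` chunks
    # get one line more than the others.
    base, extra = divmod(total, nsplit)

    parts = []
    with src.open("rb") as f:
        for i in range(1, nsplit + 1):
            part = src.with_name(f"{stem}_{i}{suffix}")
            n_lines = base + (1 if i <= extra else 0)
            with part.open("wb") as out:
                for _ in range(n_lines):
                    out.write(f.readline())
            parts.append(part)
    return parts


# Example (paths are placeholders):
# parts = split_by_lines(Path("/path-to-file/train.source"), nsplit=8)
# split_arg = ",".join(p.stem for p in parts)   # "train_1,train_2,...,train_8"
```

The `split_arg` string in the commented example is the comma-separated list that goes after `--split`.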
After that you could run
python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}
Wait until the preprocessing step is done. After that I manually copied and pasted the generated files into one big train.source file.
Or you can simply use a machine with more RAM to preprocess without splitting the file.
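The manual copy and paste can also be scripted. Below is a rough sketch, assuming the per-part outputs are plain text files with one record per line; the helper name `merge_parts` and the paths in the example are hypothetical, and which files the preprocessing actually generates is something you need to check in your output folder. The parts must be passed in numeric order: a plain lexicographic sort would put train_10 before train_2.

```python
from pathlib import Path
from typing import Iterable


def merge_parts(parts: Iterable[Path], merged: Path) -> int:
    """Concatenate parts, in the given order, into merged.

    Hypothetical helper, not part of the repo. Returns the number of
    lines written so it can be compared with the original file.
    """
    n_lines = 0
    with merged.open("wb") as out:
        for part in parts:
            with part.open("rb") as f:
                for line in f:
                    # If the last line of a part has no trailing newline,
                    # add one so it does not fuse with the next part.
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    out.write(line)
                    n_lines += 1
    return n_lines


# Example (paths and nsplit are placeholders):
# nsplit = 8
# parts = [Path(f"/path-to-file/train_{i}.source") for i in range(1, nsplit + 1)]
# merge_parts(parts, Path("/path-to-file/train.source"))
```

Comparing the returned line count with the line count of the original train set is a cheap sanity check that nothing was lost in the merge.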
🐛 Bug
On executing
python scripts/preprocess.py cnndm --mode pipeline
Preprocessing gets stuck at this point: some of the oraclewords are not generated either.
Environment
Installed via (pip, source): source