Preprocess stuck - Githubissues

salesforce / ctrl-sum

Resources for the "CTRLsum: Towards Generic Controllable Text Summarization" paper

https://arxiv.org/abs/2012.04281

BSD 3-Clause "New" or "Revised" License

146 stars 24 forks

Preprocess stuck #7

Open nikhilrayaprolu opened 3 years ago

nikhilrayaprolu commented 3 years ago

🐛 Bug

On executing python scripts/preprocess.py cnndm --mode pipeline, preprocessing gets stuck at this point:

Some of the oraclewords are not generated either.

Environment

fairseq Version (e.g., 1.0 or master): recommended commit
PyTorch Version (e.g., 1.0) : 1.8
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): source
Python version: 3.6.8
CUDA/cuDNN version: 10.2

nikhilrayaprolu commented 3 years ago

@jxhe @muggin

geeraay commented 3 years ago

Hi @nikhilrayaprolu,

I faced the same problem as you. It was because running the preprocessing script on the whole cnndm training dataset took more than 32GB of RAM. I would suggest splitting the train set into several parts, then merging them after preprocessing on those parts has finished.

nikhilrayaprolu commented 3 years ago

Thanks for the reply, @geeraay.

nikhilrayaprolu commented 3 years ago

@geeraay, could you explain in more detail how the splitting and merging are done? Any accompanying code would be really helpful.

geeraay commented 3 years ago

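I don't remember the exact steps I took back then, but the idea is this.

I did something like split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source. (the trailing dot is part of the output prefix, and -d makes GNU split use numeric suffixes instead of the default aa, ab, ...). It will create train.source.00, train.source.01, ..., nsplit files in total, numbered from 00.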

Then I renamed the generated files to train_1.source, train_2.source, ..., train_${nsplit}.source.
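The split and rename steps can also be done in one go. Below is a minimal sketch in Python, not the exact procedure used back then: split_by_lines is a hypothetical helper name, and it cuts by line count rather than by bytes as split -n l/N does. Cutting by line count means that a parallel file with the same number of lines (for example a train.target next to train.source, which is an assumption since only train.source is mentioned here) ends up with matching parts when split with the same nsplit.

```python
from pathlib import Path


def split_by_lines(path, nsplit):
    """Split `path` into `nsplit` consecutive parts named <stem>_1<suffix> ... <stem>_<nsplit><suffix>.

    Hypothetical helper: e.g. train.source -> train_1.source, train_2.source, ...
    Files are handled in binary mode so that only b"\n" counts as a line break.
    """
    path = Path(path)

    # First pass: count lines so every part gets (almost) the same number.
    with path.open("rb") as f:
        total = sum(1 for _ in f)
    lines_per_part = -(-total // nsplit)  # ceiling division

    parts = []
    with path.open("rb") as f:
        for i in range(1, nsplit + 1):
            part = path.parent / f"{path.stem}_{i}{path.suffix}"
            with part.open("wb") as out:
                for _ in range(lines_per_part):
                    line = f.readline()
                    if not line:  # end of input reached
                        break
                    out.write(line)
            parts.append(part)
    return parts


# Example (paths are placeholders):
# split_by_lines("/path-to-file/train.source", 4)
```

The last parts can be shorter, or empty, when the line count does not divide evenly.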

After that you could run python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}

Wait until the preprocessing step is done; then I manually copied and pasted the generated files into one big train.source file.

Or you can simply preprocess on a machine with more RAM, without splitting the file.
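The manual copy and paste can be replaced by a short script. This is a rough sketch under a few assumptions: merge_parts is a hypothetical helper, the parts are named train_1.<ext> ... train_<nsplit>.<ext>, and they are line-oriented text files that each end with a newline. Which extensions the pipeline actually writes is not stated here, so ext is left as a parameter; binary outputs would likely not be safe to concatenate this way.

```python
import shutil
from pathlib import Path


def merge_parts(data_dir, nsplit, ext, stem="train"):
    """Concatenate <stem>_1.<ext> ... <stem>_<nsplit>.<ext> into <stem>.<ext>.

    Hypothetical helper. Parts are appended in numeric order so the merged
    file keeps the original example order. Note: an existing <stem>.<ext>
    in `data_dir` is overwritten.
    """
    data_dir = Path(data_dir)
    merged = data_dir / f"{stem}.{ext}"
    with merged.open("wb") as out:
        for i in range(1, nsplit + 1):
            part = data_dir / f"{stem}_{i}.{ext}"
            with part.open("rb") as f:
                shutil.copyfileobj(f, out)  # streams the part, no full load into RAM
    return merged


# Example (path and extension are placeholders):
# merge_parts("/path-to-file", 4, "source")
```

Since the parts are streamed one after another, the merge itself needs very little memory.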