nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License
1.28k stars, 465 forks

Threading Error when running extractive summarizer on multiple GPUs #192

Open cmkumar87 opened 4 years ago

cmkumar87 commented 4 years ago

Command line used to run the code:

python train.py -task ext -mode train -bert_data_path ../bert_data/cnndm/cnndm -ext_dropout 0.1 -model_path ../models/ -lr 2e-3 -visible_gpus 0,1,2,3,4,5,6,7 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    data = self._queue.get(True, queue_wait_duration)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
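
For readers trying to interpret the traceback: it ends with an EOFError raised while a background thread was reading from a multiprocessing connection. In the standard library, that exception is raised when the other end of the pipe has been closed and nothing is left to read. The snippet below is a minimal sketch of that behaviour in isolation. It uses only the standard library, the function name `recv_after_close` is made up for illustration, and it is an assumption (not something confirmed in this thread) that the same mechanism is what triggers the error during multi-GPU training.

```python
import multiprocessing as mp


def recv_after_close():
    """Show that reading from a pipe whose writer is closed raises EOFError.

    Hypothetical helper for illustration only; it is not part of PreSumm
    or tensorboardX.
    """
    # One-way pipe: `reader` can only receive, `writer` can only send.
    reader, writer = mp.Pipe(duplex=False)

    writer.send("event")   # one message is buffered in the pipe
    writer.close()         # the sending side goes away

    first = reader.recv()  # the buffered message is still delivered

    try:
        reader.recv()      # nothing left and the writer is closed
    except EOFError:
        return first, True
    finally:
        reader.close()
    return first, False
```

Calling `recv_after_close()` returns the buffered message together with a flag that says whether the second read hit EOFError.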

zyxnlp commented 3 years ago

Same issue here. Does anyone know how to solve it?

cmkumar87 commented 3 years ago

@ZhouyxNLP Did that get resolved? I retrained the whole thing and the error didn't happen again. As you know, threading errors can be intermittent like that.

zyxnlp commented 3 years ago

@cmkumar87 No, it didn't. I retrained many times, but it still happened. I guess it may be caused by the version of PyTorch I am using. May I ask which version of PyTorch you used?

cmkumar87 commented 3 years ago

@ZhouyxNLP Did you try running it in Pytorch 3.6? You may have to update a little bit of syntax in PreSumm/src/prepro/data_builder.py to port the legacy code to the current conventions, but this is minor. So did you try to run it in Pytorch 3.6?

zyxnlp commented 3 years ago

@cmkumar87 Hi, did you mean Python 3.6? As far as I know, the latest PyTorch version is 1.8.0. I use Python 3.8, by the way, and it still has the problem.

cmkumar87 commented 3 years ago

Sorry, I copied my conda environment name here instead of the version number. Yes, PyTorch 1.6 or later?

Also, by the way, I am only trying to help; I don't own this repo. Thanks!
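
Since the Python version and the PyTorch version got mixed up above, a small script can report both unambiguously. This is only a sketch: the helper name `describe_environment` is made up, it needs Python 3.8 or later for `importlib.metadata`, and the package names `torch` and `tensorboardX` are assumed to be the installed distribution names. A package that is not installed is reported as None.

```python
import sys
from importlib import metadata


def describe_environment(packages=("torch", "tensorboardX")):
    """Return the Python version and the installed version of each package.

    Hypothetical helper for illustration; packages that are not installed
    are reported as None instead of raising.
    """
    info = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```

Running this inside the conda environment used for training gives the versions to quote when comparing setups.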

zyxnlp commented 3 years ago

@cmkumar87 Thank you so much for your kind help!