rycolab / differentiable-subset-pruning


TypeError: can't convert cuda:0 device type tensor to numpy. #1

Closed DujianDing closed 3 years ago

DujianDing commented 3 years ago

Hi,

I got the error "TypeError: can't convert cuda:0 device type tensor to numpy" when trying to run joint DSP with Enc-Dec on IWSLT. My experiment settings and the full error message are attached below. I would highly appreciate your thoughts on how to solve this problem. Thank you in advance!

Dependencies: python 3.7.4; perl 5.22.2; pytorch 1.7.1+cu101. The command used to execute run_dsp.py is exactly the same as the one in the README, except for the additional option "--ddp-backend no_c10d". The detailed error message is as follows:

Traceback (most recent call last):
  File "run_dsp.py", line 451, in <module>
    cli_main()
  File "run_dsp.py", line 447, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/fairseq/distributed_utils.py", line 320, in call_main
    cfg.distributed_training.distributed_world_size,
  File "/localscratch/dujian.15168123.0/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/localscratch/dujian.15168123.0/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/localscratch/dujian.15168123.0/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/localscratch/dujian.15168123.0/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/fairseq/distributed_utils.py", line 302, in distributed_main
    main(cfg, **kwargs)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/examples/pruning/run_dsp.py", line 153, in main
    valid_losses, should_stop, global_step = train(cfg, trainer, task, epoch_itr, global_step)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/examples/pruning/run_dsp.py", line 281, in train
    cfg, trainer, task, epoch_itr, valid_subsets, end_of_epoch
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/examples/pruning/run_dsp.py", line 333, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/examples/pruning/run_dsp.py", line 409, in validate
    trainer.valid_step(sample)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/fairseq/trainer.py", line 881, in valid_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/fairseq/trainer.py", line 1204, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/scratch/dujian/differentiable-subset-pruning/fairseq/fairseq/tasks/translation.py", line 419, in reduce_metrics
    metrics.log_scalar("_bleu_counts", np.array(counts))
  File "/localscratch/dujian.15168123.0/env/lib/python3.7/site-packages/torch/tensor.py", line 630, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
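For context, the failure in the last frame can be reproduced in isolation: `np.array(...)` on a tensor invokes `Tensor.__array__`, which calls `self.numpy()`, and PyTorch refuses to do that implicit copy when the tensor lives on a CUDA device. A minimal sketch of the error and the safe conversion pattern (the `to_host` helper name is hypothetical, not part of fairseq):

```python
import numpy as np
import torch

def to_host(value):
    # Hypothetical helper: move a tensor to host memory so numpy can
    # consume it. Harmless for CPU tensors and plain Python numbers,
    # so it can be applied unconditionally to logging values.
    if torch.is_tensor(value):
        return value.detach().cpu()
    return value

counts = torch.tensor([3, 1, 4])   # with counts.cuda(), np.array(counts) raises
                                   # "can't convert cuda:0 device type tensor to numpy"
safe = np.array(to_host(counts))   # works on CPU and CUDA tensors alike
print(safe.tolist())               # [3, 1, 4]
```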

zxshamson commented 2 years ago

Hi, do you have any solution for this? I ran into the same error, too.

jiaodali commented 2 years ago

Hi Xingshan, I used only one GPU, so I didn't encounter this error. If you want to use multiple GPUs, I believe this commit would help: https://github.com/pytorch/fairseq/commit/09945b45d4e2608563b1b18c3bbe289bf9351529.
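If I read the linked commit correctly, the idea is to move the reduced logging values to host memory before `reduce_metrics` hands them to numpy. A rough sketch of that pattern (not the exact patch):

```python
import torch

def sum_logs(logging_outputs, key):
    # Sketch of the fix pattern: in multi-GPU runs the per-worker BLEU
    # counts can arrive as CUDA tensors, so the summed value is copied
    # to the CPU before np.array() ever sees it.
    result = sum(log.get(key, 0) for log in logging_outputs)
    if torch.is_tensor(result):
        result = result.cpu()
    return result
```

With this in place, `np.array(sum_logs(logging_outputs, "_bleu_counts"))` no longer trips over a device-resident tensor during validation.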

zxshamson commented 2 years ago

Thanks! That solves my error.