linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8. #38

Closed · wqf321 closed 2 years ago

wqf321 commented 2 years ago

Hi, I got a problem when running the command "horovodrun -np 2 python pretrain.py --config config/pretrain-tv-16gpu.json --output_dir ./pre_train_ckpt/ckpt/". Could you please help me?

0%|          | 500/100000 [06:57<22:23:43,  1.23it/s][1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   -------------------------------------------
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   Step 500:
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   mlm_tv_all: 3384 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   mfm-nce_tv_all: 3192 examples trained at 7 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   fom_tv_all: 1968 examples trained at 4 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   vsm_tv_all: 3456 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   ===========================================
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   Step 500: start running validation
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   validate on mlm_tv_all task
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   start running MLM validation...
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   validation finished in 2 seconds, acc: 2.41
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   validate on mfm-nce_tv_all task
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   start running MFM-NCE validation...
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   validation finished in 2 seconds, loss: 15.16, acc: 1.99 (average 350 negatives)
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   validate on fom_tv_all task
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   start running FOM validation...
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   validation finished in 2 seconds, score: 1.92
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   validate on vsm_tv_all task
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   start running VSM validation...
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "pretrain.py", line 621, in <module>
[1,1]<stderr>:    main(args)
[1,1]<stderr>:  File "pretrain.py", line 372, in main
[1,1]<stderr>:    validate(model, val_dataloaders, opts)
[1,1]<stderr>:  File "pretrain.py", line 403, in validate
[1,1]<stderr>:    val_log = validate_vsm(model, loader, opts)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,1]<stderr>:    return func(*args, **kwargs)
[1,1]<stderr>:  File "pretrain.py", line 436, in validate_vsm
[1,1]<stderr>:    val_loss_st_ed = sum(all_gather_list(val_loss_st_ed))
[1,1]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/utils/distributed.py", line 190, in all_gather_list
[1,1]<stderr>:    out_buffer = hvd.allgather(in_buffer[:enc_byte+enc_size])
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,1]<stderr>:    return HorovodAllgather.apply(tensor, name)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,1]<stderr>:    return synchronize(handle)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,1]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,1]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "pretrain.py", line 621, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "pretrain.py", line 372, in main
[1,0]<stderr>:    validate(model, val_dataloaders, opts)
[1,0]<stderr>:  File "pretrain.py", line 403, in validate
[1,0]<stderr>:    val_log = validate_vsm(model, loader, opts)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,0]<stderr>:    return func(*args, **kwargs)
[1,0]<stderr>:  File "pretrain.py", line 427, in validate_vsm
[1,0]<stderr>:    model(batch, 'vsm', compute_loss=True)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
[1,0]<stderr>:    **applier(kwargs, input_caster))
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 84, in forward
[1,0]<stderr>:    batch['c_attn_masks'])
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 400, in get_video_level_scores
[1,0]<stderr>:    modularized_query = vsm_allgather(modularized_query).contiguous()
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 452, in vsm_allgather
[1,0]<stderr>:    return VsmAllgather.apply(tensor, None)
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 437, in forward
[1,0]<stderr>:    torch.tensor([ctx.dim], device=tensor.device)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,0]<stderr>:    return HorovodAllgather.apply(tensor, name)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,0]<stderr>:    return synchronize(handle)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,0]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,0]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28050,1],1]
  Exit code:    1
-------------------------------------------------------------------------- 
linjieli222 commented 2 years ago

Hi there,

Thanks for your interest in our HERO project. We did not run into this issue during our experiments. I see you are using your own virtual environment, which might be where the discrepancy comes from.

At first glance, one tensor has a different data type across ranks (uint8 vs. int64). My suggestion is to find out exactly which tensor it is, starting from "model/pretrain.py, line 452", and cast it to the same data type across all ranks.
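
As an illustration of that suggestion (not the actual HERO code), here is a minimal sketch that forces a consistent dtype before the collective call, assuming the mismatched tensor is the one handed to hvd.allgather:

```python
import torch
import horovod.torch as hvd

def allgather_same_dtype(tensor, dtype=torch.int64):
    # Cast on every rank before the collective; hvd.allgather requires the
    # gathered tensor to have the same dtype on all ranks, otherwise it raises
    # the "Mismatched data types" RuntimeError shown in the traceback above.
    return hvd.allgather(tensor.to(dtype))
```

Note also that the traceback shows the two ranks inside different allgather call sites (rank 1 in utils/distributed.py all_gather_list, which gathers a uint8 byte buffer, and rank 0 in model/pretrain.py vsm_allgather, which gathers an int64 size tensor), so it is worth checking that all ranks reach the same collective calls in the same order during VSM validation.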

linjieli222 commented 2 years ago

Closed due to inactivity