Hi,
Thanks for your interest in this project. During our experiments, we observed the same error a few times, but it is not reproducible: when we repeat the exact same experiment, training runs smoothly. Since the error is rare and random on our end, we do not have any insight into why it happens. You can also look into the Horovod repo; they may be able to provide a more insightful explanation.
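If it helps with debugging on your side, Horovod can usually report more detail about which collective is stalling, e.g. by rerunning with `HOROVOD_LOG_LEVEL=debug` set or by recording a timeline with `HOROVOD_TIMELINE=/path/to/timeline.json`. Both are standard Horovod environment variables as far as we know, but please double-check them against your installed Horovod version.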
A quick fix is to run your experiments with 1 GPU. However, it may result in lower performance and slower training.
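For reference, the single-GPU fallback would be something along the lines of `horovodrun -np 1 python train_vcmr.py --config config/train-tvr-8gpu.json`, reusing the config you already have; note that hyper-parameters tuned for 8 GPUs (batch size, learning rate, training steps) may need adjusting to match the reported results.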
@linjieli222 OK, thanks for your reply. This problem may be related to our hardware. I ran the experiments on a single GPU and it works fine. I have another request: could you provide us a subset of the data for pre-training? The code is somewhat complicated...
You can use the TV dataset to conduct a relatively "smaller" pre-training experiment.
Closed due to inactivity.
I only have two GPUs in my machine. When I run the command `horovodrun -np 2 python train_vcmr.py --config config/train-tvr-8gpu.json`, the training deadlocks. I attach the output here. How can I solve this issue?
[1,0]:02/01/2021 22:11:30 - INFO - main - Loading tvr train dataset /video/tv
[1,0]:02/01/2021 22:11:33 - INFO - main - 87153 samples loaded
[1,1]:
[1,0]:02/01/2021 22:11:33 - INFO - main - Loading tvr validation dataset/video/tv
[1,0]:02/01/2021 22:11:33 - INFO - main - 10895 samples loaded
[1,0]:02/01/2021 22:11:33 - INFO - main - Loading Inference Dataset /txt/tvr_val.db (val)
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Model config:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Cross-Modal Transformer config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 6,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 2,
[1,0]: "vocab_size": 50272
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Temporal Transformer config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 3,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 2,
[1,0]: "vocab_size": -1
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - QueryEncoder config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 0,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 1,
[1,0]: "vocab_size": 50272
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Decoder Transformer config: None
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Model config:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Cross-Modal Transformer config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 6,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 2,
[1,1]: "vocab_size": 50272
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Temporal Transformer config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 3,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 2,
[1,1]: "vocab_size": -1
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - QueryEncoder config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 0,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 1,
[1,1]: "vocab_size": 50272
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Decoder Transformer config: None
[1,1]:02/01/2021 22:11:45 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVcmr: ['vocab_padded']
[1,0]:02/01/2021 22:11:45 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVcmr: ['vocab_padded']
[1,1]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,1]:
[1,1]:Defaults for this optimization level are:
[1,1]:enabled : True
[1,1]:opt_level : O2
[1,1]:cast_model_type : torch.float16
[1,1]:patch_torch_functions : False
[1,1]:keep_batchnorm_fp32 : True
[1,1]:master_weights : True
[1,1]:loss_scale : dynamic
[1,1]:Processing user overrides (additional kwargs that are not None)...
[1,1]:After processing overrides, optimization options are:
[1,1]:enabled : True
[1,1]:opt_level : O2
[1,1]:cast_model_type : torch.float16
[1,1]:patch_torch_functions : False
[1,1]:keep_batchnorm_fp32 : True
[1,1]:master_weights : True
[1,1]:loss_scale : dynamic
[1,0]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]:
[1,0]:Defaults for this optimization level are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,0]:Processing user overrides (additional kwargs that are not None)...
[1,0]:After processing overrides, optimization options are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,1]:restorer is finished
[1,0]:restorer is finished
[1,0]:02/01/2021 22:11:46 - INFO - main - Waiting on git info....
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:46 - INFO - main - Git branch:
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:46 - INFO - main - Git SHA:
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:47 - ERROR - main - Command '['git', 'status', '--short']' returned non-zero exit status 128.
[1,0]:Traceback (most recent call last):
[1,0]: File "/src/utils/save.py", line 51, in save_training_meta
[1,0]: cwd=git_dir, universal_newlines=True).strip()
[1,0]: File "/opt/conda/lib/python3.6/subprocess.py", line 356, in check_output
[1,0]: **kwargs).stdout
[1,0]: File "/opt/conda/lib/python3.6/subprocess.py", line 438, in run
[1,0]: output=stdout, stderr=stderr)
[1,0]:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.
[1,0]:02/01/2021 22:11:47 - WARNING - main - Git info not found. Saving code into zip instead...
[1,0]:02/01/2021 22:11:47 - INFO - main - Saving code from /src to /storage/tvr_default/code.zip...
[1,0]:[2021-02-01 22:13:23.215038: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
[1,0]:[2021-02-01 22:14:23.216625: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
[1,0]:[2021-02-01 22:15:23.218786: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
[1,0]:[2021-02-01 22:16:23.223750: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
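For anyone hitting the same stall: the warning above means rank 0 has submitted a Horovod `allgather` that rank 1 never joined, so the collective blocks until the stall inspector flags it. A minimal sketch of the contract Horovod expects, as an illustration only (this is not the project's code; it just assumes `horovod.torch` is installed):

```python
import torch
import horovod.torch as hvd

hvd.init()

# Every rank must submit the same sequence of collectives.
# If one rank never reaches this call (or is delayed past the stall-check
# window), the other ranks block here and the stall inspector prints the
# "Stalled ranks ... allgather" warning seen in the log above.
local = torch.tensor([hvd.rank()])
gathered = hvd.allgather(local)  # blocks until ALL ranks have called it

print(f"rank {hvd.rank()} gathered {gathered.tolist()}")
```

Launched with `horovodrun -np 2 python allgather_demo.py` (file name hypothetical), this completes only when both processes make the call. In the log above, rank 0 has submitted `allgather.noname.40` while rank 1 never does, which is exactly the deadlock being reported.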