mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

TorchRec DLRM Failed to initialize NumPy: _ARRAY_API not found #760

Open rvernica opened 3 months ago

rvernica commented 3 months ago

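I am unable to run TorchRec DLRM using the provided Dockerfile and requirements.txt, on the latest revision of the master branch. The image builds successfully, but both ranks fail at startup with "Failed to initialize NumPy: _ARRAY_API not found", followed by "RuntimeError: Numpy is not available".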

> cp Dockerfile Dockerfile.torchx
> torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-05 13:26:15 INFO     Tracker configurations: {}
torchx 2024-08-05 13:26:15 INFO     Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 13:26:15 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 13:26:15 INFO     Workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` resolved to filesystem path `/proj/java-gpu/training/recommendation_v2/torchrec_dlrm`
torchx 2024-08-05 13:26:16 INFO     Building workspace docker image (this may take a while)...
torchx 2024-08-05 13:26:16 INFO     Step 1/7 : ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
torchx 2024-08-05 13:26:16 INFO     Step 2/7 : FROM ${FROM_IMAGE_NAME}
torchx 2024-08-05 13:26:16 INFO      ---> 71eb2d092138
torchx 2024-08-05 13:26:16 INFO     Step 3/7 : RUN apt-get -y update &&     apt-get -y install git
torchx 2024-08-05 13:26:16 INFO      ---> Using cache
torchx 2024-08-05 13:26:16 INFO      ---> 45eded198de2
torchx 2024-08-05 13:26:16 INFO     Step 4/7 : WORKDIR /workspace/torchrec_dlrm
torchx 2024-08-05 13:26:16 INFO      ---> Using cache
torchx 2024-08-05 13:26:16 INFO      ---> 1b41a30dcd79
torchx 2024-08-05 13:26:16 INFO     Step 5/7 : COPY . .
torchx 2024-08-05 13:26:16 INFO      ---> ae30b5f5e5a1
torchx 2024-08-05 13:26:16 INFO     Step 6/7 : RUN pip install --no-cache-dir -r requirements.txt
torchx 2024-08-05 13:26:16 INFO      ---> Running in 3ef0c644fc38
...
torchx 2024-08-05 13:27:02 INFO      ---> Removed intermediate container 3ef0c644fc38
torchx 2024-08-05 13:27:02 INFO      ---> addfe3ce01cb
torchx 2024-08-05 13:27:02 INFO     Step 7/7 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-08-05 13:27:02 INFO      ---> Running in 4e254643ce54
torchx 2024-08-05 13:27:02 INFO      ---> Removed intermediate container 4e254643ce54
torchx 2024-08-05 13:27:02 INFO      ---> 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO     [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-08-05 13:27:02 INFO     Successfully built 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO     Built new image `sha256:861ee2a4e5d33dca93d9fe8847feccd4028d2e27c8f281654307aeec203452bd` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` for role[0]=dlrm_main.
local_docker://torchx/dlrm_main-sbz7tbpcb2sqvd
torchx 2024-08-05 13:27:03 INFO     Waiting for the app to finish...
dlrm_main/0 WARNING:torch.distributed.run:
dlrm_main/0 *****************************************
dlrm_main/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
dlrm_main/0 *****************************************
dlrm_main/0 [0]:
dlrm_main/0 [0]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [0]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [0]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [0]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [0]:
dlrm_main/0 [0]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [0]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [0]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [0]:
dlrm_main/0 [0]:Traceback (most recent call last):  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [0]:    import torchmetrics as metrics
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics import functional  # noqa: E402
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [0]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [0]:    if not _module_available(package):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [0]:    module = import_module(module_names[0])
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [0]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [0]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [0]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from .faster_rcnn import *
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [0]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [0]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [0]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [0]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [0]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:
dlrm_main/0 [1]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [1]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [1]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [1]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [1]:
dlrm_main/0 [1]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [1]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [1]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [1]:
dlrm_main/0 [1]:Traceback (most recent call last):  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [1]:    import torchmetrics as metrics
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics import functional  # noqa: E402
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [1]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [1]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [1]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [1]:    if not _module_available(package):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [1]:    module = import_module(module_names[0])
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [1]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [1]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [1]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [1]:    from .faster_rcnn import *
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [1]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [1]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [1]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [1]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [1]:    main(sys.argv[1:])
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 813, in main
dlrm_main/0 [1]:    plan = planner.collective_plan(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/planner/planners.py", line 177, in collective_plan
dlrm_main/0 [1]:    return invoke_on_rank_and_broadcast_result(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/collective_utils.py", line 58, in invoke_on_rank_and_broadcast_result
dlrm_main/0 [1]:    dist.broadcast_object_list(object_list, rank, group=pg)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2106, in broadcast_object_list
dlrm_main/0 [1]:    object_list[i] = _tensor_to_object(obj_view, obj_size)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in _tensor_to_object
dlrm_main/0 [1]:    buf = tensor.numpy().tobytes()[:tensor_size]
dlrm_main/0 [1]:RuntimeError: Numpy is not available
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [0]:    main(sys.argv[1:])
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 817, in main
dlrm_main/0 [0]:    model = DistributedModelParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 232, in __init__
dlrm_main/0 [0]:    self.init_data_parallel()
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 266, in init_data_parallel
dlrm_main/0 [0]:    self._data_parallel_wrapper.wrap(self, self._env, self.device)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 97, in wrap
dlrm_main/0 [0]:    DistributedDataParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
dlrm_main/0 [0]:    _verify_param_shape_across_processes(self.process_group, parameters)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
dlrm_main/0 [0]:    return dist._verify_params_across_processes(process_group, tensors, logger)
dlrm_main/0 [0]:RuntimeError: [/opt/conda/conda-bld/pytorch_1670525552843/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.20.0.2]:54499
dlrm_main/0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25) of binary: /opt/conda/bin/python
dlrm_main/0 [0]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [1]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [0]:{'adagrad': False,
dlrm_main/0 [0]: 'allow_tf32': False,
dlrm_main/0 [0]: 'batch_size': 32,
dlrm_main/0 [0]: 'collect_multi_hot_freqs_stats': False,
dlrm_main/0 [0]: 'dataset_name': 'criteo_1t',
dlrm_main/0 [0]: 'dcn_low_rank_dim': 512,
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0   File "/opt/conda/bin/torchrun", line 33, in <module>
dlrm_main/0     sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
dlrm_main/0 [0]: 'dcn_num_layers': 3,
dlrm_main/0 [0]: 'dense_arch_layer_sizes': [512, 256, 64],
dlrm_main/0 [0]: 'drop_last_training_batch': False,
dlrm_main/0 [0]: 'embedding_dim': 64,
dlrm_main/0 [0]: 'epochs': 1,
dlrm_main/0 [0]: 'evaluate_on_epoch_end': False,
dlrm_main/0 [0]: 'evaluate_on_training_end': False,
dlrm_main/0 [0]: 'in_memory_binary_criteo_path': None,
dlrm_main/0 [0]: 'interaction_branch1_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_branch2_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_type': <InteractionType.ORIGINAL: 'original'>,
dlrm_main/0 [0]: 'learning_rate': 15.0,
dlrm_main/0 [0]: 'limit_test_batches': None,
dlrm_main/0 [0]: 'limit_train_batches': None,
dlrm_main/0 [0]: 'limit_val_batches': None,
dlrm_main/0 [0]: 'lr_decay_start': 0,
dlrm_main/0 [0]: 'lr_decay_steps': 0,
dlrm_main/0 [0]: 'lr_warmup_steps': 0,
dlrm_main/0 [0]: 'mmap_mode': False,
dlrm_main/0 [0]: 'multi_hot_distribution_type': None,
dlrm_main/0 [0]: 'multi_hot_sizes': None,
dlrm_main/0 [0]: 'num_embeddings': 100000,
dlrm_main/0 [0]: 'num_embeddings_per_feature': None,
dlrm_main/0 [0]: 'over_arch_layer_sizes': [512, 512, 256, 1],
dlrm_main/0 [0]: 'pin_memory': False,
dlrm_main/0 [0]: 'print_lr': False,
dlrm_main/0 [0]: 'print_progress': False,
dlrm_main/0 [0]: 'print_sharding_plan': False,
dlrm_main/0 [0]: 'seed': None,
dlrm_main/0 [0]: 'shuffle_batches': False,
dlrm_main/0 [0]: 'shuffle_training_set': False,
dlrm_main/0 [0]: 'synthetic_multi_hot_criteo_path': None,
dlrm_main/0 [0]: 'test_batch_size': None,
dlrm_main/0 [0]: 'validation_auroc': None,
dlrm_main/0 [0]: 'validation_freq_within_epoch': None}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "dlrm_dcnv2", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 7}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 11}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 15}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 19}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 23}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 64, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 705}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 709}}
dlrm_main/0     return f(*args, **kwargs)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "seed", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 713}}
dlrm_main/0     run(args)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
dlrm_main/0     elastic_launch(
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
dlrm_main/0     return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
dlrm_main/0     raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 [1]:
dlrm_main/0   time      : 2024-08-05_20:27:09
dlrm_main/0   host      : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank      : 1 (local_rank: 1)
dlrm_main/0   exitcode  : 1 (pid: 26)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0   time      : 2024-08-05_20:27:09
dlrm_main/0   host      : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank      : 0 (local_rank: 0)
dlrm_main/0   exitcode  : 1 (pid: 25)
dlrm_main/0   error_file: <N/A>
dlrm_main/0   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2024-08-05 13:27:10 INFO     Job finished: FAILED
torchx 2024-08-05 13:27:10 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles:
  - replicas:
    - hostname: dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
      id: 0
      role: dlrm_main
      state: !!python/object/apply:torchx.specs.api.AppState
      - 5
      structured_error_msg: <NONE>
    role: dlrm_main
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: null
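
The log above reports NumPy 2.0.1 inside an image based on pytorch/pytorch:1.13.1, and the NumPy message itself suggests downgrading to 'numpy<2'. A minimal sketch of a check that could be run inside the container to confirm which NumPy major version pip resolved is shown below. The helper names (numpy_major, is_numpy_1x, describe_installed_numpy) are hypothetical and not part of the repository, and whether pinning 'numpy<2' in requirements.txt resolves this particular failure is an assumption that has not been verified here.

```python
from importlib.metadata import PackageNotFoundError, version


def numpy_major(version_string: str) -> int:
    """Return the major component of a version string such as '2.0.1'."""
    return int(version_string.split(".")[0])


def is_numpy_1x(version_string: str) -> bool:
    """True when the major version is below 2 (the range 'numpy<2' selects)."""
    return numpy_major(version_string) < 2


def describe_installed_numpy() -> str:
    """Report the installed NumPy version without importing numpy or torch."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return "numpy is not installed"
    if is_numpy_1x(installed):
        return f"numpy {installed}: below 2.0, usable by modules compiled against NumPy 1.x"
    return f"numpy {installed}: 2.0 or newer, modules compiled against NumPy 1.x may fail to load it"


if __name__ == "__main__":
    print(describe_installed_numpy())
```

The check reads package metadata only, so it does not trigger the import-time warning seen in the traceback.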