rvernica opened this issue 3 months ago
I'm unable to run the TorchRec DLRM benchmark using the provided Dockerfile and requirements.txt, on the latest revision of the master branch. The image builds fine, but at runtime rank 1 crashes with `RuntimeError: Numpy is not available` (NumPy 2.0.1 gets installed, while the torch 1.13.1 wheels in the base image were compiled against NumPy 1.x), after which rank 0 dies with a Gloo `Connection closed by peer`. Both ranks also report `libcuda.so.1: cannot open shared object file`. Full output:
```
> cp Dockerfile Dockerfile.torchx
> torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-05 13:26:15 INFO Tracker configurations: {}
torchx 2024-08-05 13:26:15 INFO Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 13:26:15 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 13:26:15 INFO Workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` resolved to filesystem path `/proj/java-gpu/training/recommendation_v2/torchrec_dlrm`
torchx 2024-08-05 13:26:16 INFO Building workspace docker image (this may take a while)...
torchx 2024-08-05 13:26:16 INFO Step 1/7 : ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
torchx 2024-08-05 13:26:16 INFO Step 2/7 : FROM ${FROM_IMAGE_NAME}
torchx 2024-08-05 13:26:16 INFO ---> 71eb2d092138
torchx 2024-08-05 13:26:16 INFO Step 3/7 : RUN apt-get -y update && apt-get -y install git
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 45eded198de2
torchx 2024-08-05 13:26:16 INFO Step 4/7 : WORKDIR /workspace/torchrec_dlrm
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 1b41a30dcd79
torchx 2024-08-05 13:26:16 INFO Step 5/7 : COPY . .
torchx 2024-08-05 13:26:16 INFO ---> ae30b5f5e5a1
torchx 2024-08-05 13:26:16 INFO Step 6/7 : RUN pip install --no-cache-dir -r requirements.txt
torchx 2024-08-05 13:26:16 INFO ---> Running in 3ef0c644fc38
...
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 3ef0c644fc38
torchx 2024-08-05 13:27:02 INFO ---> addfe3ce01cb
torchx 2024-08-05 13:27:02 INFO Step 7/7 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-08-05 13:27:02 INFO ---> Running in 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-08-05 13:27:02 INFO Successfully built 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO Built new image `sha256:861ee2a4e5d33dca93d9fe8847feccd4028d2e27c8f281654307aeec203452bd` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` for role[0]=dlrm_main.
local_docker://torchx/dlrm_main-sbz7tbpcb2sqvd
torchx 2024-08-05 13:27:03 INFO Waiting for the app to finish...
dlrm_main/0 WARNING:torch.distributed.run:
dlrm_main/0 *****************************************
dlrm_main/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dlrm_main/0 *****************************************
dlrm_main/0 [0]:
dlrm_main/0 [0]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [0]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [0]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [0]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [0]:
dlrm_main/0 [0]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [0]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [0]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [0]:
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [0]:    import torchmetrics as metrics
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics import functional  # noqa: E402
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [0]:    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [0]:    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [0]:    _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [0]:    if not _module_available(package):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [0]:    module = import_module(module_names[0])
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [0]:    return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [0]:    from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [0]:    from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [0]:    from .faster_rcnn import *
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [0]:    from .anchor_utils import AnchorGenerator
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [0]:    class AnchorGenerator(nn.Module):
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [0]:    device: torch.device = torch.device("cpu"),
dlrm_main/0 [0]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [0]:  device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]: [... rank 1 emits the identical NumPy 1.x/2.x warning and the identical torchmetrics -> torchvision import traceback, ending in the same _ARRAY_API UserWarning; trimmed ...]
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [1]:    main(sys.argv[1:])
dlrm_main/0 [1]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 813, in main
dlrm_main/0 [1]:    plan = planner.collective_plan(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/planner/planners.py", line 177, in collective_plan
dlrm_main/0 [1]:    return invoke_on_rank_and_broadcast_result(
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/collective_utils.py", line 58, in invoke_on_rank_and_broadcast_result
dlrm_main/0 [1]:    dist.broadcast_object_list(object_list, rank, group=pg)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2106, in broadcast_object_list
dlrm_main/0 [1]:    object_list[i] = _tensor_to_object(obj_view, obj_size)
dlrm_main/0 [1]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in _tensor_to_object
dlrm_main/0 [1]:    buf = tensor.numpy().tobytes()[:tensor_size]
dlrm_main/0 [1]:RuntimeError: Numpy is not available
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [0]:    main(sys.argv[1:])
dlrm_main/0 [0]:  File "/workspace/torchrec_dlrm/dlrm_main.py", line 817, in main
dlrm_main/0 [0]:    model = DistributedModelParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 232, in __init__
dlrm_main/0 [0]:    self.init_data_parallel()
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 266, in init_data_parallel
dlrm_main/0 [0]:    self._data_parallel_wrapper.wrap(self, self._env, self.device)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 97, in wrap
dlrm_main/0 [0]:    DistributedDataParallel(
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
dlrm_main/0 [0]:    _verify_param_shape_across_processes(self.process_group, parameters)
dlrm_main/0 [0]:  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
dlrm_main/0 [0]:    return dist._verify_params_across_processes(process_group, tensors, logger)
dlrm_main/0 [0]:RuntimeError: [/opt/conda/conda-bld/pytorch_1670525552843/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.20.0.2]:54499
dlrm_main/0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25) of binary: /opt/conda/bin/python
dlrm_main/0 [0]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [1]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [0]:{'adagrad': False,
dlrm_main/0 [0]: 'allow_tf32': False,
dlrm_main/0 [0]: 'batch_size': 32,
dlrm_main/0 [0]: 'collect_multi_hot_freqs_stats': False,
dlrm_main/0 [0]: 'dataset_name': 'criteo_1t',
dlrm_main/0 [0]: 'dcn_low_rank_dim': 512,
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0   File "/opt/conda/bin/torchrun", line 33, in <module>
dlrm_main/0     sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
dlrm_main/0 [0]: 'dcn_num_layers': 3,
dlrm_main/0 [0]: 'dense_arch_layer_sizes': [512, 256, 64],
dlrm_main/0 [0]: 'drop_last_training_batch': False,
dlrm_main/0 [0]: 'embedding_dim': 64,
dlrm_main/0 [0]: 'epochs': 1,
dlrm_main/0 [0]: 'evaluate_on_epoch_end': False,
dlrm_main/0 [0]: 'evaluate_on_training_end': False,
dlrm_main/0 [0]: 'in_memory_binary_criteo_path': None,
dlrm_main/0 [0]: 'interaction_branch1_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_branch2_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_type': <InteractionType.ORIGINAL: 'original'>,
dlrm_main/0 [0]: 'learning_rate': 15.0,
dlrm_main/0 [0]: 'limit_test_batches': None,
dlrm_main/0 [0]: 'limit_train_batches': None,
dlrm_main/0 [0]: 'limit_val_batches': None,
dlrm_main/0 [0]: 'lr_decay_start': 0,
dlrm_main/0 [0]: 'lr_decay_steps': 0,
dlrm_main/0 [0]: 'lr_warmup_steps': 0,
dlrm_main/0 [0]: 'mmap_mode': False,
dlrm_main/0 [0]: 'multi_hot_distribution_type': None,
dlrm_main/0 [0]: 'multi_hot_sizes': None,
dlrm_main/0 [0]: 'num_embeddings': 100000,
dlrm_main/0 [0]: 'num_embeddings_per_feature': None,
dlrm_main/0 [0]: 'over_arch_layer_sizes': [512, 512, 256, 1],
dlrm_main/0 [0]: 'pin_memory': False,
dlrm_main/0 [0]: 'print_lr': False,
dlrm_main/0 [0]: 'print_progress': False,
dlrm_main/0 [0]: 'print_sharding_plan': False,
dlrm_main/0 [0]: 'seed': None,
dlrm_main/0 [0]: 'shuffle_batches': False,
dlrm_main/0 [0]: 'shuffle_training_set': False,
dlrm_main/0 [0]: 'synthetic_multi_hot_criteo_path': None,
dlrm_main/0 [0]: 'test_batch_size': None,
dlrm_main/0 [0]: 'validation_auroc': None,
dlrm_main/0 [0]: 'validation_freq_within_epoch': None}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "dlrm_dcnv2", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 7}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 11}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 15}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 19}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 23}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 64, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 705}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 709}}
dlrm_main/0     return f(*args, **kwargs)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "seed", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 713}}
dlrm_main/0     run(args)
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
dlrm_main/0     elastic_launch(
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
dlrm_main/0     return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
dlrm_main/0     raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 [1]:
dlrm_main/0   time       : 2024-08-05_20:27:09
dlrm_main/0   host       : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank       : 1 (local_rank: 1)
dlrm_main/0   exitcode   : 1 (pid: 26)
dlrm_main/0   error_file : <N/A>
dlrm_main/0   traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0   time       : 2024-08-05_20:27:09
dlrm_main/0   host       : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0   rank       : 0 (local_rank: 0)
dlrm_main/0   exitcode   : 1 (pid: 25)
dlrm_main/0   error_file : <N/A>
dlrm_main/0   traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2024-08-05 13:27:10 INFO Job finished: FAILED
torchx 2024-08-05 13:27:10 ERROR AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles:
  - replicas:
    - hostname: dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
      id: 0
      role: dlrm_main
      state: !!python/object/apply:torchx.specs.api.AppState
      - 5
      structured_error_msg: <NONE>
    role: dlrm_main
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: null
```
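From the log, two problems appear to be stacked on top of each other. The first is the NumPy mismatch: requirements.txt doesn't pin NumPy, so pip resolves it to 2.0.1, while the torch 1.13.1/torchvision wheels in the base image were compiled against NumPy 1.x. Any `tensor.numpy()` call then raises `RuntimeError: Numpy is not available`, which is exactly where rank 1 dies inside `broadcast_object_list` during `planner.collective_plan`; rank 0's `Connection closed by peer` looks like a downstream effect of rank 1 exiting. A minimal check that should reproduce this inside the built image, without torchx (a sketch, assuming a plain `python` session in that image):

```python
# Run inside the image built above, e.g.:
#   docker run --rm -it 861ee2a4e5d3 python
# torch 1.13.1 here was compiled against NumPy 1.x, but requirements.txt
# pulled in NumPy 2.0.1, so tensor<->ndarray conversion is disabled.
import numpy as np
import torch

print(np.__version__)     # 2.0.1 in the failing image
print(torch.__version__)  # 1.13.1

t = torch.zeros(4, dtype=torch.uint8)
t.numpy()  # RuntimeError: Numpy is not available
# This is the same call that fails in torch.distributed's
# _tensor_to_object() on rank 1 in the log above.
```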
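The warning text itself suggests downgrading to `numpy<2`. Adding that constraint to the same pip resolve should keep the resolver from picking 2.x; a sketch of the Dockerfile (reconstructed from the build steps in the log, with only the final RUN changed — I haven't verified this is the intended fix):

```dockerfile
ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
FROM ${FROM_IMAGE_NAME}

RUN apt-get -y update && apt-get -y install git

WORKDIR /workspace/torchrec_dlrm
COPY . .

# Constrain NumPy below 2.x in the same resolve as requirements.txt:
# the torch 1.13.1 and torchvision wheels in the base image were
# compiled against NumPy 1.x and break under NumPy 2.
RUN pip install --no-cache-dir -r requirements.txt "numpy<2"
```

Pinning `numpy<2` directly in requirements.txt would work equally well.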
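Separately, the `libcuda.so.1: cannot open shared object file` lines suggest the container never saw the host GPU driver, so even with NumPy fixed the job would run without CUDA or fail later. A quick, torchx-independent sanity check that the Docker daemon can expose GPUs at all (standard Docker CLI; requires the NVIDIA Container Toolkit on the host):

```sh
# If this fails, the problem is host-side (driver / nvidia-container-toolkit),
# not in the torchrec_dlrm image or torchx itself.
docker run --rm --gpus all pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime nvidia-smi
```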