mlcommons / training_results_v1.0

This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.
https://mlcommons.org/en/training-normal-10/
Apache License 2.0

DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1 #2

Closed tohaowu closed 3 years ago

tohaowu commented 3 years ago

Our AWS p4d.24xlarge job passed on 08/24 with a throughput of 3511 samples/second. We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1.

The same test has recently started failing. The error message is as follows:

  File "/workspace/bert/run_pretraining.py", line 1592, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1344, in main
    for step, batch in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in __init__
    self._reset(loader, first_iter=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
    self._try_put_index()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
    index = self._next_index()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
    for idx in self.sampler:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in __iter__
    yield from torch.randperm(n, generator=generator).tolist()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When we run the test without FI_EFA_USE_DEVICE_RDMA=1, it passes, but the throughput drops to 1673 samples/sec.
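For anyone trying to reproduce the comparison, here is a minimal Python sketch of the two configurations and the size of the throughput gap. The helper names `efa_env` and `slowdown_fraction` are hypothetical and not part of the benchmark code. It also assumes that FI_PROVIDER="efa" stays set in the run without device RDMA, which the report does not state explicitly.

```python
import os

# Throughputs reported in this issue (samples/second).
THROUGHPUT_WITH_RDMA = 3511
THROUGHPUT_WITHOUT_RDMA = 1673


def efa_env(use_device_rdma, base_env=None):
    """Return a copy of the environment with the EFA settings from this issue.

    Hypothetical helper: assumes FI_PROVIDER stays "efa" in both runs.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FI_PROVIDER"] = "efa"
    if use_device_rdma:
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    else:
        # Drop the variable entirely to mirror the run "without" it.
        env.pop("FI_EFA_USE_DEVICE_RDMA", None)
    return env


def slowdown_fraction(fast, slow):
    """Fraction of throughput lost when going from `fast` to `slow`."""
    return (fast - slow) / fast


if __name__ == "__main__":
    for use_rdma in (True, False):
        # Start from an empty base so only the EFA variables are printed.
        print(use_rdma, efa_env(use_rdma, base_env={}))
    lost = slowdown_fraction(THROUGHPUT_WITH_RDMA, THROUGHPUT_WITHOUT_RDMA)
    print(f"Throughput lost without device RDMA: {lost:.1%}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training command. With the numbers above, disabling device RDMA costs a little over half of the throughput.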

This is the Dockerfile we used: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base