mlcommons / storage

MLPerf™ Storage Benchmark Suite
https://mlcommons.org/en/groups/research-storage/
Apache License 2.0

fork error when running on Kubernetes #44

Open uprush opened 6 months ago

uprush commented 6 months ago

Hi,

The benchmark failed with the following error when running on Kubernetes. I was able to work around it by setting the environment variable RDMAV_FORK_SAFE=0, but I am not sure whether this has any performance impact or causes other issues.

root@mlperf-storage:/mlperf/storage# ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
[INFO] 2023-12-12T07:16:13.865342 Running DLIO with 8 process(es) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:104]
[INFO] 2023-12-12T07:16:13.865599 Reading workload YAML config file '/mlperf/storage/storage-conf/workload/unet3d.yaml' [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:106]
[INFO] 2023-12-12T07:16:13.979505 Max steps per epoch: 146 = 1 * 4687 / 4 / 8 (samples per file * num files / batch size / comm size) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:274]
[INFO] 2023-12-12T07:16:13.979733 Starting epoch 1: 146 steps expected [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:129]
[INFO] 2023-12-12T07:16:13.980126 Prefetch size is 0; a default prefetch factor of 2 will be set to Torch DataLoader. [/mlperf/storage/dlio_benchmark/src/reader/torch_data_loader_reader.py:123]
[INFO] 2023-12-12T07:16:13.980436 Starting block 1 [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:195]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python3:1099 terminated with signal 6 at PC=7f9034457a7c SP=7ffcd0aa49c0.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f9034457a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f9034403476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90343e97f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f8ea631eb4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f90344abfb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f90344ab781]
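The log's closing advice points at an application-level alternative: avoid fork() when creating worker processes. In Python this maps to the "spawn" start method, which launches each worker as a fresh interpreter so no RDMA/libfabric state is inherited. A minimal illustration of the idea (not DLIO's actual code):

```python
import multiprocessing as mp

def square(x):
    # Stand-in for per-worker data-loading work.
    return x * x

def run_pool():
    # "spawn" starts each worker from a fresh interpreter instead of
    # fork()ing the parent, so the EFA provider's fork-safety check
    # is never triggered in the children.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, [1, 2, 3, 4])

if __name__ == "__main__":
    print(run_pool())  # prints [1, 4, 9, 16]
```

PyTorch's DataLoader exposes the same choice through its `multiprocessing_context` argument; whether DLIO can be configured to pass it through is a question for the maintainers.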
johnugeorge commented 4 months ago

Use RDMAV_FORK_SAFE=1 ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
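Since this thread is about Kubernetes, the variable could also be set in the Pod spec so it applies to every run without prefixing the command. A hypothetical fragment (container name and image are assumptions, not from this thread):

```yaml
# Hypothetical Pod spec fragment; adjust name/image to your deployment.
spec:
  containers:
    - name: mlperf-storage
      image: mlperf-storage:latest
      env:
        - name: RDMAV_FORK_SAFE
          value: "1"
```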