mlcommons / storage

MLPerf™ Storage Benchmark Suite
https://mlcommons.org/en/groups/research-storage/
Apache License 2.0

Benchmark run failing due to an assertion error related to the number of subfolders #41

Open rodrigonascimento opened 1 year ago

rodrigonascimento commented 1 year ago

While executing the benchmark.sh script to run the benchmark, I was getting the following error:

./benchmark.sh run --workload unet3d --num-accelerators 1 --results-dir unet3d_results --param dataset.data_folder=/raid/unet3d_data/ --param dataset.num_files_train=37500
[INFO] 2023-08-01T14:57:33.254442 Running DLIO with 1 process(es) [/root/MLPERF/dlio_benchmark/src/dlio_benchmark.py:104]
[INFO] 2023-08-01T14:57:33.254586 Reading workload YAML config file '/root/MLPERF/storage-conf/workload/unet3d.yaml' [/root/MLPERF/dlio_benchmark/src/dlio_benchmark.py:106]
Error executing job with overrides: ['workload=unet3d', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.data_folder=/raid/unet3d_data/', '++workload.dataset.num_files_train=37500', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/MLPERF/dlio_benchmark/src/dlio_benchmark.py", line 343, in main
    benchmark.initialize()
  File "/root/MLPERF/dlio_benchmark/src/dlio_benchmark.py", line 174, in initialize
    self.framework.init_reader(self.args.format, self.args.data_loader)
  File "/root/MLPERF/dlio_benchmark/src/framework/torch_framework.py", line 59, in init_reader
    self.reader_train = ReaderFactory.get_reader(format_type, data_loader=data_loader, dataset_type=DatasetType.TRAIN)
  File "/root/MLPERF/dlio_benchmark/src/reader/reader_factory.py", line 55, in get_reader
    return TorchDataLoaderReader(dataset_type)
  File "/root/MLPERF/dlio_benchmark/src/reader/torch_data_loader_reader.py", line 92, in __init__
    super().__init__(dataset_type)
  File "/root/MLPERF/dlio_benchmark/src/reader/reader_handler.py", line 83, in __init__
    assert(num_subfolders == len(filenames))
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

After debugging the issue, I found out I was missing the following parameter: --param dataset.num_subfolders_train=10

Should the dataset.num_subfolders_train parameter be set to 10 by default?
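For anyone hitting the same assertion, here is a minimal pre-flight sketch in Python that compares the subfolders on disk with the value passed to the run command. It is not part of the benchmark: count_subfolders and check_subfolders are hypothetical helper names, and it assumes the training files live under a train directory inside dataset.data_folder, which the traceback does not confirm.

```python
import os
import tempfile


def count_subfolders(train_dir):
    """Return the number of immediate subdirectories of train_dir."""
    with os.scandir(train_dir) as entries:
        return sum(1 for entry in entries if entry.is_dir())


def check_subfolders(train_dir, num_subfolders_train):
    """Hypothetical check: fail with a readable message on a mismatch."""
    found = count_subfolders(train_dir)
    if found != num_subfolders_train:
        raise ValueError(
            f"{train_dir} has {found} subfolder(s), but "
            f"dataset.num_subfolders_train={num_subfolders_train}; "
            f"try --param dataset.num_subfolders_train={found}"
        )
    return found


if __name__ == "__main__":
    # Demo on a throwaway directory; the layout is an assumption.
    with tempfile.TemporaryDirectory() as root:
        train_dir = os.path.join(root, "train")
        for i in range(10):
            os.makedirs(os.path.join(train_dir, str(i)))
        print(check_subfolders(train_dir, 10))
        try:
            check_subfolders(train_dir, 0)
        except ValueError as err:
            print(err)
```

Running a check like this before ./benchmark.sh run would turn the bare AssertionError into a message that names the parameter to set.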

johnugeorge commented 1 year ago

If you are using subfolders, you have to set the same number of subfolders in the 'run' command as you set in the 'datagen' command. If you are not using subfolders, you can omit the parameter from both commands.
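One way to keep the two commands from drifting apart is to build both from a single set of dataset parameters. The Python sketch below only assembles the command lines. param_flags and build_command are hypothetical helpers, the values are taken from the report above, and any datagen-specific options are deliberately left out because they are not shown in this thread.

```python
import shlex


def param_flags(params):
    """Render a dict as a flat list: ['--param', 'key=value', ...]."""
    flags = []
    for key, value in params.items():
        flags += ["--param", f"{key}={value}"]
    return flags


def build_command(subcommand, extra_args, params):
    """Assemble a benchmark.sh command line as a list of arguments."""
    return ["./benchmark.sh", subcommand, "--workload", "unet3d",
            *extra_args, *param_flags(params)]


# Dataset parameters shared by 'datagen' and 'run' (values from the report).
shared = {
    "dataset.data_folder": "/raid/unet3d_data/",
    "dataset.num_files_train": 37500,
    "dataset.num_subfolders_train": 10,
}

# datagen-specific options are omitted here; add them as your setup requires.
datagen_cmd = build_command("datagen", [], shared)
run_cmd = build_command(
    "run",
    ["--num-accelerators", "1", "--results-dir", "unet3d_results"],
    shared,
)

if __name__ == "__main__":
    print(shlex.join(datagen_cmd))
    print(shlex.join(run_cmd))
```

Because both commands read from the same dictionary, changing dataset.num_subfolders_train in one place changes it for datagen and run together.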

rodrigonascimento commented 1 year ago

Thanks, @johnugeorge! I got both commands (datagen and run) from the examples in the README.md. The datagen example specifies subfolders, but the run example doesn't.

ale-goncalves commented 8 months ago

Yes, @rodrigonascimento, I had the same problem. I had to manually add the subfolders --param (dataset.num_subfolders_train) to the run command, matching the datagen example, and then it worked.