mlcommons / cm4mlops

A collection of portable, reusable, and cross-platform automation recipes (CM scripts) that make it easier to build and benchmark AI systems across diverse models, datasets, software, and hardware.
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Some failures in MLPerf inference detected #410


gfursin commented 2 hours ago

Hi, I noticed that the latest cm4mlops fails on some MLPerf inference benchmarks.

See https://github.com/mlcommons/ck/actions/runs/11475369712/job/31933071063?pr=1338

I think there is an error in benchmark-program-mlperf:

 Found script::benchmark-program-mlperf,cfff0132a8aa4018 in /home/runner/CM/repos/mlcommons@cm4mlops/script/benchmark-program-mlperf
DEBUG:root:      Prepared variations: _no-power
/home/runner/CM/repos/mlcommons@cm4mlops/script/benchmark-program-mlperf/customize.py:24: SyntaxWarning: invalid escape sequence '\$'
  env['CM_MLPERF_RUN_CMD'] = "CM_MLPERF_RUN_COUNT=\$(cat \${CM_RUN_DIR}/count.txt); echo \${CM_MLPERF_RUN_COUNT};  CM_MLPERF_RUN_COUNT=\$((CM_MLPERF_RUN_COUNT+1));   echo \${CM_MLPERF_RUN_COUNT} > \${CM_RUN_DIR}/count.txt && if [ \${CM_MLPERF_RUN_COUNT} -eq \'1\' ]; then export CM_MLPERF_USER_CONF=\${CM_MLPERF_RANGING_USER_CONF}; else export CM_MLPERF_USER_CONF=\${CM_MLPERF_TESTING_USER_CONF}; fi && "+env.get('CM_RUN_CMD','').strip()
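As far as I can tell, this warning is only about Python's own string parsing: since Python 3.12, unrecognized escape sequences such as \$ raise a SyntaxWarning (and are slated to become errors in a future release). An untested, behavior-preserving sketch of a fix would be to mark the literal as a raw string, so the backslashes reach the shell unchanged; the old \'1\' escapes have to be written as a plain '1', since that is the value Python already produced from the previous literal:

    # Untested sketch: the same command string, but as raw strings so that
    # Python 3.12+ stops warning about the '\$' escapes (the backslashes are
    # meant for the shell, not for Python). The original \'1\' is written as
    # plain '1', which is what Python already emitted from the old literal.
    env['CM_MLPERF_RUN_CMD'] = (
        r"CM_MLPERF_RUN_COUNT=\$(cat \${CM_RUN_DIR}/count.txt); "
        r"echo \${CM_MLPERF_RUN_COUNT}; "
        r"CM_MLPERF_RUN_COUNT=\$((CM_MLPERF_RUN_COUNT+1)); "
        r"echo \${CM_MLPERF_RUN_COUNT} > \${CM_RUN_DIR}/count.txt && "
        r"if [ \${CM_MLPERF_RUN_COUNT} -eq '1' ]; then "
        r"export CM_MLPERF_USER_CONF=\${CM_MLPERF_RANGING_USER_CONF}; "
        r"else export CM_MLPERF_USER_CONF=\${CM_MLPERF_TESTING_USER_CONF}; "
        r"fi && ") + env.get('CM_RUN_CMD', '').strip()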

But maybe there is something else ...

Could you please take a look? I can then rerun the tests at mlcommons/ck ...

Thanks!!!

gfursin commented 2 hours ago

I also see these errors:

./run_local.sh onnxruntime resnet50 cpu --scenario Offline    --count 500 --mlperf_conf '/home/runner/CM/repos/local/cache/d577e1c466de445b/inference/mlperf.conf' --threads 4 --user_conf '/home/runner/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/abca6703444849f8bfed510cd87575a3.conf' --accuracy --use_preprocessed_dataset --cache_dir /home/runner/CM/repos/local/cache/410a243140aa4804 --dataset-list /home/runner/CM/repos/local/cache/7bc2622b4b32429e/val.txt 2>&1 ; echo $? > exitstatus | tee '/home/runner/CM/repos/local/cache/bbb8940adcb14207/test_results/default-reference-cpu-onnxruntime-v1.19.2-default_config/resnet50/offline/accuracy/console.out'
python3 python/main.py --profile resnet50-onnxruntime --model "/home/runner/CM/repos/local/cache/8ca7cab8a00d424a/resnet50_v1.onnx" --dataset-path /home/runner/CM/repos/local/cache/410a243140aa4804 --output "/home/runner/CM/repos/local/cache/bbb8940adcb14207/test_results/default-reference-cpu-onnxruntime-v1.19.2-default_config/resnet50/offline/accuracy" --scenario Offline --count 500 --mlperf_conf /home/runner/CM/repos/local/cache/d577e1c466de445b/inference/mlperf.conf --threads 4 --user_conf /home/runner/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/abca6703444849f8bfed510cd87575a3.conf --accuracy --use_preprocessed_dataset --cache_dir /home/runner/CM/repos/local/cache/410a243140aa4804 --dataset-list /home/runner/CM/repos/local/cache/7bc2622b4b32429e/val.txt
usage: main.py [-h]
               [--dataset {imagenet,imagenet_mobilenet,imagenet_tflite_tpu,imagenet_pytorch,coco-300,coco-300-pt,openimages-300-retinanet,openimages-800-retinanet,openimages-1200-retinanet,openimages-800-retinanet-onnx,coco-1200,coco-1200-onnx,coco-1200-pt,coco-1200-tf}]
               --dataset-path DATASET_PATH [--dataset-list DATASET_LIST]
               [--data-format {NCHW,NHWC}]
               [--profile {defaults,resnet50-tf,resnet50-pytorch,resnet50-onnxruntime,resnet50-ncnn,resnet50-tflite,mobilenet-tf,mobilenet-onnxruntime,ssd-mobilenet-tf,ssd-mobilenet-pytorch,ssd-mobilenet-onnxruntime,ssd-resnet34-tf,ssd-resnet34-pytorch,ssd-resnet34-onnxruntime,ssd-resnet34-onnxruntime-tf,retinanet-pytorch,retinanet-onnxruntime}]
               [--scenario SCENARIO] [--max-batchsize MAX_BATCHSIZE] --model
               MODEL [--output OUTPUT] [--inputs INPUTS] [--outputs OUTPUTS]
               [--backend BACKEND] [--device DEVICE] [--model-name MODEL_NAME]
               [--threads THREADS] [--qps QPS] [--cache CACHE]
               [--cache_dir CACHE_DIR] [--preprocessed_dir PREPROCESSED_DIR]
               [--use_preprocessed_dataset] [--accuracy]
               [--find-peak-performance] [--debug] [--user_conf USER_CONF]
               [--audit_conf AUDIT_CONF] [--time TIME] [--count COUNT]
               [--performance-sample-count PERFORMANCE_SAMPLE_COUNT]
               [--max-latency MAX_LATENCY]
               [--samples-per-query SAMPLES_PER_QUERY]
main.py: error: unrecognized arguments: --mlperf_conf /home/runner/CM/repos/local/cache/d577e1c466de445b/inference/mlperf.conf
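
Note that the usage text above no longer lists --mlperf_conf, which matches the error. If the flag has indeed been dropped from the reference app, the same invocation without it would look like the untested sketch below (all other arguments copied unchanged from the log); whether the mlperf.conf settings now need to be supplied some other way is the open question:

    # Untested sketch: the failing command with the unrecognized
    # --mlperf_conf flag removed; every other argument is unchanged.
    python3 python/main.py --profile resnet50-onnxruntime \
        --model "/home/runner/CM/repos/local/cache/8ca7cab8a00d424a/resnet50_v1.onnx" \
        --dataset-path /home/runner/CM/repos/local/cache/410a243140aa4804 \
        --output "/home/runner/CM/repos/local/cache/bbb8940adcb14207/test_results/default-reference-cpu-onnxruntime-v1.19.2-default_config/resnet50/offline/accuracy" \
        --scenario Offline --count 500 --threads 4 \
        --user_conf /home/runner/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/abca6703444849f8bfed510cd87575a3.conf \
        --accuracy --use_preprocessed_dataset \
        --cache_dir /home/runner/CM/repos/local/cache/410a243140aa4804 \
        --dataset-list /home/runner/CM/repos/local/cache/7bc2622b4b32429e/val.txt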

Hope this helps ... Thanks!

arjunsuresh commented 2 hours ago

Hi @gfursin, the test is run from the main branch of the cm4mlops repository, not from the mlperf-inference branch. It is failing due to a recent change in loadgen; the mlperf-inference branch should be fine.