Running dlrm cpu inference ends up using resnet50

mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware

https://access.cKnowledge.org/challenges

Apache License 2.0


Running dlrm cpu inference ends up using resnet50 #1214

Open yuyantingzero opened 4 months ago

yuyantingzero commented 4 months ago

I ran cm run script "app mlperf reference inference _dlrm _cpu" --env.CM_RERUN.

From the attached log file, I can see that the dlrm source is pulled (cm run script "get dlrm src"), but when the benchmark actually runs, the command becomes python3 python/main.py --profile resnet50-onnxruntime --mlperf_conf ../../mlperf.conf --model "/home/perfkit/CM/repos/local/cache/0f978e30da8a423a/resnet50_v1.onnx" ...
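To confirm which profile a generated command actually selects, the value after --profile can be pulled out of the logged command line. This is only a minimal sketch: `extract_profile` is a hypothetical helper, not part of CM, and the command string below is a shortened copy of the one from the log (the model path and the elided tail are left out).

```python
import shlex


def extract_profile(command: str):
    """Return the value passed to --profile in a command line, or None."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        # Handle the "--profile VALUE" form.
        if tok == "--profile" and i + 1 < len(tokens):
            return tokens[i + 1]
        # Handle the "--profile=VALUE" form.
        if tok.startswith("--profile="):
            return tok.split("=", 1)[1]
    return None


# Shortened copy of the command seen in the log (assumption: the rest is irrelevant here).
cmd = "python3 python/main.py --profile resnet50-onnxruntime --mlperf_conf ../../mlperf.conf"
profile = extract_profile(cmd)
print("profile used:", profile)
```

If the printed profile does not mention dlrm, the run did not pick up the requested model variation.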

Any help is appreciated, thanks!

dlrm.txt

arjunsuresh commented 4 months ago

Can you please share the link to the README you are following? DLRM has been replaced by dlrm_v2 in MLPerf Inference, and only dlrm_v2 is supported in CM. For dlrm_v2, the Criteo dataset has to be downloaded manually.

yuyantingzero commented 4 months ago

This is the repo/script I am looking at.

Upon checking, I did use the wrong model name; changing it to _dlrm_99 gets me a little bit further. But now I am running into:

Traceback (most recent call last):
  File "/home/perfkit/CM/repos/local/cache/ae4e0392ff6d45b1/inference/recommendation/dlrm_v2/pytorch/python/main.py", line 20, in <module>
    import mlperf_loadgen as lg
ModuleNotFoundError: No module named 'mlperf_loadgen'
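A quick way to check whether the module is visible to a given interpreter is sketched below. It assumes it is run with the same python3 that launches main.py; `module_available` is a hypothetical helper name, and the check only reports whether the import can be resolved, not why it is missing.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Print the interpreter path so it can be compared with the one used by the benchmark.
print("interpreter:", sys.executable)
print("mlperf_loadgen importable:", module_available("mlperf_loadgen"))
```

If this prints False, the package is not installed in that environment (or a different Python is being used than the one it was installed into).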

For downloading the data, is there a page I can reference? I tried cm run script --tags=get,ml-model,dlrm,raw,terabyte,criteo-terabyte,criteo,recommendation from get-ml-model-dlrm-terabyte, which downloads a 90 GB file, tb00_40M.onnx.tar.