mlcommons / inference_results_v3.0

This repository contains the results and code for the MLPerf™ Inference v3.0 benchmark.
https://mlcommons.org/en/inference-datacenter-30/
Apache License 2.0

Check failed: mSampleStartIdxs.back() == mNumIndividualPairs #15

Open jarrettbranch opened 1 year ago

jarrettbranch commented 1 year ago

I'm following the NVIDIA steps to replicate the DLRM results. Shortly after kicking off the test, the harness fails with this error:

I0824 22:32:44.635635  1620 main_dlrm.cc:150] Found 1 GPUs
I0824 22:32:44.637660  1620 main_dlrm.cc:194] Loaded 330067 sample partitions. (1320272) bytes.
F0824 22:32:45.557823  1620 dlrm_qsl.hpp:38] Check failed: mSampleStartIdxs.back() == mNumIndividualPairs (89137319 vs. 128000)
*** Check failure stack trace: ***
    @     0x7f44031e8f00  google::LogMessage::Fail()
    @     0x7f44031e8e3b  google::LogMessage::SendToLog()
    @     0x7f44031e876c  google::LogMessage::Flush()
    @     0x7f44031ebd7a  google::LogMessageFatal::~LogMessageFatal()
    @     0x559aa15ae7c8  DLRMSampleLibrary::DLRMSampleLibrary()
    @     0x559aa158acf0  main
    @     0x7f4402c71083  __libc_start_main
    @     0x559aa158b6de  _start
    @              (nil)  (unknown)
Aborted (core dumped)
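
For anyone reading the fatal check above: it compares the last entry of the sample start offsets against the number of individual pairs the harness counted (89137319 vs. 128000). The sketch below restates that invariant in Python using only the numbers from the log. The function name `check_sample_partition` and the two-entry offset table are hypothetical, and the reading of the byte count as 4-byte offsets is an assumption drawn from the arithmetic, not from the harness source. The log alone does not show which of the two values is the wrong one.

```python
def check_sample_partition(start_idxs, num_pairs):
    """Restate the failed invariant: the last start offset must equal the pair count.

    `start_idxs` is a list of cumulative start offsets (hypothetical stand-in
    for mSampleStartIdxs); `num_pairs` stands in for mNumIndividualPairs.
    """
    if not start_idxs:
        raise ValueError("empty partition table")
    last = start_idxs[-1]
    if last != num_pairs:
        return f"mismatch: partition table ends at {last}, loaded data has {num_pairs} pairs"
    return "ok"


# Byte count from the log: 330067 partitions reported, 1320272 bytes loaded.
# Assumption: one 4-byte offset per partition plus one terminating total.
assert (330067 + 1) * 4 == 1320272

# Only the final offset matters for the check, so a two-entry table is enough
# to reproduce the comparison reported in the log.
print(check_sample_partition([0, 89137319], 128000))
```

If the two inputs come from different dataset preparations (for example a full-size partition file paired with a much smaller input file), this is the kind of mismatch the check would report, though that is a guess from the numbers rather than something the log confirms.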
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/code/main.py", line 233, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "/work/code/main.py", line 146, in main
    dispatch_action(main_args, config_dict, workload_setting)
  File "/work/code/main.py", line 204, in dispatch_action
    handler.run()
  File "/work/code/actionhandler/base.py", line 82, in run
    self.handle_failure()
  File "/work/code/actionhandler/run_harness.py", line 244, in handle_failure
    raise RuntimeError("Run harness failed!")
RuntimeError: Run harness failed!
Traceback (most recent call last):
  File "/work/code/actionhandler/run_harness.py", line 215, in handle
    result_data = self.harness.run_harness(flag_dict=self.harness_flag_dict, skip_generate_measurements=True)
  File "/work/code/common/harness.py", line 326, in run_harness
    output = run_command(cmd, get_output=True, custom_env=self.env_vars)
  File "/work/code/common/__init__.py", line 65, in run_command
    raise subprocess.CalledProcessError(ret, cmd)