mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

DLRM failed to run #13

Closed: quic-nmorillo closed this issue 1 year ago

quic-nmorillo commented 2 years ago

Hi,

I followed the instructions here to run the DLRM workload: https://github.com/mlcommons/inference_results_v2.0/tree/master/closed/NVIDIA

I was able to build an engine: make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=offline --test_mode=PerformanceOnly --verbose --config_ver=high_accuracy"

Layer(PluginV2): (Unnamed Layer* 0) [PluginV2DynamicExt], Tactic: 0, numerical_input[Int8(-4,13,1,1)] -> bot_l.4.relu.output[Int8(-4,128,1,1)]
Layer(PluginV2): interaction_plugin, Tactic: 0, bot_l.4.relu.output[Int8(-4,128,1,1)], index_input[Int32(-2,26)] -> interaction_output_concat_output[Int8(-4,512,1,1)]
Layer(PluginV2): (Unnamed Layer* 3) [PluginV2DynamicExt], Tactic: 0, interaction_output_concat_output[Int8(-4,512,1,1)] -> top_l.0.relu.output[Int8(-4,1024,1,1)]
Layer(PluginV2): (Unnamed Layer* 4) [PluginV2DynamicExt], Tactic: 0, top_l.0.relu.output[Int8(-4,1024,1,1)] -> top_l.2.relu.output[Int8(-4,1024,1,1)]
Layer(PluginV2): (Unnamed Layer* 5) [PluginV2DynamicExt], Tactic: 0, top_l.2.relu.output[Int8(-4,1024,1,1)] -> top_l.4.relu.output[Int8(-4,512,1,1)]
Layer(PluginV2): (Unnamed Layer* 6) [PluginV2DynamicExt], Tactic: 0, top_l.4.relu.output[Int8(-4,512,1,1)] -> top_l.6.relu.output[Int8(-4,256,1,1)]
Layer(CaskConvolution): top_l.8, Tactic: -5287526117701381005, top_l.6.relu.output[Int8(-4,256,1,1)] -> top_l.8.output[Float(-4,1,1,1)]
Layer(PointWiseV2): PWN(sigmoid), Tactic: 1, top_l.8.output[Float(-4,1,1,1)] -> sigmoid_output[Float(-4,1,1,1)]
[08/25/2022-22:22:05] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[2022-08-25 22:22:05,453 main.py:137 INFO] Finished building engines for dlrm benchmark in Offline scenario.
Loaded DLRM interactions plugin from build/plugins/DLRMInteractionsPlugin/libdlrminteractionsplugin.so for dlrm
Replacing top_l.0 with Small-Tile GEMM Plugin, with fairshare cache size 18
Replacing top_l.2 with Small-Tile GEMM Plugin, with fairshare cache size 18
Replacing top_l.4 with Small-Tile GEMM Plugin, with fairshare cache size 18
Replacing top_l.6 with Small-Tile GEMM Plugin, with fairshare cache size 18
Time taken to generate engines: 78.00617170333862 seconds

But I'm running into errors when trying to run inference: ubuntu@mlperf-inference-ubuntu-x86_64:/work$ make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=offline --test_mode=PerformanceOnly --fast --verbose"

[2022-08-25 22:23:29,994 main.py:770 INFO] Detected System ID: KnownSystem.A100_PCIe_40GBx1
[2022-08-25 22:23:31,193 main.py:249 INFO] Running harness for dlrm benchmark in Offline scenario...
[2022-08-25 22:23:31,216 harness.py:222 INFO] Updated LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
[2022-08-25 22:23:31,217 __init__.py:43 INFO] Running command: ./build/bin/harness_dlrm --plugins="build/plugins/DLRMInteractionsPlugin/libdlrminteractionsplugin.so" --logfile_outdir="/work/build/logs/2022.08.25-22.23.12/A100-PCIex1_TRT/dlrm-99.9/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=204800 --test_mode="PerformanceOnly" --gpu_batch_size=315000 --tensor_path="build/preprocessed_data/criteo/full_recalib/numeric_int8_chw4.npy,build/preprocessed_data/criteo/full_recalib/categorical_int32.npy" --use_graphs=false --gpu_copy_streams=1 --complete_threads=1 --sample_partition_path="build/preprocessed_data/criteo/full_recalib/sample_partition.npy" --gpu_inference_streams=1 --num_staging_threads=8 --num_staging_batches=8 --max_pairs_per_staging_thread=262100 --gpu_num_bundles=2 --check_contiguity=true --gpu_engines="./build/engines/A100-PCIex1/dlrm/Offline/dlrm-Offline-gpu-b315000-int8.custom_k_99_9_MaxP.plan" --mlperf_conf_path="measurements/A100-PCIex1_TRT/dlrm-99.9/Offline/mlperf.conf" --user_conf_path="measurements/A100-PCIex1_TRT/dlrm-99.9/Offline/user.conf" --scenario Offline --model dlrm
[2022-08-25 22:23:31,217 __init__.py:50 INFO] Overriding Environment
benchmark : Benchmark.DLRM
check_contiguity : True
coalesced_tensor : True
complete_threads : 1
deque_timeout_usec : 1
enable_interleaved_top_mlp : False
gemm_plugin_fairshare_cache_size : 18
gpu_batch_size : 315000
gpu_copy_streams : 1
gpu_inference_streams : 1
gpu_num_bundles : 2
input_dtype : int8
input_format : chw4
max_pairs_per_staging_thread : 262100
num_staging_batches : 8
num_staging_threads : 8
offline_expected_qps : 270000
output_padding_granularity : 128
precision : int8
sample_partition_path : build/preprocessed_data/criteo/full_recalib/sample_partition.npy
scenario : Scenario.Offline
system : SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name='AMD EPYC 7643 48-Core Processor', architecture=<CPUArchitecture.x86_64: AliasedName(name='x86_64', aliases=(), patterns=())>, core_count=48, threads_per_core=1): 2}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=528.29164, byte_suffix=<ByteSuffix.GB: (1000, 3)>, _num_bytes=528291640000), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout=defaultdict(<class 'int'>, {GPU(name='NVIDIA A100-PCIE-40GB', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=40.0, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=42949672960), max_power_limit=250.0, pci_id='0x20F110DE', compute_sm=80): 1})), numa_conf=NUMAConfiguration(numa_nodes={1: NUMANode(index=1, cpus=[Interval(start=48, end=95)], gpus=[0])}, num_numa_nodes=2), system_id='A100-PCIex1')
tensor_path : build/preprocessed_data/criteo/full_recalib/numeric_int8_chw4.npy,build/preprocessed_data/criteo/full_recalib/categorical_int32.npy
use_graphs : False
use_jemalloc : True
use_small_tile_gemm_plugin : True
config_name : A100-PCIex1_dlrm_Offline
config_ver : custom_k_99_9_MaxP
accuracy_level : 99.9%
optimization_level : plugin-enabled
inference_server : custom
system_id : A100-PCIex1
use_cpu : False
use_inferentia : False
power_limit : None
cpu_freq : None
test_mode : PerformanceOnly
openvino_version : f2f281e6
log_dir : /work/build/logs/2022.08.25-22.23.12
&&&& RUNNING DLRM_HARNESS # ./build/bin/harness_dlrm
I0825 22:23:31.302377  1377 main_dlrm.cc:138] Found 1 GPUs
I0825 22:23:31.304461  1377 main_dlrm.cc:181] Loaded 330067 sample partitions. (1320272) bytes.
F0825 22:23:32.742658  1377 dlrm_qsl.hpp:38] Check failed: mSampleStartIdxs.back() == mNumIndividualPairs (89137319 vs. 13760492) 
*** Check failure stack trace: ***
    @     0x7fea0c18cf00  google::LogMessage::Fail()
    @     0x7fea0c18ce3b  google::LogMessage::SendToLog()
    @     0x7fea0c18c76c  google::LogMessage::Flush()
    @     0x7fea0c18fd7a  google::LogMessageFatal::~LogMessageFatal()
    @     0x562ec3c02f9d  DLRMSampleLibrary::DLRMSampleLibrary()
    @     0x562ec3bdfe28  main
    @     0x7fea0bc15083  __libc_start_main
    @     0x562ec3be078e  _start
    @              (nil)  (unknown)
Aborted (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 303, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 264, in run_harness
    output = run_command(cmd, get_output=True, custom_env=self.env_vars)
  File "/work/code/common/__init__.py", line 64, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_dlrm --plugins="build/plugins/DLRMInteractionsPlugin/libdlrminteractionsplugin.so" --logfile_outdir="/work/build/logs/2022.08.25-22.23.12/A100-PCIex1_TRT/dlrm-99.9/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=204800 --test_mode="PerformanceOnly" --gpu_batch_size=315000 --tensor_path="build/preprocessed_data/criteo/full_recalib/numeric_int8_chw4.npy,build/preprocessed_data/criteo/full_recalib/categorical_int32.npy" --use_graphs=false --gpu_copy_streams=1 --complete_threads=1 --sample_partition_path="build/preprocessed_data/criteo/full_recalib/sample_partition.npy" --gpu_inference_streams=1 --num_staging_threads=8 --num_staging_batches=8 --max_pairs_per_staging_thread=262100 --gpu_num_bundles=2 --check_contiguity=true --gpu_engines="./build/engines/A100-PCIex1/dlrm/Offline/dlrm-Offline-gpu-b315000-int8.custom_k_99_9_MaxP.plan" --mlperf_conf_path="measurements/A100-PCIex1_TRT/dlrm-99.9/Offline/mlperf.conf" --user_conf_path="measurements/A100-PCIex1_TRT/dlrm-99.9/Offline/user.conf" --scenario Offline --model dlrm' returned non-zero exit status 134.
Traceback (most recent call last):
  File "code/main.py", line 772, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "code/main.py", line 744, in main
    dispatch_action(main_args, config_dict, workload_id, equiv_engine_setting=equiv_engine_setting)
  File "code/main.py", line 574, in dispatch_action
    handle_run_harness(benchmark_conf, need_gpu, need_dla, profile, power)
  File "code/main.py", line 312, in handle_run_harness
    raise RuntimeError("Run harness failed!")
RuntimeError: Run harness failed!
make: *** [Makefile:699: run_harness] Error 1

Do you know what my issue could be?

Thanks in advance
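
For reference, the fatal check at dlrm_qsl.hpp:38 compares the last entry of the sample partition (mSampleStartIdxs.back(), loaded from sample_partition.npy) against the total number of individual pairs found in the preprocessed tensors (mNumIndividualPairs). In the log above they disagree (89137319 vs. 13760492), which points to incomplete or stale preprocessed Criteo data rather than a harness bug. Below is a minimal sketch of that same consistency check, assuming the preprocessed files are ordinary NumPy arrays with one row per individual pair (paths taken from the harness command above):

import numpy as np

# Paths copied from the harness command line; adjust if your layout differs.
base = "build/preprocessed_data/criteo/full_recalib"

# Assumption: sample_partition.npy stores cumulative sample start indices and
# categorical_int32.npy stores one row per individual pair.
partition = np.load(f"{base}/sample_partition.npy")
categorical = np.load(f"{base}/categorical_int32.npy", mmap_mode="r")

print("last sample start index :", int(partition[-1]))
print("pairs in categorical data:", categorical.shape[0])

if int(partition[-1]) != categorical.shape[0]:
    print("Mismatch: preprocessed data looks truncated or stale; redo preprocessing.")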

nv-ananjappa commented 1 year ago

Since the v2.1 results and code are already available, could you check whether you still see this issue with the v2.1 code? https://github.com/mlcommons/inference_results_v2.1

quic-nmorillo commented 1 year ago

@nv-ananjappa I'm rerunning the DLRM pre-processing step. I think the previously preprocessed DLRM data was somehow corrupted. I'll keep you updated. Thanks for responding!

quic-nmorillo commented 1 year ago

After redoing the pre-processing, I'm able to reproduce the DLRM results on my system. Closing the issue!