mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Getting a lower than expected accuracy with DLRMv2 using Nvidia code for v3.1 and v4.0 #1619

Open arjunsuresh opened 6 months ago

arjunsuresh commented 6 months ago

While running the Nvidia code for DLRMv2 on a 4090 GPU with batch size 1400, we are seeing the accuracy below, which is lower than expected. Can someone tell us whether we are missing something? We have tried 2 systems (both with a 4090), using the original as well as modified Nvidia docker containers, and we get the same result everywhere. We will try to run on a different GPU.

Assuming loadgen accuracy log does not contain ground truth labels.
Parsing loadgen accuracy log...
Parsing aggregation trace file...
Parsing ground truth labels from day_23 file...
Re-ordering ground truth labels...
Calculating AUC metric...
AUC=62.297%, accuracy=96.586%, good=86094226, total=89137319, queries=330067
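
The two numbers in the last log line measure different things. The accuracy is simply good/total = 86094226/89137319, while AUC measures how well positive samples are ranked above negative ones, so a high accuracy can coexist with a near-random AUC. A minimal pure-Python sketch of this, using made-up toy labels and scores (not benchmark data) and a hypothetical 0.5 decision threshold:

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs where the
    positive is scored higher; tied pairs count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score equals the label.
    The 0.5 threshold is an assumption made for this toy example."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy data (invented): 97 negatives, 3 positives, and a "model" that gives
# every sample the same low score, i.e. it carries no ranking information.
labels = [0] * 97 + [1] * 3
scores = [0.10] * 100

toy_accuracy = accuracy(labels, scores)  # every negative is "correct"
toy_auc = roc_auc(labels, scores)        # every pair is a tie

# The accuracy figure reported in the log, recomputed from good/total.
reported_accuracy = 86094226 / 89137319

print(f"toy accuracy={toy_accuracy:.3f}, toy AUC={toy_auc:.3f}")
print(f"reported accuracy={reported_accuracy * 100:.3f}%")
```

This only illustrates why the accuracy value alone says little here; it does not explain the cause of the low AUC.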

Full console log is here:

CMD: make run_harness RUN_ARGS=' --benchmarks=dlrm-v2 --scenarios=offline  --test_mode=AccuracyOnly  --user_conf_path=/home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/b774d577de3642b6b4aeebabc87f8b35.conf --mlperf_conf_path=/home/cmuser/CM/repos/local/cache/b49e297734004b90/inference/mlperf.conf --gpu_batch_size=1400 --embedding_weights_on_gpu_part=0.30 --no_audit_verify  '

       ! cd /home/cmuser
       ! call tmp-run.sh from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/benchmark-program/run-ubuntu.sh
[2024-02-09 01:31:33,829 main.py:230 INFO] Detected system ID: KnownSystem.phoenix
[2024-02-09 01:31:35,168 harness.py:85 INFO] Found coalesced sparse input file.
[2024-02-09 01:31:35,168 harness.py:110 INFO] Found sample partition file.
[2024-02-09 01:31:35,168 harness.py:238 INFO] The harness will load 1 plugins: ['build/plugins/DLRMv2EmbeddingLookupPlugin/libdlrmv2embeddinglookupplugin.so']
[2024-02-09 01:31:35,169 generate_conf_files.py:107 INFO] Generated measurements/ entries for phoenix_TRT/dlrm-v2-99/Offline
[2024-02-09 01:31:35,169 __init__.py:46 INFO] Running command: ./build/bin/harness_dlrm_v2 --plugins="build/plugins/DLRMv2EmbeddingLookupPlugin/libdlrmv2embeddinglookupplugin.so" --logfile_outdir="/home/cmuser/valid_results/4f8759aa8c48-nvidia_original-gpu-tensorrt-vdefault-default_config/dlrm-v2-99/offline/accuracy" --logfile_prefix="mlperf_log_" --performance_sample_count=204800 --test_mode="AccuracyOnly" --gpu_batch_size=1400 --mlperf_conf_path="/home/cmuser/CM/repos/local/cache/b49e297734004b90/inference/mlperf.conf" --tensor_path="/home/mlperf_inf_dlrmv2/criteo/day23/fp32/day_23_dense.npy,/home/mlperf_inf_dlrmv2/criteo/day23/fp32/day_23_sparse_concatenated.npy" --use_graphs=false --user_conf_path="/home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/b774d577de3642b6b4aeebabc87f8b35.conf" --gpu_copy_streams=1 --sample_partition_path="/home/mlperf_inf_dlrmv2/criteo/day23/sample_partition.npy" --gpu_inference_streams=1 --gpu_num_bundles=2 --check_contiguity=true --gpu_engines="./build/engines/phoenix/dlrm-v2/Offline/dlrm-v2-Offline-gpu-b1400-fp16.custom_k_99_MaxP.plan" --scenario Offline --model dlrm-v2
[2024-02-09 01:31:35,169 __init__.py:53 INFO] Overriding Environment
libnvrtc.so.11.2: cannot open shared object file: No such file or directory
benchmark : Benchmark.DLRMv2
buffer_manager_thread_count : 0
check_contiguity : True
coalesced_tensor : True
data_dir : /home/cmuser/local/cache/6057f3cefd9041b3/data
embedding_weights_on_gpu_part : 0.3
gpu_batch_size : 1400
gpu_copy_streams : 1
gpu_inference_streams : 1
gpu_num_bundles : 2
input_dtype : fp32
input_format : linear
log_dir : /home/cmuser/CM/repos/local/cache/2edf9ab842284e5b/repo/closed/NVIDIA/build/logs/2024.02.09-01.31.33
mega_table_npy_file : /home/mlperf_inf_dlrmv2/model/embedding_weights/mega_table_fp16.npy
mlperf_conf_path : /home/cmuser/CM/repos/local/cache/b49e297734004b90/inference/mlperf.conf
model_path : /home/mlperf_inf_dlrmv2/model/model_weights
offline_expected_qps : 0.0
precision : fp16
preprocessed_data_dir : /home/cmuser/local/cache/6057f3cefd9041b3/preprocessed_data
reduced_precision_io : True
sample_partition_path : /home/mlperf_inf_dlrmv2/criteo/day23/sample_partition.npy
scenario : Scenario.Offline
system : SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name='AMD Ryzen 9 7950X 16-Core Processor', architecture=<CPUArchitecture.x86_64: AliasedName(name='x86_64', aliases=(), patterns=())>, core_count=16, threads_per_core=2): 1}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=131.07376399999998, byte_suffix=<ByteSuffix.GB: (1000, 3)>, _num_bytes=131073764000), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout=defaultdict(<class 'int'>, {GPU(name='NVIDIA GeForce RTX 4090', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=23.98828125, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=25757220864), max_power_limit=450.0, pci_id='0x268410DE', compute_sm=89): 1})), numa_conf=None, system_id='phoenix')
tensor_path : /home/mlperf_inf_dlrmv2/criteo/day23/fp32/day_23_dense.npy,/home/mlperf_inf_dlrmv2/criteo/day23/fp32/day_23_sparse_concatenated.npy
test_mode : AccuracyOnly
use_graphs : False
user_conf_path : /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/b774d577de3642b6b4aeebabc87f8b35.conf
system_id : phoenix
config_name : phoenix_dlrm-v2_Offline
workload_setting : WorkloadSetting(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxP)
optimization_level : plugin-enabled
use_cpu : False
use_inferentia : False
num_profiles : 1
config_ver : custom_k_99_MaxP
accuracy_level : 99%
inference_server : custom
skip_file_checks : False
power_limit : None
cpu_freq : None
&&&& RUNNING DLRMv2_HARNESS # ./build/bin/harness_dlrm_v2
I0209 01:31:35.219956  4475 main_dlrm_v2.cpp:146] Found 1 GPUs
I0209 01:31:35.226480  4475 main_dlrm_v2.cpp:190] Loaded 330067 sample partitions. (1320272) bytes.
I0209 01:32:01.832041  4475 dlrm_v2_qsl.h:47] PerformanceSampleCount: 204800
I0209 01:32:01.832060  4475 dlrm_v2_qsl.h:48] TotalSampleCount: 330067 (89137319 pairs).
I0209 01:32:01.832067  4475 dlrm_v2_server.cpp:342] Using 1 DLRMv2 Core(s) per Device
I0209 01:32:01.832258  4479 dlrm_v2_server.cpp:747] Deserializing Engine on GPU#0
[I] [TRT] Loaded engine size: 31 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +7, GPU +10, now: CPU 70, GPU 995 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 71, GPU 1005 (MiB)
Starting plugin init...
Loading embedding weights...
Completed loading embedding weights...
Completed plugin init
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +30, now: CPU 0, GPU 30 (MiB)
I0209 01:37:59.007146  4479 dlrm_v2_server.cpp:754] Engine - Device Memory requirements: 38707200
I0209 01:37:59.007882  4479 dlrm_v2_server.cpp:755] Engine - Number of Optimization Profiles: 1
[E] [TRT] 3: [runtime.cpp::~Runtime::399] Error Code 3: API Usage Error (Parameter check failed at: runtime/rt/runtime.cpp::~Runtime::399, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.
)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 40, GPU 16801 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 40, GPU 16809 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +37, now: CPU 0, GPU 67 (MiB)
I0209 01:37:59.029897  4479 dlrm_v2_server.cpp:80] Setting profile = 0
I0209 01:37:59.029909  4479 dlrm_v2_server.cpp:83] Context creation complete
I0209 01:37:59.030673  4479 dlrm_v2_server.cpp:95] Created streams
I0209 01:37:59.030675  4479 dlrm_v2_server.cpp:102] Profile - Numeric Input Volume: 13
I0209 01:37:59.030676  4479 dlrm_v2_server.cpp:104] Profile - Categorical Input Volume: 214
I0209 01:37:59.030678  4479 dlrm_v2_server.cpp:106] Profile - Output Volume: 1
I0209 01:37:59.030993  4479 dlrm_v2_server.cpp:121] Created copy streams and buffers
I0209 01:37:59.030997  4479 dlrm_v2_server.cpp:122] Setup complete
I0209 01:37:59.031443  4479 dlrm_v2_server.cpp:292] Running warmup for 1s.
I0209 01:38:00.031769  4479 dlrm_v2_server.cpp:304] Warmup complete, ran for 1.00011s.
I0209 01:38:00.038967  4475 batch_maker.cpp:189] Contiguity-Aware H2H : ON
I0209 01:38:00.042248  4475 main_dlrm_v2.cpp:275] Starting running actual test.
I0209 01:38:00.049577  4475 dlrm_v2_qsl.h:230] Calling LoadSamplesToRam() for QSL ensemble...
I0209 01:38:00.049583  4475 dlrm_v2_qsl.h:70] Calling LoadSamplesToRam() for QSL[0] of 204800 samples...
I0209 01:53:32.917402  4475 dlrm_v2_qsl.h:142] Completed LoadSamplesToRam() for QSL[0]
I0209 01:53:33.258143  4475 dlrm_v2_qsl.h:235] Completed LoadSamplesToRam() for QSL ensemble.
I0209 01:55:45.730984  4475 dlrm_v2_qsl.h:239] Calling UnloadSamplesFromRam() for QSL ensemble...
I0209 01:55:45.731014  4475 dlrm_v2_qsl.h:147] Calling UnloadSamplesFromRam() for QSL[0] of 204800 samples...
I0209 01:55:45.731020  4475 dlrm_v2_qsl.h:152] Completed UnloadSamplesFromRam() for QSL[0]
I0209 01:55:45.731021  4475 dlrm_v2_qsl.h:244] Completed UnloadSamplesFromRam() for QSL ensemble.
I0209 01:55:45.731024  4475 dlrm_v2_qsl.h:230] Calling LoadSamplesToRam() for QSL ensemble...
I0209 01:55:45.731024  4475 dlrm_v2_qsl.h:70] Calling LoadSamplesToRam() for QSL[0] of 125267 samples...
I0209 02:04:25.198076  4475 dlrm_v2_qsl.h:142] Completed LoadSamplesToRam() for QSL[0]
I0209 02:04:25.351019  4475 dlrm_v2_qsl.h:235] Completed LoadSamplesToRam() for QSL ensemble.
I0209 02:05:46.487170  4475 dlrm_v2_qsl.h:239] Calling UnloadSamplesFromRam() for QSL ensemble...
I0209 02:05:46.487195  4475 dlrm_v2_qsl.h:147] Calling UnloadSamplesFromRam() for QSL[0] of 125267 samples...
I0209 02:05:46.487200  4475 dlrm_v2_qsl.h:152] Completed UnloadSamplesFromRam() for QSL[0]
I0209 02:05:46.487200  4475 dlrm_v2_qsl.h:244] Completed UnloadSamplesFromRam() for QSL ensemble.
I0209 02:05:46.502019  4475 main_dlrm_v2.cpp:280] Finished running actual test.
I0209 02:05:46.502751  4490 batch_maker.cpp:320] GetBatch Done

No warnings encountered during test.

No errors encountered during test.

The preprocessed dataset looks fine, as its checksum matches the required value:

arjun@phoenix-Amd-Am5:/dlrm/criteo_23$ md5sum day_23_sparse_multi_hot.npz
c46b7e31ec6f2f8768fa60bdfc0f6e40  day_23_sparse_multi_hot.npz
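
The same check can be scripted. Below is a minimal sketch that hashes a file in chunks (the file is large, so it is not read into memory at once) and compares the result with the value shown above. The function names are hypothetical, and treating the quoted hash as the required one is an assumption taken from this comment.

```python
import hashlib

# MD5 shown above for day_23_sparse_multi_hot.npz.
EXPECTED_MD5 = "c46b7e31ec6f2f8768fa60bdfc0f6e40"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5=EXPECTED_MD5):
    """True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example usage (path as in the shell session above):
# print(checksum_matches("day_23_sparse_multi_hot.npz"))
```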
arjunsuresh commented 6 months ago

@nv-ananjappa Can you please help?

mrasquinha-g commented 6 months ago

As discussed in the WG meeting today, please reach out to Ashwin.

kkkparty commented 4 months ago

I get the same AUC as you. Please let me know what the problem is.

arjunsuresh commented 4 months ago

@kkkparty are you running on a 4090 as well? We have not had a reply from Nvidia yet.

kkkparty commented 4 months ago

No, I ran on a 3090, and I cut out a piece of day23 to verify the AUC and got 63%. Furthermore, I can't see any dense weights being loaded. Can you?

arjunsuresh commented 1 month ago

We originally had an issue with the preprocessed data, which is now fixed, and we get the expected accuracy with Intel submissions using the same dataset. But we still have the issue with the Nvidia code.

Nvidia v4.0 code on 2xRTX 4090

[2024-07-10 09:56:32,793 run_harness.py:166 INFO] Result: Accuracy run detected.
[2024-07-10 09:56:32,806 __init__.py:46 INFO] Running command: PYTHONPATH=/work:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/usr/local/lib/python3.8/dist-packages:/usr/lib/python3/dist-packages:/usr/lib/python3.8/dist-packages python3 -S /work/build/inference/recommendation/dlrm_v2/pytorch/tools/accuracy-dlrm.py --mlperf-accuracy-file /work/build/logs/2024.07.10-09.46.34/spr_TRT/dlrm-v2-99/Offline/mlperf_log_accuracy.json --day-23-file /home/mlperf_inf_dlrmv2/criteo/day23/raw_data --aggregation-trace-file /home/mlperf_inf_dlrmv2/criteo/day23/sample_partition.txt --dtype float32
Assuming loadgen accuracy log does not contain ground truth labels.
Parsing loadgen accuracy log...
Parsing aggregation trace file...
Parsing ground truth labels from day_23 file...
Re-ordering ground truth labels...
Calculating AUC metric...
AUC=67.515%, accuracy=96.586%, good=86093981, total=89137319, queries=330067

======================== Result summaries: ========================

 spr_TRT-custom_k_99_MaxP-Offline:
   dlrm-v2-99:
     accuracy: [FAILED] AUC: 67.515 (Threshold=79.507)

Nvidia v3.1 code on 1xRTX 4090

[2024-07-10 00:31:07,727 run_harness.py:170 INFO] Result: Accuracy run detected.
[2024-07-10 00:31:07,784 __init__.py:46 INFO] Running command: PYTHONPATH=/work:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/arjun/.local/lib/python3.8/site-packages:/usr/local/lib/python3.8/dist-packages:/usr/lib/python3/dist-packages:/usr/lib/python3.8/dist-packages python3 -S /work/build/inference/recommendation/dlrm_v2/pytorch/tools/accuracy-dlrm.py --mlperf-accuracy-file /work/build/logs/2024.07.09-23.00.04/phoenix_TRT/dlrm-v2-99/Offline/mlperf_log_accuracy.json --day-23-file /home/mlperf_inf_dlrmv2/criteo/day23/raw_data --aggregation-trace-file /home/mlperf_inf_dlrmv2/criteo/day23/sample_partition.txt --dtype float32
Assuming loadgen accuracy log does not contain ground truth labels.
Parsing loadgen accuracy log...
Parsing aggregation trace file...
Parsing ground truth labels from day_23 file...
Re-ordering ground truth labels...
Calculating AUC metric...
AUC=71.568%, accuracy=96.587%, good=86094780, total=89137319, queries=330067

======================== Result summaries: ========================

 phoenix_TRT-custom_k_99_MaxP-Offline:
   dlrm-v2-99:
     accuracy: [FAILED] AUC: 71.568 (Threshold=79.507)
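
To compare several runs against the threshold without reading the logs by eye, the result line printed by accuracy-dlrm.py can be parsed directly. This is a minimal sketch: the regular expression is written against the line format shown in the logs above, the function names are hypothetical, and the `>=` comparison is an assumption about how the harness applies the threshold.

```python
import re

# Matches lines such as:
# AUC=71.568%, accuracy=96.587%, good=86094780, total=89137319, queries=330067
RESULT_RE = re.compile(
    r"AUC=(?P<auc>[\d.]+)%, accuracy=(?P<accuracy>[\d.]+)%, "
    r"good=(?P<good>\d+), total=(?P<total>\d+), queries=(?P<queries>\d+)"
)

# Threshold printed in the result summaries above.
AUC_THRESHOLD = 79.507


def parse_result(line):
    """Extract the metrics from one accuracy-script result line."""
    m = RESULT_RE.search(line)
    if m is None:
        raise ValueError("not a result line: %r" % line)
    return {
        "auc": float(m.group("auc")),
        "accuracy": float(m.group("accuracy")),
        "good": int(m.group("good")),
        "total": int(m.group("total")),
        "queries": int(m.group("queries")),
    }


def meets_threshold(auc, threshold=AUC_THRESHOLD):
    """Assumed pass rule: AUC must be at least the threshold."""
    return auc >= threshold


line = "AUC=71.568%, accuracy=96.587%, good=86094780, total=89137319, queries=330067"
result = parse_result(line)
shortfall = AUC_THRESHOLD - result["auc"]
print(result["auc"], meets_threshold(result["auc"]), round(shortfall, 3))
```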

The Nvidia code actually requires running this script to generate the frequency data. But this step is not documented anywhere, and we are not sure whether this script is the expected one.

arjunsuresh commented 1 month ago

We tried the same on an 8xH100 system and are getting an AUC of around 63%.

nv-ananjappa commented 1 month ago

@viraatc Please look into this if you have spare time before the v4.1 submission deadline. Else we can investigate after the deadline.

viraatc commented 1 month ago

The Nvidia code actually requires running this script to generate the frequency data. But this step is not documented anywhere, and we are not sure whether this script is the expected one.

@arjunsuresh you are correct, this script needs to be run. This is missing from the docs for v4.0; my apologies. Please let us know if you still encounter issues after running it inside the container, as shown below, before your run:

$ python -m "code.dlrm-v2.tensorrt.scripts.gen_frequency_data"

For v4.1 we have inlined this step as part of the engine build, so no additional steps or documentation will be needed.

arjunsuresh commented 1 month ago

Thank you @viraatc for checking. We also ran that script inside the container and that script produced the required npy file. But accuracy was still low.

Are we required to run "make calibrate" for DLRMv2? We tried that, but it gave an error with the v4.0 code.

viraatc commented 1 month ago

Are we required to run "make calibrate" for DLRMv2? We tried that, but it gave an error with the v4.0 code.

Yes, that is required after running the gen_frequency_data script. Please share here the error you get when you run make calibrate. I will investigate this after the v4.1 submission deadline.

arjunsuresh commented 1 month ago

Thank you @viraatc for confirming. The error during calibration is added below.

[V] Input tensor: sparse_input (dtype=DataType.INT32, shape=(256, 214)) | Setting input tensor shapes to: (min=[256, 214], opt=[256, 214], max=[256, 214])
[I] Using calibration profile: {numerical_input [min=[256, 13, 1, 1], opt=[256, 13, 1, 1], max=[256, 13, 1, 1]],
     sparse_input [min=[256, 214], opt=[256, 214], max=[256, 214]]}
[V] Input tensor: numerical_input (dtype=DataType.FLOAT, shape=(256, 13, 1, 1)) | Setting input tensor shapes to: (min=[256, 13, 1, 1], opt=[256, 13, 1, 1], max=[256, 13, 1, 1])
[V] Input tensor: sparse_input (dtype=DataType.INT32, shape=(256, 214)) | Setting input tensor shapes to: (min=[256, 214], opt=[256, 214], max=[256, 214])
[V] Loaded Module: numpy | Version: 1.23.5 | Path: ['/home/cmuser/.local/lib/python3.8/site-packages/numpy']
[V] Loaded Module: onnx | Version: 1.16.1 | Path: ['/home/cmuser/.local/lib/python3.8/site-packages/onnx']
[I] Building engine with configuration:
    Flags                  | [INT8]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 80994.94 MiB, TACTIC_DRAM: 80994.94 MiB]
    Tactic Sources         | [CUBLAS, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
    Calibrator             | Calibrator(<generator object DLRMv2.calibrate.<locals>.data_loader at 0x7f0b69ed1f20>, cache=PosixPath('code/dlrm-v2/tensorrt/calibrator.cache'), BaseClass=<class 'tensorrt.tensorrt.IInt8EntropyCalibrator2'>, batch_size=256)
[I] Loading calibration cache from code/dlrm-v2/tensorrt/calibrator.cache
[V] Reading Calibration Cache for calibrator: EntropyCalibration2
[V] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[V] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[W] Missing scale and zero-point for tensor sparse_input, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[V] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +66, now: CPU 159813, GPU 1761 (MiB)
[V] [MemUsageChange] Init cuDNN: CPU +4, GPU +72, now: CPU 159817, GPU 1833 (MiB)
[V] Global timing cache in use. Profiling results in this builder pass will be stored.
[V] Detected 2 inputs and 1 output network tensors.
[V] Total Host Persistent Memory: 85440
[V] Total Device Persistent Memory: 0
[V] Total Scratch Memory: 219136
[V] [BlockAssignment] Started assigning block shifts. This will take 39 steps to complete.
[V] [BlockAssignment] Algorithm ShiftNTopDown took 0.632123ms to assign 7 blocks to 39 nodes requiring 6194688 bytes.
[V] Total Activation Memory: 6193152
[V] Total Weights Memory: 16327680
[V] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 159956, GPU 1959 (MiB)
[V] [MemUsageChange] Init cuDNN: CPU +1, GPU +72, now: CPU 159957, GPU 2031 (MiB)
Starting plugin init...
IO Config: 
    Dense Input:     FLOAT
    Sparse Input:    INT32
    Embedding Table: FLOAT
    Output:          FLOAT
Loading embedding weights...
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0712 02:55:56.512785 15401 numpy.hpp:52] Check failed: m_FStream Unable to parse: /home/mlperf_inf_dlrmv2/model/embedding_weights/mega_table.npy
*** Check failure stack trace: ***
make: *** [Makefile:123: calibrate] Aborted (core dumped)
arjunsuresh commented 1 month ago

We created a softlink to mega_table_fp32.npy to get around the missing-file error and regenerated calibrator.cache. But that didn't help the accuracy either.

[2024-07-12 04:18:44,188 run_harness.py:166 INFO] Result: Accuracy run detected.
[2024-07-12 04:18:44,189 __init__.py:46 INFO] Running command: python3 /home/cmuser/CM/repos/local/cache/28c3eeb6237c4e93/repo/closed/NVIDIA/build/inference/recommendation/dlrm_v2/pytorch/tools/accuracy-dlrm.py --mlperf-accuracy-file /home/cmuser/CM/repos/local/cache/6c8bb6e9615a4933/test_results/4709e5ccb133-nvidia_original-gpu-tensorrt-vdefault-default_config/dlrm-v2-99/offline/accuracy/mlperf_log_accuracy.json --day-23-file /home/mlperf_inf_dlrmv2/criteo/day23/raw_data --aggregation-trace-file /home/mlperf_inf_dlrmv2/criteo/day23/sample_partition.txt --dtype float32
Assuming loadgen accuracy log does not contain ground truth labels.
Parsing loadgen accuracy log...
Parsing aggregation trace file...
Parsing ground truth labels from day_23 file...
Re-ordering ground truth labels...
Calculating AUC metric...
AUC=62.914%, accuracy=96.586%, good=86094219, total=89137319, queries=330067
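
For anyone reproducing the workaround: the calibration failure above is a check on /home/mlperf_inf_dlrmv2/model/embedding_weights/mega_table.npy, so the softlink presumably makes that name resolve to the fp32 table. The exact command used is not shown in this thread. A minimal sketch, assuming the fp32 table is named mega_table_fp32.npy and sits in the same directory (the helper name is hypothetical):

```python
import os


def link_missing_table(weights_dir,
                       existing_name="mega_table_fp32.npy",
                       expected_name="mega_table.npy"):
    """Create weights_dir/expected_name as a symlink to existing_name.

    Both file names are assumptions based on the error message and the
    comment above; adjust them to what is actually on disk.
    """
    source = os.path.join(weights_dir, existing_name)
    link = os.path.join(weights_dir, expected_name)
    if not os.path.isfile(source):
        raise FileNotFoundError(source)
    if os.path.lexists(link):
        # Do not overwrite a file or link that is already present.
        return link
    # Relative target, so the link stays valid if the directory is
    # mounted at a different path inside the container.
    os.symlink(existing_name, link)
    return link


# Example usage (directory taken from the log above):
# link_missing_table("/home/mlperf_inf_dlrmv2/model/embedding_weights")
```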
viraatc commented 2 weeks ago

Thanks @arjunsuresh for all the details. I will look into this now.

Let us first focus on v4.0-H100x1, assuming the default config for v4.0-H100x1-Offline.

Can you please follow the steps below and share the respective files:

  1. ensure clean environment

    • run clean: make clean
    • run build: make build
  2. share md5 of all dlrm data folder:

    • from /home/mlperf_inf_dlrmv2, run find . -type f | xargs -P $(nproc) -I {} md5sum {} | tee md5_tree.log
    • share this newly generated: md5_tree.log
  3. share generated calibrator.cache file:

    • remove existing file: rm /work/code/dlrm-v2/tensorrt/calibrator.cache
    • generate new file: make calibrate RUN_ARGS="--benchmarks=dlrmv2"
    • share this newly generated: /work/code/dlrm-v2/tensorrt/calibrator.cache
    • NOTE: the symlink you made mega_table_fp32.npy -> mega_table.npy is still required for this to work.
  4. share fresh logs which use the generated calibrator.cache:

    • after step 3, please run make run RUN_ARGS="--benchmarks=dlrmv2 --scenarios=offline --config_ver=default --test_mode=AccuracyOnly"
    • share the log folder as zip file.
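
Once an md5_tree.log from step 2 is available on both sides, the two listings can be compared mechanically. The sketch below parses md5sum-style lines (hash, whitespace, path) and reports files that are missing, extra, or different. The function names are hypothetical, and it assumes paths contain no leading or trailing whitespace.

```python
import re

MD5_HEX = re.compile(r"^[0-9a-f]{32}$")


def parse_md5_log(text):
    """Parse md5sum output into a {path: digest} dictionary."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(None, 1)
        digest = digest.lower()
        if not MD5_HEX.match(digest):
            raise ValueError("not an md5sum line: %r" % line)
        # md5sum prefixes the path with '*' in binary mode.
        entries[path.strip().lstrip("*")] = digest
    return entries


def diff_manifests(mine, reference):
    """Return (missing, extra, mismatched) sorted path lists, comparing
    'mine' against 'reference' (both as returned by parse_md5_log)."""
    missing = sorted(p for p in reference if p not in mine)
    extra = sorted(p for p in mine if p not in reference)
    mismatched = sorted(
        p for p in mine if p in reference and mine[p] != reference[p]
    )
    return missing, extra, mismatched


# Example usage with two log files (file names are placeholders):
# with open("md5_tree.log") as a, open("md5_tree_reference.log") as b:
#     print(diff_manifests(parse_md5_log(a.read()), parse_md5_log(b.read())))
```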
arjunsuresh commented 2 weeks ago

Thank you @viraatc for looking into this. Below are the md5sum values of the dlrm data folder. I'll share the calibrator.cache file and logs soon. Meanwhile, this is the MLCommons preprocessed Criteo dataset, which is used by Intel and others, including us. Can you please confirm whether this is the same one Nvidia is using?

a145bf8ba38d71137f8601876896d8d9  ./model/.snapshot_metadata
f1e1d545f964fb24d7e65fec86ca4020  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_9194304_0
ee2f4e147ae72e98baeb4c683876ecdd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_39194304_0
80633f51d26f7aeed777c2f77d3c76c0  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_2097152_0
548ef6312dbe5daeec9574d2b8e8b2fb  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_25000000_0
78477f306525dc7227eb9f438b24a5e7  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_21048576_0
e54c6d2993b511ad9eb7aedc70da931d  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_7097152_0
7b80ccfa3c056be76c3bde7141710b16  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_30000000_0
08bc8409e38cb92d433eb9aaabeaef1f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_37097152_0
33c471827515f1101c20800ab22f3dd8  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_38145728_0
74535d7d559a463a90d17e02de5e2c79  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_15000000_0
392e7cfbeb3fe5c59a2fb52c340cd4ce  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_2097152_0
28c667f303e8c6da85a5a5614847170f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_26048576_0
f7bb6ccbc4d6454b90f4d3d4f476d8ab  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_25000000_0
7dacf3a85889c6384b3a3b85efc25a95  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_1048576_0
fe211034447c9063d4b5141f2dfc6904  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_20000000_0
6b9e38777dcb03b3c379a88fb92bc2da  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_10000000_0
642dd6c9445d5d3fcb6fcc4670d9e395  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_7097152_0
a3fbc6af26a6dccea0b8d06d7f829ab7  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_32097152_0
949419855a540b778c186fa7e241ceda  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_1048576_0
a0f8ef9b4a4f14fe4211c07819f2fcf1  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/18.npy
41610d7810e5552ab5ecf0332e47a55e  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/7.npy
2e841abb6f8f7cd30e3e5ef5df855bf1  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/2.npy
a7650873137dc3518fa06d296d47df2b  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/8.npy
ced0a164f926f97a7501b047d3d05fad  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/17.npy
da0245108f14131171ac3d43418a100c  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/6.npy
cc7daf94cf81e89360f1273750b0a78a  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/24.npy
7bd8c842b7504c97e1078654d2e3a5c0  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/25.npy
ef346cd1ce26c7c85b6c4c108bdafaf0  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/5.npy
c3ae2edfb9c2279ec5e10e452226f661  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/16.npy
e9b71259c97546df1e9c82841f9c3d03  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/12.npy
dd68f93301812026ed6f58dfb0757fa7  ./criteo/day23/fp32/day_23_labels.npy
82adff98b58345f168b9ac5dd581eb4f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_9194304_0
41a547605fb3247df77102ba2f7c7b2d  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_34194304_0
e06439a6f364da65faec13058aa70397  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_24194304_0
fcb38bb3766fe4ccb80dcedeb9a07dc7  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_18145728_0
30e1de64b8a384d2ccc8291d0935e7e7  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_14194304_0
04d791ff2c94d60fd0aa7c872e0a1b15  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_22.weight_0_0
af77b0ae84c53777c9ec203e4ff95590  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_12097152_0
e140d7dc8a2592c616151fc0f8d0aa6e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_17097152_0
4c7f7170a6fdc20ac16668d6cf979cb4  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_2097152_0
3f624f10f2b8f8ec5f12f01aa2e52cb2  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_31048576_0
8b3a5356d633da8f0cad37cd41d24420  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_36048576_0
a7e49ca0f6ef4ece5d030ac7da550f54  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_31048576_0
e10540b2e8f1113bddceb9a6ea1a6743  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_20000000_0
669f05242472963bd614317046273efe  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_26048576_0
6fedfb7e6c12cb54fbdfd1a8653b03e5  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_0_0
b660c54389d1bd86afe26ca830d6ccbe  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_38145728_0
d369d35b7c9db97cc2d299d15ffcf399  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_28145728_0
49a78383f3b4f748fb98e4addfdbf414  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_25000000_0
4f774c3c4f90ff007353e0892708c7d9  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_3145728_0
e7e697a7fe4319274fde1e20f5a2caaf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_37097152_0
2ca471d6813a0970d8ac10518bbd2b8f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_9194304_0
3bc32c339b56b7d670572ef19d9c6e56  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_19194304_0
830e9aa0c756f1c8f81bcf39bc9f2e4e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_19194304_0
ce01efbacf831f423546d0024b69ebc3  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_14194304_0
3d05ba8fe664115f9c753e34352cadb1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_34194304_0
d9c25f227fa1147f99d2b238ac3ca931  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_34194304_0
c72983df6a02717aeb0208681fe5e9b8  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_2097152_0
55b7eee36dc3a711adb6ce165c05961c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_18145728_0
01b35f69f6fe5497db425b0d63c4842c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_13145728_0
31656faa5c8292694ccc5a60af9e8b5a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_20000000_0
0586fb4f3ebeaf146cbf830f07d7d139  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_32097152_0
e67a63066a1ac73646ffbfb7e01ef0c1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_13145728_0
a684ab288fe2bcc76374be0b0744fa2f  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/1.npy
e020aa1563b8b1a411405e420b322f49  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/3.npy
b424cd23564a5ae90167cdec5310a699  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_1048576_0
757441a8dbadcd597bf321343406245c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_12097152_0
673d528813f90d96351f3be6ac60a376  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_17097152_0
80a328d0c168f910019b96e19b0f2670  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_33145728_0
14a13d00245609b987c492a1a2c9734b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_4194304_0
67ffe3b29b331f57f7bde4c64e38584c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_39194304_0
3d4b4649490b2f0e43d84b0cec565f60  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_10000000_0
9bc6f9afd5396c67c465f68dbdf3ad3d  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_0_0
37b667c524d11eea8379a70f980bea57  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_20000000_0
dfbf6b101f64d788dec3dc0f83683a7a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_23145728_0
b7ed6bf6a4721d1c884c402944fcbaba  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_6048576_0
e1b38487b1aee957485c309c2e6d4f43  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_28145728_0
ff1910cc5e196cf8b95613fb3e1d6339  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_15000000_0
8473d8be8b56150ce900689d3dbcac58  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_35000000_0
55384f5a5bce56a7275c29be7479f2ed  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_1048576_0
489017aebea1f5a2502ba39bd4164624  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_0_0
14aa50070a4f96b0d48b3017f00adad1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_28145728_0
aa7c3862b1e7bac4ac62a1080e3708b3  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_23145728_0
f4974323b69649b6de5f6cab296b3766  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_11048576_0
404e0ecf5afa0523b13365ff8140d1cf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_33145728_0
63336a5a310ebaabb0d8fb8426c9a88c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_17097152_0
01628e8cbd20d7d77bff869e6836c82a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_5000000_0
9ce29a1baae29fb241805a81746dbf05  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_7097152_0
7ce4af0bd1ffd605117fca627585596f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_26048576_0
497bcfa5bb579cbdf209465f2b9c8e9b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_33145728_0
fe2229fae301f145c64c69ffa98d1c3c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_14194304_0
8eec5fbca7218561eb854b8454f2e2a9  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_26048576_0
d6f9b8070963ef2327df0983b88bdace  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_11048576_0
7a343602f6fa7565d93a5d1c07d7a297  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_35000000_0
1ce2b74f8a2a7e4032d7a24c0e78d5ba  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_8145728_0
8a097aa79cfd23b19985568caec1341d  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_11048576_0
43abcf88605d0c11beb4d16a8cec5e51  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_4194304_0
17a85971189e7a317efbc366f8413a9a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_34194304_0
e0fe2de6b858584cb2210897ade09f11  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_36048576_0
79da68ce871f84578e7c9a2c25ff29e1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_19194304_0
9a82ba8cc0b40461df9f11a774025dbd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_29194304_0
7ab30a6daf7212fd3706a82b1f36e04e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_34194304_0
8584fa283032bdfc6ef47e8e22668393  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_14194304_0
ef6b61912fd821b32a40e6f534cfe441  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_22097152_0
f277090c42a21600b4ea4adcfe590200  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_5000000_0
1689acf98130c83cead84544ffc84bb1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_15000000_0
88ccd51c5b3daa393ca2b51f7eb7bd03  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_0_0
1ced3f766a8a1325772a9d07707abaf2  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_21048576_0
e16b45a7a7127ab0fff4603a9ff9addf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_31048576_0
3434833ecd8a225f3f405b812bb47944  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/23.npy
9b383e9ca2ad6d0841346b255f390a01  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/10.npy
a435220293e8e2b4c2b70267b759cf36  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/0.npy
ce2ab49978ebd5d5e57aefc9ff98620a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_8145728_0
8eb257d3794a4ba1176257039152697c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_10000000_0
161625ceeed2b289d54500e9b5c53cdf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_10000000_0
6236372be1c8d44e64d5d361d573e069  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_3145728_0
8a50cee50e6a29e1d4bc6434c043a3ff  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_3145728_0
72e4d062c8c82b058d990c510b19801f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_28145728_0
b4309e437817166a3499bffb9526e596  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_32097152_0
ce113ed2332d3ab2ae3d7b8659af9743  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_6048576_0
648054b5342ecd82875db9e7e30266cd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_2097152_0
72bb8ede1c6eff8760844786bef5f21f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_12097152_0
53b6face525eaec2651c6cbd804d99bb  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_37097152_0
8151a193c9da95286a77c1e368554679  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_39194304_0
8ea36f73b89a8f493ac65caae3cdedd3  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_5000000_0
bb81722dac6a6fb513e3ea375e4c1585  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_0_0
4c8521eab11b84e1bbe09dc6a1597543  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_9194304_0
f45e2c17e1f2b9e7fa015491e6db8752  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_18145728_0
e2de2c95e21b974726c0663d8bda66ea  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_23145728_0
c0b05b9b59b840658ab65c7cbc842c9e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_11048576_0
b69b590999f01d8a75813a80a8030632  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_22097152_0
a60ee56d34f187bb06295e633144f272  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_32097152_0
b246e077be7c8547cef5de46e3ba947b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_17097152_0
7678e02c717dce559ecfcb0d804a56de  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_29194304_0
92f72a5af7fc9dd64e6b627bf6eca7bc  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_36048576_0
7322754f2dcd26a9555e5ad05a531211  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_38145728_0
fcac09e7f6fd8e1115b1c74b4798a8e6  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_21048576_0
336002538d79efe950432373ecdef2b0  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_19194304_0
a81ffafb5dd836aa030e23a86c392302  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_22097152_0
927b1c653c4b6d5dc079807a11df0750  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_35000000_0
4f916d854ff9b0d329b4a72a6713f8cd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_24194304_0
0c637fb98c14d29037ac0b1bbcc9c9e8  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_28145728_0
040dff78e0558eb7209fe6fa891223e0  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_30000000_0
641cd6bb095f620c56835a5562c825bf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_24194304_0
6e12338b1410c912f800b7c7b0b6aa87  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_6048576_0
a0a0e6dd38b14829f78ec5038d86c0ee  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_31048576_0
b817122ea8f4222c570b72904e885f1a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_23145728_0
2fd34285bae3bca58e4f3959a9345ddd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_9194304_0
37ceeb98ba435cf8a20fcf4ef0b02a2f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_37097152_0
440e2b700359d36186ce74df6e947ca5  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_11048576_0
ba87ebf270fbcc6e4fc432c1287f1acc  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_32097152_0
dfc81cd41a5f61fce07e79068b1d7026  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_13145728_0
d0047d594e6b322a9ab05704cd5be9af  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_5000000_0
9838ff6954bfce22eb4b890b09c268d1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_8145728_0
3edfc0ae1100a32474ec17ad243222bb  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_22097152_0
1147662f4e25cf6c24ceb53f2bbff25b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_36048576_0
cc18545eff864549413987d071fa4130  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_29194304_0
c06759a9e702f50618849f66d921eb71  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_7097152_0
e05427946ba766b69e4729b036f88eec  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_6048576_0
37628bc6d421be15ccf3f32c225e9909  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_31048576_0
d54977419bf849726f9d5cf39a299719  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_1048576_0
6895e65f92ead8186100662246960613  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_16048576_0
a781d6704f982dcdc36d3fe7f7de4cdd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_35000000_0
1c9b816a226ef799fdb3bcde748750aa  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_1048576_0
ee66fa72d83e5aec27b43729991530e9  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_16048576_0
77db2a8a848967cd1b8af0d741120436  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_29194304_0
54ca200441e7f7a72d0617acf9e550a1  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_11.weight_0_0
6a8a6797a0b11afd37baa36dfd7c897b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_14194304_0
65f37a1da03070b16a8359dc3a7dc178  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_12097152_0
296885f6440cd96cc89058ce8d182970  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_27097152_0
56b9a4972c7bb472851b227aec9d47cf  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_6048576_0
5b7b9fb5ec058050b17cf1d6fb721b36  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_17097152_0
e42e5443a8751da7f806ebb7f7f9cb14  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_24194304_0
c8c5d9ea569fe1cf2a191b60195d9278  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_30000000_0
e9c3236a9165022eed33f4e9cb32817e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_27097152_0
2f0bd5c401b2f00977596425876758f7  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_18145728_0
61be79da919b70498adaa8d24555b30f  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_39194304_0
331c11fee6174f6ebfcf88a323b916bd  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_23145728_0
a6562f3c93f617c59beeb885a591dbf3  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_33145728_0
61c0da9e01e2d6f7ff36e51c8226bc57  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_36048576_0
573b6c53f2be28ea64b44432f041b2a4  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_20000000_0
110359665cc240444ceaf866b339c830  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_25000000_0
e7eaeb361f9398e82bf590e43e797359  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_8145728_0
f31d5913d1c0484ff05ce4abf9aeb02e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_39194304_0
8e3e79875ff4a446a31eb7c39840a5f3  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_26048576_0
28a8d7a858c38a840718f017cdae1918  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_2097152_0
03d67a62673c092eff9d7e7f7e13bd71  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_4194304_0
a145bf8ba38d71137f8601876896d8d9  ./model/model_weights/.snapshot_metadata
25ddf7f4f2c8584ce5ccceb9ea68a23e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_27097152_0
02cfba7b5b989b57fa242c6e12d0fe5a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_4194304_0
c34ef2a88ef21c015a5c41cd5174f526  ./model/model_weights/batched/eac7f4b6-b9c3-475a-a3ff-cde4fedc6802
25ba3700b1b01da3d68160ae0c4fc870  ./model/model_weights/batched/7e00b2da-abef-415b-ad92-7e1c416e3add
c1dc6791e03b5090e75db5598a92dd0a  ./model/model_weights/batched/36b28946-e72f-490d-8ab2-cba1649573f8
f4ea38088ce72a56a3d0d2ac7eef47e9  ./model/model_weights/batched/bb747bc7-64ec-4c3d-afb7-d85ec3869692
60d9745b69278c98585aeb5cb87eaa13  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_24194304_0
1c8e6a1b9811c135473c7b2fb82e2d8d  ./model/model_weights/batched/1eb3adc8-ac96-410f-842f-809ac1f9a38f
64ab8a3273105ca1fa136f53c43a596b  ./model/model_weights/batched/6795741c-62c3-4383-9518-be0b2f937083
bc0f2f2ef13b8b93fca2069d494021cc  ./model/model_weights/batched/c20a5f66-eef2-4c52-bd9f-941d6117fa6e
a18e5255e34d2bb7157000e5e4bb5d03  ./model/model_weights/batched/a96647f1-5f9d-4269-a513-951187353c89
a11e6f3cc0806d6265f9f2b943a6c62e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_13145728_0
66f753d18943b8965c68e3833fe0494a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_13145728_0
67e81be3c4821e837e4f8c5b9241283a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_12097152_0
5c9b0f77920b41373cbb69dfd8b7da79  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_22097152_0
3b1cfc9654f42ce4f491c7295d41ac5e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_37097152_0
0d42564d2b1e1c5db86433c96f8e6338  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_10000000_0
ddba8f13390d15b75e183edf9db99966  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_27097152_0
db5c23cabb631fed54149df34f69d518  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_38145728_0
faeb52ac2524dd748aebc1e0ad3d920c  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_27097152_0
99b528650956989af10450a7d5de194e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_16048576_0
67c36c043cc318a7cd75801f575b4318  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_16048576_0
26e24e807b654f89ebbcb9dcdf1ff53a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_3145728_0
d14b78e40dbc374c7541a89dcb989883  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_21048576_0
bf89130a5dcb11183de868684ffebf95  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_29194304_0
139e7fea9b70ed3aba3effca7a3c8612  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_18145728_0
9c34bbfdb4b9c632f7ce938abed1d39e  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_15000000_0
6529e42d90358a9ca88e5be34ca13bd9  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_8145728_0
4d3a38f17b00272fac92e73e14cc4f4d  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_3145728_0
b83080f49b43d74fabee20bf629738e2  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_38145728_0
e37470532b01f46e213eecb023f2c075  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_4194304_0
7106b291f2c7bc99553323ad8f12f4ab  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_19194304_0
32a5a9a6643c8d5d01f65c10156f1f7b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_25000000_0
29b6319884c87eac2dc10e4670576bc5  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/15.npy
12b036c64fdd762fb72478ad4c5b76f2  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_35000000_0
70f0b853ffff32fef8318cb420cb1566  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_7097152_0
2ef9a3cc86ef7009208d13d17d14b072  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_15000000_0
662ae59409f54e143738aeb6b26c02ab  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_33145728_0
33515f75e36ad4deb25269084b397ac6  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_0_0
9748c2a81d508b92834c7c25d1cdcefb  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_30000000_0
d4a709eb503e524461e7b94de5a12263  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_5000000_0
be0e93562e96a66a6718026616c50a3a  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_21048576_0
806dfb03fa65e6a9e7e154fdc934b925  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_16048576_0
e2143d80b2a7860c80ab6ebc770daa2b  ./model/model_weights/sharded/model/model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_30000000_0
cabb6b0d784a9d74192ef029f53309d4  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/4.npy
4b8e79310e06168422e6aa7f1b66f6ae  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/13.npy
49acd882a1b742af1743922f9409fc1e  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/9.npy
b1ce2de05b791c1ddb36e0e573a75d93  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/11.npy
f9acdc32bd6b766358be846d34b7dd19  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/14.npy
45026929433aa879157e9b4f033c75b2  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/22.npy
0ab3a06e2b648cf574d1235d71ebb006  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/19.npy
cdf7af87cbc7e9b468c0be46b1767601  ./criteo/day23/fp32/day_23_dense.npy
3f8626a163420fc26c35c82b5b42e7ee  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/21.npy
08e251af4f3d1e8771ea15e405f39600  ./criteo/day23/raw_data
7c753b13d54ad9e3e6c5e73719622201  ./criteo/day23/fp32/day_23_sparse_multi_hot_unpacked/20.npy
arjunsuresh commented 2 weeks ago

@viraatc The tar file below contains the logs and the calibration.cache file: https://drive.google.com/file/d/1wJG5wqzH3IP7pLE-jGTmKptLEwgtfxzN/view?usp=sharing

arjunsuresh commented 1 week ago

@viraatc Can you please confirm whether the checksums above match the expected values?
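
To make the comparison easier on either side, here is a minimal sketch of how two `md5sum` listings like the one above could be diffed. It assumes both listings use the plain `<hash>  <path>` format shown in this thread and were generated from the same root directory; the function names and the file names in the usage note are hypothetical and not part of the Nvidia harness.

```python
def parse_md5sum_listing(text):
    """Parse `md5sum`-style output (`<hash>  <path>` per line) into {path: hash}.

    Lines that do not start with a 32-character hex digest are skipped,
    so pasted comment text around the listing is ignored.
    """
    sums = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or len(parts[0]) != 32:
            continue
        try:
            int(parts[0], 16)  # reject tokens that are not hexadecimal
        except ValueError:
            continue
        sums[parts[1].strip()] = parts[0].lower()
    return sums


def diff_listings(local, reference):
    """Compare two {path: hash} dicts.

    Returns three sorted lists of paths:
    missing    - present in the reference but not locally
    extra      - present locally but not in the reference
    mismatched - present in both with different hashes
    """
    missing = sorted(set(reference) - set(local))
    extra = sorted(set(local) - set(reference))
    mismatched = sorted(
        path for path in set(local) & set(reference)
        if local[path] != reference[path]
    )
    return missing, extra, mismatched
```

For example, with the listing above saved as `local_md5.txt` and a known-good listing saved as `reference_md5.txt` (both names are placeholders), reading each file with `open(...).read()`, passing the text through `parse_md5sum_listing`, and calling `diff_listings` on the two results would show which files, if any, differ.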