wlhtjht opened 3 months ago
There is a typo on the docs website: it should be dlrm-v2-99.
But the dlrm-v2 reference implementation is tested only on 8x 80GB Nvidia GPUs. Would you like to try the Intel or Nvidia implementation instead?
I prefer to do this on an Arm CPU, though of course I will also try Intel and Nvidia. With the command changed to "dlrm-v2-99" it is still not OK, but the Intel implementation is fine, at least there is no CM error. I will try manually. Thank you.
Oh okay. The reference implementation probably won't work out of the box on CPUs (it failed for us).
Meanwhile, we are working on updating the dlrm scripts to use the MLCommons-hosted preprocessed Criteo dataset; this should be ready by tomorrow. Without it, the current scripts are broken, as the Criteo dataset needs to be downloaded manually.
Curious whether the script has been updated? We are eager to test.
I downloaded the weights and data manually, but I must still be missing something:
cm run script --tags=get,ml-model,dlrm,_pytorch,_weight_sharded,_rclone -j
cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=dlrm-v2-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker \
   --quiet \
   --test_query_count=50
results in: KeyError: 'CM_DATASET_PREPROCESSED_PATH'
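In case it helps anyone who hits the same KeyError: my guess (just a sketch using the standard CM cache commands) is that the cached dataset entry does not export that variable, in which case inspecting and rebuilding the cache entry may help:

```
# inspect the cached Criteo entry to see what it exports
cm show cache --tags=preprocessed,dataset,criteo

# force the entry to be rebuilt
cm rm cache --tags=preprocessed,dataset,criteo -f
cm run script --tags=get,preprocessed,dataset,criteo,_multihot,_mlc -j
```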
@howudodat yes, the preprocessing script is now working. The reference scripts needed many changes to get them working. Please do:
cm pull repo
cm run script --tags=run-mlperf,inference,_full --model=dlrm-v2-99 --backend=pytorch --quiet --test_query_count=1000 --docker
In the second command you can drop --docker if you want to run on the host machine.
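That is, the host-machine variant would simply be (same flags, no container):

```
cm pull repo
cm run script --tags=run-mlperf,inference,_full --model=dlrm-v2-99 --backend=pytorch --quiet --test_query_count=1000
```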
It is still in testing, so there can be issues. The machine will need about 500GB of memory; we are testing on 128GB of RAM with 500GB of swap.
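If, as in our setup, physical RAM is well below what the model needs, a swap file can make up the difference. A minimal sketch using standard Linux commands (the size and path are just examples; this needs root and enough free disk):

```bash
# create and enable a 500GB swap file
sudo fallocate -l 500G /swapfile
sudo chmod 600 /swapfile     # swap files must not be world-readable
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show                # verify the new swap space is active
```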
If all goes well, you should see output like this:
./run_local.sh pytorch dlrm multihot-criteo cpu --scenario Offline --mlperf_conf '/home/cmuser/CM/repos/local/cache/7aecc037606c4784/inference/mlperf.conf' --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --user_conf '/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/7b425bbfd8e34c99ae1dfdf49a5e8902.conf' 2>&1 ; echo $? > exitstatus | tee '/home/cmuser/CM/repos/local/cache/28c9ba49c42b4bd6/test_results/79971f578f12-reference-cpu-pytorch-v1.13.1-default_config/dlrm-v2-99/offline/performance/run_1/console.out'
+ python python/main.py --profile dlrm-multihot-pytorch --mlperf_conf ../../../mlperf.conf --model dlrm --model-path /home/cmuser/CM/repos/local/cache/1faad90b75984cb0/model_weights --dataset multihot-criteo --dataset-path /home/cmuser/CM/repos/local/cache/6e1079b9161e4ac2/dlrm_preprocessed --output /home/cmuser/CM/repos/local/cache/7aecc037606c4784/inference/recommendation/dlrm_v2/pytorch/output/pytorch-cpu/dlrm --scenario Offline --mlperf_conf /home/cmuser/CM/repos/local/cache/7aecc037606c4784/inference/mlperf.conf --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --user_conf /home/cmuser/CM/repos/gateoverflow@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/7b425bbfd8e34c99ae1dfdf49a5e8902.conf
/home/cmuser/.local/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEENS6_INS5_12MemoryFormatEEE
INFO:main:Namespace(model='dlrm', model_path='/home/cmuser/CM/repos/local/cache/1faad90b75984cb0/model_weights', dataset='multihot-criteo', dataset_path='/home/cmuser/CM/repos/local/cache/6e1079b9161e4ac2/dlrm_preprocessed', profile='dlrm-multihot-pytorch', scenario='Offline', max_ind_range=40000000, max_batchsize=2048, output='/home/cmuser/CM/repos/local/cache/7aecc037606c4784/inference/recommendation/dlrm_v2/pytorch/output/pytorch-cpu/dlrm', inputs=['continuous and categorical features'], outputs=['probability'], backend='pytorch-native', use_gpu=False, threads=32, accuracy=False, find_peak_performance=False, mlperf_conf='/home/cmuser/CM/repos/local/cache/7aecc037606c4784/inference/mlperf.conf', user_conf='/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/7b425bbfd8e34c99ae1dfdf49a5e8902.conf', duration=None, target_qps=None, max_latency=None, count_samples=None, count_queries=None, samples_per_query_multistream=8, samples_per_query_offline=2048, samples_to_aggregate_fix=None, samples_to_aggregate_min=None, samples_to_aggregate_max=None, samples_to_aggregate_quantile_file='./tools/dist_quantile.txt', samples_to_aggregate_trace_file='dlrm_trace_of_aggregated_samples.txt', numpy_rand_seed=123, debug=False)
Using CPU...
Using variable query size: custom distribution (file ./tools/dist_quantile.txt)
Loading model from /home/cmuser/CM/repos/local/cache/1faad90b75984cb0/model_weights
Initializing embeddings...
Initializing model...
Distributing the model...
WARNING:root:Could not determine LOCAL_WORLD_SIZE from environment, falling back to WORLD_SIZE.
INFO:torchrec.distributed.planner.proposers:Skipping grid search proposer as there are too many proposals.
Total proposals to search: 4.50e+15
Max proposals allowed: 10000
INFO:torchrec.distributed.planner.stats:###############################################################################################################################################################################################################
INFO:torchrec.distributed.planner.stats:# --- Planner Statistics --- #
INFO:torchrec.distributed.planner.stats:# --- Evaluated 81 proposal(s), found 81 possible plan(s), ran for 0.04s --- #
INFO:torchrec.distributed.planner.stats:# ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- #
INFO:torchrec.distributed.planner.stats:# Rank HBM (GB) DDR (GB) Perf (ms) Input (MB) Output (MB) Shards #
INFO:torchrec.distributed.planner.stats:# ------ ---------- ---------- ----------- ------------ ------------- -------- #
INFO:torchrec.distributed.planner.stats:# 0 0.0 (0%) 97.7 (76%) 0.774 0.1 6.5 TW: 26 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Input: MB/iteration, Output: MB/iteration, Shards: number of tables #
INFO:torchrec.distributed.planner.stats:# HBM: estimated peak memory usage for shards, dense tensors, and features (KJT) #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Parameter Info: #
INFO:torchrec.distributed.planner.stats:# FQN Sharding Compute Kernel Perf (ms) Pooling Factor Number of Poolings Output Features Emb Dim Hash Size Ranks #
INFO:torchrec.distributed.planner.stats:# ----- ---------- ---------------- ----------- ---------------- -------------------- -------- ---------- -------- ----------- ------- #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_0 TW fused 0.03 1.0 1.0 pooled 1 128 40000000 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_1 TW fused 0.03 1.0 1.0 pooled 1 128 39060 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_2 TW fused 0.03 1.0 1.0 pooled 1 128 17295 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_3 TW fused 0.03 1.0 1.0 pooled 1 128 7424 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_4 TW fused 0.03 1.0 1.0 pooled 1 128 20265 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_5 TW fused 0.03 1.0 1.0 pooled 1 128 3 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_6 TW fused 0.03 1.0 1.0 pooled 1 128 7122 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_7 TW fused 0.03 1.0 1.0 pooled 1 128 1543 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_8 TW fused 0.03 1.0 1.0 pooled 1 128 63 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_9 TW fused 0.03 1.0 1.0 pooled 1 128 40000000 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_10 TW fused 0.03 1.0 1.0 pooled 1 128 3067956 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_11 TW fused 0.03 1.0 1.0 pooled 1 128 405282 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_12 TW fused 0.03 1.0 1.0 pooled 1 128 10 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_13 TW fused 0.03 1.0 1.0 pooled 1 128 2209 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_14 TW fused 0.03 1.0 1.0 pooled 1 128 11938 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_15 TW fused 0.03 1.0 1.0 pooled 1 128 155 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_16 TW fused 0.03 1.0 1.0 pooled 1 128 4 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_17 TW fused 0.03 1.0 1.0 pooled 1 128 976 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_18 TW fused 0.03 1.0 1.0 pooled 1 128 14 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_19 TW fused 0.03 1.0 1.0 pooled 1 128 40000000 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_20 TW fused 0.03 1.0 1.0 pooled 1 128 40000000 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_21 TW fused 0.03 1.0 1.0 pooled 1 128 40000000 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_22 TW fused 0.03 1.0 1.0 pooled 1 128 590152 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_23 TW fused 0.03 1.0 1.0 pooled 1 128 12973 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_24 TW fused 0.03 1.0 1.0 pooled 1 128 108 0 #
INFO:torchrec.distributed.planner.stats:# model.sparse_arch.embedding_bag_collection.t_cat_25 TW fused 0.03 1.0 1.0 pooled 1 128 36 0 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Batch Size: 512 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Compute Kernels: #
INFO:torchrec.distributed.planner.stats:# fused: 26 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Longest Critical Path: 0.774 ms on rank 0 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Peak Memory Pressure: 0.0 GB on rank 0 #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Usable Memory: #
INFO:torchrec.distributed.planner.stats:# HBM: 0.0 GB, DDR: 128.0 GB #
INFO:torchrec.distributed.planner.stats:# Percent of Total HBM: 95% #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# Dense Storage (per rank): #
INFO:torchrec.distributed.planner.stats:# HBM: 0.0 GB, DDR: 0.359 GB #
INFO:torchrec.distributed.planner.stats:# #
INFO:torchrec.distributed.planner.stats:# KJT Storage (per rank): #
INFO:torchrec.distributed.planner.stats:# HBM: 0.0 GB, DDR: 0.002 GB #
INFO:torchrec.distributed.planner.stats:###############################################################################################################################################################################################################
INFO:root:Using fused exact_sgd with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.01, eps=1e-08, beta1=0.9, beta2=0.999, weight_decay=0.0, weight_decay_mode=0, eta=0.001, momentum=0.9)
Loading model weights...
INFO:torchsnapshot.scheduler:Set process memory budget to 15829337702 bytes.
INFO:torchsnapshot.scheduler:Rank 0 finished loading. Throughput: 4157.55MB/s
I am testing. One small issue: a prerequisite, unzip, was missing:
/home/peter/CM/repos/mlcommons@cm4mlops/script/get-rclone/install.sh: line 9: unzip: command not found
A simple `apt install unzip` fixes it and the run progresses.
We'll fix the unzip issue. Thanks for letting us know, and please tell us if you hit any other issue with the run.
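The eventual fix will most likely be a small guard in get-rclone's install.sh along these lines (a sketch, not the actual patch):

```bash
# install unzip if it is not already available (Debian/Ubuntu example)
if ! command -v unzip >/dev/null 2>&1; then
    sudo apt-get update && sudo apt-get install -y unzip
fi
```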
I don't have 128GB of memory to use, but I am not even getting to the point shown above:
CMD: ./run_local.sh pytorch dlrm multihot-criteo cpu --scenario Offline --mlperf_conf '/home/cmuser/CM/repos/local/cache/a14eb481b8b24a9f/inference/mlperf.conf' --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --user_conf '/home/cmuser/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/d30dabe774784ae9b118c18ee413dc16.conf' 2>&1 ; echo \$? > exitstatus | tee '/home/cmuser/CM/repos/local/cache/d339e8c618184c86/test_results/f151395b64b0-reference-cpu-pytorch-v1.13.1-default_config/dlrm-v2-99/offline/performance/run_1/console.out'
DEBUG:root: - Running native script "/home/cmuser/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh" from temporal script "tmp-run.sh" in "/home/cmuser" ...
INFO:root: ! cd /home/cmuser
INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
./run_local.sh pytorch dlrm multihot-criteo cpu --scenario Offline --mlperf_conf '/home/cmuser/CM/repos/local/cache/a14eb481b8b24a9f/inference/mlperf.conf' --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --user_conf '/home/cmuser/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/d30dabe774784ae9b118c18ee413dc16.conf' 2>&1 ; echo $? > exitstatus | tee '/home/cmuser/CM/repos/local/cache/d339e8c618184c86/test_results/f151395b64b0-reference-cpu-pytorch-v1.13.1-default_config/dlrm-v2-99/offline/performance/run_1/console.out'
+ python python/main.py --profile dlrm-multihot-pytorch --mlperf_conf ../../../mlperf.conf --model dlrm --model-path /home/cmuser/CM/repos/local/cache/2f38d7a46ac74efe/model_weights --dataset multihot-criteo --dataset-path /home/cmuser/CM/repos/local/cache/855ac16cc352468a/dlrm_preprocessed --output /home/cmuser/CM/repos/local/cache/a14eb481b8b24a9f/inference/recommendation/dlrm_v2/pytorch/output/pytorch-cpu/dlrm --scenario Offline --mlperf_conf /home/cmuser/CM/repos/local/cache/a14eb481b8b24a9f/inference/mlperf.conf --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --user_conf /home/cmuser/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/d30dabe774784ae9b118c18ee413dc16.conf
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmppc12zh80
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmppc12zh80/_remote_module_non_scriptable.py
./run_local.sh: line 14: 494 Illegal instruction (core dumped) python python/main.py --profile $profile $common_opt --model $model --model-path $model_path --dataset $dataset --dataset-path $DATA_DIR --output $OUTPUT_DIR $EXTRA_OPS $@
./run.sh: line 59: 132: command not found
./run.sh: line 65: 132: command not found
CM error: Portable CM script failed (name = benchmark-program, return code = 32512)
"Illegal instruction" might point to an issue with the Arm architecture; we have never tried the reference implementation on aarch64.
DLRM-v2 is a datacenter-only benchmark; the model itself is 98GB, so trying it on a 32GB system will be really hard even on x86.
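If you want to dig further, our guess (unverified) is that one of the prebuilt wheels, e.g. fbgemm_gpu, which already printed an undefined-symbol warning above, contains instructions your CPU does not support. Two generic checks:

```bash
# confirm the interpreter and the torch wheel are genuine aarch64 builds
python3 -c "import platform, torch; print(platform.machine(), torch.__version__)"

# if a core dump was written, gdb can show the exact faulting instruction
# (the core-file path is an example; adjust to your system's core pattern)
gdb "$(which python3)" core
# then at the gdb prompt: x/i $pc
```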
(python3-venv) aarch64_sh ~> cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=dlrm_v2-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50
INFO:root: cm run script "run-mlperf inference _find-performance _full _r4.1"
INFO:root: cm run script "detect os"
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: cm run script "detect cpu"
INFO:root: cm run script "detect os"
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root: cm run script "get python3"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/a30274b4c59046f8/cm-cached-state.json
INFO:root:Path to Python: /home/ubuntu/CM/repos/local/cache/8ff2b68847874923/mlperf/bin/python3
INFO:root:Python version: 3.10.12
INFO:root: cm run script "get mlcommons inference src"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/181aac323a064657/cm-cached-state.json
INFO:root: cm run script "get sut description"
INFO:root: cm run script "detect os"
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: cm run script "detect cpu"
INFO:root: cm run script "detect os"
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: ! cd /home/ubuntu
INFO:root: ! call /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root: cm run script "get python3"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/a30274b4c59046f8/cm-cached-state.json
INFO:root:Path to Python: /home/ubuntu/CM/repos/local/cache/8ff2b68847874923/mlperf/bin/python3
INFO:root:Python version: 3.10.12
INFO:root: cm run script "get compiler"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/ad4709d27e2746f6/cm-cached-state.json
INFO:root: cm run script "get generic-python-lib _package.dmiparser"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/487bb3df259949b6/cm-cached-state.json
INFO:root: cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/a1971a4c4e324cc2/cm-cached-state.json
Generating SUT description file for cfe40b4a2122-pytorch
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-sut-description/customize.py
INFO:root: cm run script "get mlperf inference results dir"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/5a5d8a736e15489b/cm-cached-state.json
INFO:root: cm run script "install pip-package for-cmind-python _package.tabulate"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/ffdaabd53c414be8/cm-cached-state.json
INFO:root: cm run script "get mlperf inference utils"
INFO:root: cm run script "get mlperf inference src"
INFO:root: ! load /home/ubuntu/CM/repos/local/cache/181aac323a064657/cm-cached-state.json
INFO:root: ! call "postprocess" from /home/ubuntu/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-utils/customize.py
Using MLCommons Inference source from /home/ubuntu/CM/repos/local/cache/0ab0359edada429b/inference
Running loadgen scenario: Offline and mode: performance
INFO:root:* cm run script "app mlperf inference generic _reference _dlrm_v2-99 _pytorch _cpu _test _r4.1_default _offline"
CM error: no scripts were found with above tags and variations
variation tags ['reference', 'dlrm_v2-99', 'pytorch', 'cpu', 'test', 'r4.1_default', 'offline'] are not matching for the found script app-mlperf-inference with variations dictkeys(['cpp', 'mil', 'mlcommons-cpp', 'ctuning-cpp-tflite', 'tflite-cpp', 'reference', 'python', 'nvidia', 'mlcommons-python', 'reference,gptj', 'reference,sdxl', 'reference,dlrm-v2', 'reference,llama2-70b', 'reference,mixtral-8x7b', 'reference,resnet50', 'reference,retinanet', 'reference,bert', 'nvidia-original,r4.1-dev_default', 'nvidia-original,r4.1-devdefault,gptj', 'nvidia-original,r4.1_default', 'nvidia-original,r4.1default,gptj', 'nvidia-original,r4.1-devdefault,llama2-70b', 'nvidia-original,r4.1default,llama2-70b', 'nvidia-original', 'intel', 'intel-original', 'intel-original,gptj', 'redhat', 'qualcomm', 'kilt', 'kilt,qaic,resnet50', 'kilt,qaic,retinanet', 'kilt,qaic,bert-99', 'kilt,qaic,bert-99.9', 'intel-original,resnet50', 'intel-original,retinanet', 'intel-original,bert-99', 'intel-original,bert-99.9', 'intel-original,gptj-99', 'intel-original,gptj-99.9', 'resnet50', 'retinanet', '3d-unet-99', '3d-unet-99.9', '3d-unet', 'sdxl', 'llama2-70b', 'llama2-70b-99', 'llama2-70b-99.9', 'mixtral-8x7b', 'rnnt', 'rnnt,reference', 'gptj-99', 'gptj-99.9', 'gptj', 'gptj', 'bert', 'bert-99', 'bert-99.9', 'dlrm', 'dlrm-v2-99', 'dlrm-v2-99.9', 'dlrm_,nvidia', 'mobilenet', 'efficientnet', 'onnxruntime', 'tensorrt', 'tf', 'pytorch', 'openshift', 'ncnn', 'deepsparse', 'tflite', 'glow', 'tvm-onnx', 'tvm-pytorch', 'tvm-tflite', 'ray', 'cpu', 'cuda,reference', 'cuda', 'rocm', 'qaic', 'tpu', 'fast', 'test', 'valid,retinanet', 'valid', 'quantized', 'fp32', 'float32', 'float16', 'bfloat16', 'int4', 'int8', 'uint8', 'offline', 'multistream', 'singlestream', 'server', 'power', 'batch_size.#', 'r2.1_default', 'r3.0_default', 'r3.1_default', 'r4.0-dev_default', 'r4.0_default', 'r4.1-dev_default', 'r4.1_default']) !
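Note: the variation list in that error does include 'dlrm-v2-99' (hyphenated), while the command above used --model=dlrm_v2-99 with an underscore, i.e. the docs typo discussed at the top of this thread. Re-running with the hyphenated name should at least get past this particular "no scripts were found" error:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=dlrm-v2-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50
```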