mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Problems with cmtools #1852

Closed Agalakdak closed 2 weeks ago

Agalakdak commented 2 weeks ago

Hi! I recently changed the video cards on my test bench. When trying to run familiar tests, I encountered errors.

I entered:

    cm run script --tags=run-mlperf,inference,_find-performance,,_full_r4.1-dev \
       --model=retinanet \
       --implementation=nvidia \
       --framework=tensorrt \
       --category=edge \
       --scenario=Offline \
       --execution_mode=test \
       --device=cuda \
       --docker --quiet \
       --test_query_count=500

And got errors:

    INFO:root:* cm run script "run-mlperf inference _find-performance _full_r4.1-dev"
    CM error: no scripts were found with above tags and variations
    variation tags ['find-performance', 'full_r4.1-dev'] are not matching for the found script run-mlperf-inference-app with variations dict_keys(['accuracy-only', 'all-modes', 'all-scenarios', 'compliance', 'dashboard', 'find-performance', 'full', 'performance-only', 'populate-readme', 'r2.1', 'r3.0', 'r3.1', 'r4.0-dev', 'r4.0', 'r4.1-dev', 'r4.1', 'short', 'performance-and-accuracy', 'submission'])
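(Editorial note, not part of the original report: in CM tag strings, variations are the comma-separated entries prefixed with `_`. Splitting the `--tags` value on commas shows why the lookup fails; this is a plain illustration, not CM's actual parser.)

    # Split the tag string on commas to see the tokens CM receives.
    echo "run-mlperf,inference,_find-performance,,_full_r4.1-dev" | tr ',' '\n'
    # run-mlperf
    # inference
    # _find-performance
    #                  <- empty token from the double comma
    # _full_r4.1-dev   <- one fused token; the known variations are _full and _r4.1-dev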

Update: my next attempt got further but failed with the following (last lines of the error message):


 /usr/include/x86_64-linux-gnu/bits/mathcalls.h(110): error: identifier "_Float32" is undefined

 Error limit reached.
 100 errors detected in the compilation of "print_cuda_devices.cu".
 Compilation terminated.

 CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)

Full log 2a100_issue_2510.log
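(Editorial note, not from the thread: the `_Float32 is undefined` errors coming from /usr/include/x86_64-linux-gnu/bits/mathcalls.h are a common symptom of an older CUDA toolkit's nvcc parsing a newer glibc's math headers. A quick way to check the version pairing, using standard commands:)

    # Print the CUDA toolkit and glibc versions; nvcc from older toolkits
    # (roughly pre-CUDA 9.2) does not recognize the _Float32 types that
    # newer glibc math headers declare.
    nvcc --version
    ldd --version | head -n 1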

arjunsuresh commented 2 weeks ago

The docs are updated now - sorry, they had a typo before. https://docs.mlcommons.org/inference/benchmarks/object_detection/retinanet/

Agalakdak commented 2 weeks ago

Hi @arjunsuresh! If the docs are still being updated, I don't mean to be rude, but may I ask when they will be ready?

Should I try the benchmarking instructions from here: https://github.com/mlcommons/inference/tree/master/language/bert or from here: https://github.com/mlcommons/inference/tree/master/vision/medical_imaging/3d-unet-kits19?

arjunsuresh commented 2 weeks ago

@Agalakdak No, the docs should be stable now. Unfortunately, the previous update had a bug that inserted an extra ",".
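(For reference, the corrected command presumably looks like this, with the extra "," removed and `_full` and `_r4.1-dev` given as separate variation tags, matching the variation list printed in the error message above; check the updated docs for the authoritative form.)

    cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
       --model=retinanet \
       --implementation=nvidia \
       --framework=tensorrt \
       --category=edge \
       --scenario=Offline \
       --execution_mode=test \
       --device=cuda \
       --docker --quiet \
       --test_query_count=500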

We are currently adding self-hosted GitHub Actions to do nightly runs on Nvidia GPUs. We have added gptj and stable diffusion so far and expect to complete the rest by next week, so the runs should be more stable after that.

The README files in the inference repository are meant for benchmark developers creating their own implementations. They are normally not helpful for benchmarking, as the READMEs are not updated after the reference implementation is created.

Agalakdak commented 2 weeks ago

@arjunsuresh Thanks, it seems to work now! :) I have accumulated quite a few questions about the benchmarks. Should I open a separate issue for them, or can I list them here?