mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0
1.25k stars, 537 forks

CM running failed when cloning from https://github.com/GATEOverflow/inference_results_v4.0.git #1746

Open Bob123Yang opened 5 months ago

Bob123Yang commented 5 months ago

I successfully installed CM following the guide at https://docs.mlcommons.org/ck/install/

and then followed https://docs.mlcommons.org/inference/benchmarks/language/bert/ to run the script below:

    cm run script --tags=run-mlperf,inference,_find-performance,_full \
       --model=bert-99 \
       --implementation=nvidia \
       --framework=tensorrt \
       --category=edge \
       --scenario=Offline \
       --execution_mode=test \
       --device=cuda \
       --docker --quiet \
       --test_query_count=100

but it failed while cloning inference_results_v4.0.git from https://github.com/GATEOverflow/inference_results_v4.0.git (see the attached log NVIDIA_tensorRT_Docker_Bert_log1.txt for details) with the error below:


Current directory: /home/cmuser/CM/repos/local/cache/e7a6f1f4f03049e5

Cloning inference_results_v4.0.git from https://github.com/GATEOverflow/inference_results_v4.0.git

git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo

Cloning into 'repo'...
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

CM error: Portable CM script failed (name = get-git-repo, return code = 256) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Thanks!

arjunsuresh commented 5 months ago

The error looks like a network connection issue. Can you please retry?
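For transient network failures like this one, the retry can be scripted. The following is a minimal sketch and is not part of CM: `retry` is a hypothetical helper name, and the attempt count and delays are arbitrary assumptions. It reruns a command until it succeeds, doubling the wait between attempts.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failure.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example usage with the clone command from the log (left commented out
# because it needs network access):
# retry 3 5 git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo
```

If the clone keeps failing at the same point on every attempt, the cause is more likely a proxy or TLS issue than a one-off drop, and retrying alone will not help.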

Bob123Yang commented 5 months ago

Yes, I retried later and it failed again with the same error as above. Then:

  1. I tried to run the git clone manually with the command below in the CM environment, but it failed with a message saying that "repo" already exists:

    git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo


  2. So I switched to the root user and deleted the "repo" folder successfully (deleting it as a non-root user failed).
  3. Then I ran the git clone again manually, and the data transfer started: image

So I guess the network is fine for git clone from github.com/GATEOverflow/. Why doesn't it work inside docker?

Bob123Yang commented 5 months ago

Every time the docker container is built, it downloads or clones some models and repos again, which takes a long time. If some of the downloads or clones completed last time, even though the overall container build eventually failed, is there any workaround to avoid downloading or cloning them again?

Bob123Yang commented 5 months ago

I reran the same script and it got further this time, but it failed in a different place, as shown below:

image

arjunsuresh commented 5 months ago

@Bob123Yang Actually, you can build the docker container just once and reuse it for all the other benchmarks too. As for the error, can you please share more of the output? The current output only shows the PyTorch build failure. This can sometimes happen because of an incomplete download from GitHub. Are you behind a network proxy?
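One way to answer the proxy question is to compare the proxy settings seen on the host with those seen inside the container. The sketch below is an illustration only: `show_proxy_env` is a hypothetical helper name, and the list of variables covers the conventional proxy environment variables rather than anything specific to CM. Running it in both places and comparing the output shows whether the container is missing a setting that the host has.

```shell
# Hypothetical helper: print every conventional proxy environment variable
# that is set to a non-empty value in the current shell.
show_proxy_env() {
  for name in http_proxy https_proxy no_proxy HTTP_PROXY HTTPS_PROXY NO_PROXY; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name-}"
    if [ -n "$value" ]; then
      echo "$name=$value"
    fi
  done
}

show_proxy_env

# git can also carry its own proxy setting; this prints nothing if none is configured.
git config --global --get http.proxy || true
```

No output at all means no proxy is configured in that environment through these variables or through git's global `http.proxy` setting.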