mlcommons / ck

Collective Knowledge (CK, CM, CM4MLOps and CMX) is an educational project to learn how to run AI, ML and other emerging workloads in the most efficient and cost-effective way across diverse models, data sets, software and hardware.
https://cKnowledge.org
Apache License 2.0

fatal: unable to access 'https://github.com/mlcommons/ck/': GnuTLS recv error (-9): Error decoding the received TLS packet. #1177

Closed KingICCrab closed 1 week ago

KingICCrab commented 8 months ago

I want to reproduce nvidia-bert by following https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md#build-nvidia-docker-container-from-31-inference-round . When I run "cm docker script --tags=build,nvidia,inference,server", I run into the following problem:

=> ERROR [10/12] RUN cm pull repo mlcommons@ck 104.6s


[10/12] RUN cm pull repo mlcommons@ck:
0.255 Cloning into 'mlcommons@ck'...
104.5 error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
104.5 fatal: the remote end hung up unexpectedly
104.5 fatal: early EOF
104.5 fatal: index-pack failed
104.5 Warning: CM index is used for the first time. CM will reindex all artifacts now - it may take some time ...
104.5 =======================================================
104.5 Alias: mlcommons@ck
104.5 URL: https://github.com/mlcommons/ck
104.5
104.5 Local path: /home/cmuser/CM/repos/mlcommons@ck
104.5
104.5 git clone https://github.com/mlcommons/ck mlcommons@ck
104.5
104.5
104.5 CM error: repository was not cloned!

mlperf-inference:mlpinf-v3.1-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-l4-public.Dockerfile:32

30 |
31 | # Download CM repo for scripts
32 | >>> RUN cm pull repo mlcommons@ck
33 |
34 | # Install all system dependencies

ERROR: failed to solve: process "/bin/bash -c cm pull repo mlcommons@ck" did not complete successfully: exit code: 1

CM error: Portable CM script failed (name = build-docker-image, return code = 256)

gfursin commented 8 months ago

I think the problem is that GitHub was down or that you don't have access to it. Can you please try running "git clone https://github.com/mlcommons/ck mlcommons@ck" in some temp directory to check whether it works, and then restart the cm command once it does? Please tell us if it helps! Thanks!
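If a full clone is too slow for a quick check, the same connectivity test can be scripted. The following is only a minimal sketch and not part of CM: the helper name `can_reach_repo` and its `run` parameter are made up for illustration, and it assumes `git` is installed. It asks the remote for its `HEAD` reference with `git ls-remote`, which contacts the server without downloading the repository.

```python
import subprocess


def can_reach_repo(url, timeout=30, run=subprocess.run):
    """Return (ok, message); ok is True if `git ls-remote` can reach `url`.

    Hypothetical helper, not a CM API. `run` can be swapped for a stub in tests.
    """
    try:
        result = run(
            ["git", "ls-remote", "--exit-code", url, "HEAD"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        # The git executable itself could not be found.
        return False, "git is not installed or not on PATH"
    except subprocess.TimeoutExpired:
        return False, f"no answer from {url} within {timeout}s"
    if result.returncode != 0:
        # Pass git's own error text through so the cause stays visible.
        message = result.stderr.strip() or f"git exited with code {result.returncode}"
        return False, message
    return True, "ok"
```

Calling `can_reach_repo("https://github.com/mlcommons/ck")` inside the same environment where the build runs shows whether GitHub is reachable from there before the cm command is restarted.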

gfursin commented 8 months ago

@KingICCrab - did you try again to see if it works? I believe it's a network issue - it happens with GitHub from time to time ;) ...

KingICCrab commented 8 months ago

Thank you for your consideration! I'm sorry, but I have temporarily given up on reproducing it because I know little about Docker.

gfursin commented 8 months ago

No problem. What I meant is: could you please retry the same CM command and see if it works now:

cm docker script --tags=build,nvidia,inference,server

When there is a network issue, CM should resume building the Docker container from the point where it failed ... Thanks!
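Retrying by hand can also be automated. Here is a rough sketch, again not a CM feature: `run_with_retries` and its parameters are hypothetical names. It simply re-runs a command until it exits with code 0, waiting a little longer after each failure, and it assumes that re-running the command is safe.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=10.0, run=subprocess.run, sleep=time.sleep):
    """Run `cmd` until it exits with 0 or `attempts` runs have failed.

    Hypothetical helper, not a CM API. Returns the exit code of the last run.
    """
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = run(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Linear back-off: wait delay, then 2 * delay, and so on.
            sleep(delay * attempt)
    return returncode
```

For the command above, this would be called as `run_with_retries(["cm", "docker", "script", "--tags=build,nvidia,inference,server"])`.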

KingICCrab commented 8 months ago

After I ran the command, I got the following error (the text was printed in red):

Cloning into 'repo'...
error: RPC failed; curl 28 Failed to connect to github.com port 443: Connection timed out
fatal: the remote end hung up unexpectedly
Traceback (most recent call last):
  File "/home/cmuser/.local/bin/cm", line 8, in
    sys.exit(run())
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/cli.py", line 35, in run
    r = cm.access(argv, out='con')
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1281, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1281, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1454, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1596, in _run
    if dependent_cached_path != '' and not os.path.samefile(cached_path, dependent_cached_path):
  File "/usr/lib/python3.8/genericpath.py", line 101, in samefile
    s2 = os.stat(f2)
FileNotFoundError: [Errno 2] No such file or directory: '/home/cmuser/CM/repos/local/cache/9d809940ee024b38/repo'

gfursin commented 8 months ago

Interesting. Thank you very much again for your feedback @KingICCrab - we haven't encountered such a case before and will need to improve CM to handle it in a better way! I will keep this ticket open to check it when we have time ... Thanks again!

gfursin commented 8 months ago

I improved handling of broken CM repositories (when, for example, GitHub fails): https://github.com/mlcommons/ck/commit/c39caa38ec470e1e75ddb6679e56e5b1a079e34e . It should be available in the next CM release v2.0.3 ...
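For readers hitting the same traceback: the crash comes from `os.path.samefile`, which raises `FileNotFoundError` when one of the two paths does not exist (here, the cached repo directory that was never cloned). The sketch below only illustrates one way to guard such a comparison. It is not the code from the commit above, and `same_existing_path` is a hypothetical helper name.

```python
import os


def same_existing_path(path_a, path_b):
    """Return True only if both paths exist and refer to the same file or directory.

    Hypothetical helper for illustration; a missing or empty path yields False
    instead of raising FileNotFoundError.
    """
    if not path_a or not path_b:
        return False
    try:
        return os.path.samefile(path_a, path_b)
    except FileNotFoundError:
        # At least one path is missing, for example after a failed clone.
        return False
```

With a guard like this, a cache entry left behind by a failed clone is treated as "not the same path" rather than aborting the whole run.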

gfursin commented 1 week ago

I believe it's fixed.