llm-efficiency-challenge / neurips_llm_efficiency_challenge

NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day

Toy submission issues. Incorrect file path? #29

Closed — rasbt closed this issue 1 year ago

rasbt commented 1 year ago

Hey, I am trying to run the tutorial at https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge/tree/master/sample-submissions/lit-gpt

The tutorial doesn't say so, but if I want to make a Lit-GPT submission, I apparently have to cd into neurips_llm_efficiency_challenge/sample-submissions/lit-gpt, because that's where the Dockerfile is. Is that correct?

Now, when running the docker build step, I am getting the following error:

 master ~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt docker build -t toy_submission .
[+] Building 2.5s (12/16)                                                                     docker:default
 => [internal] load .dockerignore                                                                       0.0s
 => => transferring context: 2B                                                                         0.0s
 => [internal] load build definition from Dockerfile                                                    0.0s
 => => transferring dockerfile: 1.52kB                                                                  0.0s
 => [internal] load metadata for ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0                       0.1s
 => [ 1/12] FROM ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0@sha256:748628fda7661f7e0612299b2012c  0.0s
 => [internal] load build context                                                                       0.0s
 => => transferring context: 19.21kB                                                                    0.0s
 => CACHED [ 2/12] WORKDIR /submission                                                                  0.0s
 => CACHED [ 3/12] COPY /lit-gpt/ /submission/                                                          0.0s
 => CACHED [ 4/12] COPY ./fast_api_requirements.txt fast_api_requirements.txt                           0.0s
 => CACHED [ 5/12] RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt                0.0s
 => CACHED [ 6/12] RUN apt-get update && apt-get install -y git                                         0.0s
 => CACHED [ 7/12] RUN pip install -r requirements.txt huggingface_hub sentencepiece                    0.0s
 => ERROR [ 8/12] RUN python scripts/download.py --repo_id openlm-research/open_llama_3b                2.3s

When I try the last step manually, I get an error that the scripts/download.py file doesn't exist. Shouldn't it be lit-gpt/scripts/download.py instead?

xindi-dumbledore commented 1 year ago

Run git submodule update --init --recursive to get the lit-gpt content.

COPY /lit-gpt/ /submission/ copies the lit-gpt content into /submission, and WORKDIR /submission sets the working directory, so the build should be able to access scripts/.
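For reference, these are the relevant Dockerfile steps as they appear in the build log above (reconstructed from the log, so the actual file may differ slightly). Because lit-gpt is a git submodule, the COPY step copies an empty directory unless the submodule has been initialized first, which is why scripts/download.py is then missing inside the image:

```dockerfile
# lit-gpt is a git submodule; it must be populated on the host
# (git submodule update --init --recursive) before building
WORKDIR /submission
COPY /lit-gpt/ /submission/
RUN python scripts/download.py --repo_id openlm-research/open_llama_3b
```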

rasbt commented 1 year ago

Thanks, but it's still the same issue:

$ ~ git clone https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge.git
Cloning into 'neurips_llm_efficiency_challenge'...
remote: Enumerating objects: 330, done.        
remote: Counting objects: 100% (98/98), done.        
remote: Compressing objects: 100% (68/68), done.        
remote: Total 330 (delta 43), reused 43 (delta 30), pack-reused 232        
Receiving objects: 100% (330/330), 193.58 KiB | 21.51 MiB/s, done.
Resolving deltas: 100% (120/120), done.

$ ~ ls     
main.py  neurips_llm_efficiency_challenge
$ ~ cd neurips_llm_efficiency_challenge 
$ master ~/neurips_llm_efficiency_challenge ls
README.md  helm.md  leaderboard.md  open_api_spec.json  run_specs.conf  sample-submissions
$ master ~/neurips_llm_efficiency_challenge git submodule update --init --recursive
Submodule 'sample-submissions/lit-gpt/lit-gpt' (https://github.com/Lightning-AI/lit-gpt) registered for path 'sample-submissions/lit-gpt/lit-gpt'
Cloning into '/teamspace/studios/this_studio/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt/lit-gpt'...
Submodule path 'sample-submissions/lit-gpt/lit-gpt': checked out '1985cd8166801e9af639ba5a67fddaf4d8f3523e'
$ master ~/neurips_llm_efficiency_challenge ls
README.md  helm.md  leaderboard.md  open_api_spec.json  run_specs.conf  sample-submissions
$ master ~/neurips_llm_efficiency_challenge cd sample-submissions 
$ master ~/neurips_llm_efficiency_challenge/sample-submissions ls
lit-gpt  llama_recipes
$ master ~/neurips_llm_efficiency_challenge/sample-submissions cd lit-gpt 
$ master ~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt ls
Dockerfile  README.md  api.py  fast_api_requirements.txt  helper.py  lit-gpt  main.py
$ master ~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt docker build -t toy_submission .
[+] Building 131.2s (12/16)                                                    docker:default
 => [internal] load build definition from Dockerfile                                     0.0s
 => => transferring dockerfile: 1.52kB                                                   0.0s
 => [internal] load .dockerignore                                                        0.0s
 => => transferring context: 2B                                                          0.0s
 => [internal] load metadata for ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0        0.6s
 => [ 1/12] FROM ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0@sha256:748628fda7661  89.8s
 => => resolve ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0@sha256:748628fda7661f7e  0.0s
 => => sha256:748628fda7661f7e0612299b2012ca3a9407ac920ea791398f9d553de 1.37kB / 1.37kB  0.0s
 => => sha256:0dae61445ae902e60ece5407ab4e4b1fc25567c1e32d8d5b32d6e1552 4.93kB / 4.93kB  0.0s
 => => sha256:01085d60b3a624c06a7132ff0749efc6e6565d9f2531d7685ff559f 27.51MB / 27.51MB  0.2s
 => => sha256:b562d7b85a6579a81124c487f14e9cb7660dfe9ddb963a85373987e 10.01MB / 10.01MB  0.3s
 => => sha256:0d4d46bbff85996e4aad70cc1bb7ca40394a39e15593eae54f8f2963 3.85GB / 3.85GB  29.5s
 => => sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc 32B / 32B  0.4s
 => => sha256:2528d6dedc341a579aaa30dc5ad0efc1e645a7c407b608666e5191f9dffff61 99B / 99B  0.4s
 => => extracting sha256:01085d60b3a624c06a7132ff0749efc6e6565d9f2531d7685ff559fb5d0f66  0.9s
 => => extracting sha256:b562d7b85a6579a81124c487f14e9cb7660dfe9ddb963a85373987e0f98d99  0.6s
 => => extracting sha256:0d4d46bbff85996e4aad70cc1bb7ca40394a39e15593eae54f8f2963b5ca1  59.3s
 => => extracting sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8ac  0.0s
 => => extracting sha256:2528d6dedc341a579aaa30dc5ad0efc1e645a7c407b608666e5191f9dffff6  0.0s
 => [internal] load build context                                                        0.1s
 => => transferring context: 627.22kB                                                    0.0s
 => [ 2/12] WORKDIR /submission                                                          2.8s
 => [ 3/12] COPY /lit-gpt/ /submission/                                                  0.0s
 => [ 4/12] COPY ./fast_api_requirements.txt fast_api_requirements.txt                   0.0s
 => [ 5/12] RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt        2.2s
 => [ 6/12] RUN apt-get update && apt-get install -y git                                 9.2s
 => [ 7/12] RUN pip install -r requirements.txt huggingface_hub sentencepiece           24.1s
 => ERROR [ 8/12] RUN python scripts/download.py --repo_id openlm-research/open_llama_3  2.4s
------
 > [ 8/12] RUN python scripts/download.py --repo_id openlm-research/open_llama_3b:
Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]
2.035 Traceback (most recent call last):
2.035   File "/submission/scripts/download.py", line 74, in <module>
2.035     CLI(download_from_hub)
2.035   File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 96, in CLI
2.035     return _run_component(components, cfg_init)
2.035   File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 181, in _run_component
2.035     return component(**cfg)
2.035   File "/submission/scripts/download.py", line 45, in download_from_hub
2.035     snapshot_download(
2.035   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2.035     return fn(*args, **kwargs)
2.035   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py", line 239, in snapshot_download
2.036     thread_map(
2.036   File "/opt/conda/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
2.036     return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
2.036   File "/opt/conda/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
2.036     return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
2.036   File "/opt/conda/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
2.036     for obj in iterable:
2.036   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
2.036     yield _result_or_cancel(fs.pop())
2.036   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
2.036     return fut.result(timeout)
2.036   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
2.036     return self.__get_result()
2.036   File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2.036     raise self._exception
2.036   File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2.036     result = self.fn(*self.args, **self.kwargs)
2.036   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/_snapshot_download.py", line 214, in _inner_hf_hub_download
2.037     return hf_hub_download(
2.037   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2.037     return fn(*args, **kwargs)
2.037   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1427, in hf_hub_download
2.037     _check_disk_space(expected_size, local_dir)
2.037   File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 975, in _check_disk_space
2.037     target_dir_free = shutil.disk_usage(target_dir).free
2.037   File "/opt/conda/lib/python3.10/shutil.py", line 1331, in disk_usage
2.037     st = os.statvfs(path)
2.037 FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/openlm-research/open_llama_3b'
------
Dockerfile:21
--------------------
  19 |
  20 |     # get open-llama weights: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/download_openllama.md
  21 | >>> RUN python scripts/download.py --repo_id openlm-research/open_llama_3b
  22 |     RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/openlm-research/open_llama_3b
  23 |
--------------------
ERROR: failed to solve: process "/bin/sh -c python scripts/download.py --repo_id openlm-research/open_llama_3b" did not complete successfully: exit code: 1
$ master ~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt 
yshr-926 commented 1 year ago

I got the same issue. It seems to work with huggingface_hub==0.16.4, as mentioned here.
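If pinning the library is the workaround you want, the pin can go into the Dockerfile before the download step (version number taken from the comment above; treat this as a temporary workaround rather than a long-term fix):

```dockerfile
# workaround: pin huggingface_hub to a version whose disk-space check
# tolerates a not-yet-created target directory (version per this thread)
RUN pip install "huggingface_hub==0.16.4"
```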

msaroufim commented 1 year ago

cc @carmocca

xindi-dumbledore commented 1 year ago

It's due to there being no local checkpoints/openlm-research/open_llama_3b directory. I added

RUN mkdir checkpoints
RUN mkdir checkpoints/openlm-research
RUN mkdir checkpoints/openlm-research/open_llama_3b

in the Dockerfile and it works.
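The traceback shows why this fix works: huggingface_hub's disk-space check calls shutil.disk_usage() on the download target directory, and that call raises FileNotFoundError when the directory doesn't exist yet. A minimal sketch reproducing the failure and the fix (the free_bytes helper is hypothetical, not part of any library):

```python
import os
import shutil
import tempfile

def free_bytes(path: str) -> int:
    """Return free disk space for `path`, creating the directory first.

    Mirrors the pattern behind the RUN mkdir fix: shutil.disk_usage()
    (which wraps os.statvfs) fails on a path that doesn't exist.
    """
    os.makedirs(path, exist_ok=True)  # equivalent of the mkdir lines above
    return shutil.disk_usage(path).free

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "checkpoints", "openlm-research", "open_llama_3b")
        try:
            shutil.disk_usage(missing)  # reproduces the FileNotFoundError in the build log
        except FileNotFoundError:
            print("disk_usage failed on the missing directory")
        # once the directory exists, the disk-space check succeeds
        assert free_bytes(missing) > 0
```

Note that the three RUN mkdir lines could also be collapsed into a single `RUN mkdir -p checkpoints/openlm-research/open_llama_3b`, which creates the intermediate directories in one step.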

carmocca commented 1 year ago

I filed https://github.com/huggingface/huggingface_hub/issues/1690

rasbt commented 1 year ago

Thanks for the fix @carmocca @msaroufim, it works now! 👍