hejing / instance_containize

Any issues encountered when using bitdeer.ai

Docker image vs OS image #1

Open jianwang-ntu opened 1 month ago

jianwang-ntu commented 1 month ago

Why the current setup falls short

Many AI developers work mainly on GPU tasks and want to start training with minimal effort, for example by changing a few lines of code and launching the job.

So quick and easy setup is essential. I suggest providing an end-to-end Docker image that bundles popular libraries, so users do not need to install them themselves.

For example, installing the CUDA Toolkit and Anaconda took a long time (40 minutes). This could be avoided if they were integrated into a Docker image rather than an OS image.

How to reproduce

pip install deepspeed

ls /usr/local/cuda

which nvcc 
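The three commands above can also be combined into one check. The following is a minimal sketch, not part of the original report: the helper name `find_cuda_toolkit` is hypothetical, and it only assumes that a CUDA toolkit is normally located through the `CUDA_HOME` (or `CUDA_PATH`) environment variable, an `nvcc` binary on `PATH`, or the default `/usr/local/cuda` directory.

```python
import os
import shutil
from pathlib import Path


def find_cuda_toolkit(default_root="/usr/local/cuda"):
    """Report where (if anywhere) a CUDA toolkit is visible on this machine.

    Hypothetical helper for illustration; it only inspects the environment
    and the filesystem and does not import DeepSpeed or PyTorch.
    """
    # Environment variables commonly used to point at the toolkit root.
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    # Equivalent of `which nvcc`: None when the compiler is not on PATH.
    nvcc = shutil.which("nvcc")
    return {
        "cuda_home_env": cuda_home,
        "default_root_exists": Path(default_root).is_dir(),
        "nvcc_on_path": nvcc,
    }


if __name__ == "__main__":
    report = find_cuda_toolkit()
    for key, value in report.items():
        print(f"{key}: {value}")
    if not any(report.values()):
        print("No CUDA toolkit found; building CUDA ops will likely fail.")
```

If all three values come back empty or false, that matches the `CUDA_HOME does not exist` failure shown in the log below.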

Error output

Collecting deepspeed (from -r requirements.txt (line 5))
  Using cached deepspeed-0.14.2.tar.gz (1.3 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      [2024-05-27 01:37:52,346] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-05-27 01:37:52,450] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/setup.py", line 37, in <module>
          from op_builder import get_default_compute_capabilities, OpBuilder
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/__init__.py", line 18, in <module>
          import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/__init__.py", line 25, in <module>
          from . import ops
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/ops/__init__.py", line 15, in <module>
          from ..git_version_info import compatible_ops as __compatible_ops__
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/git_version_info.py", line 29, in <module>
          op_compatible = builder.is_compatible()
                          ^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/fp_quantizer.py", line 29, in is_compatible
          sys_cuda_major, _ = installed_cuda_version()
                              ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/builder.py", line 50, in installed_cuda_version
          raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
       [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Suggested fix

Please build the Docker image from nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04.

# For example:
docker pull nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
# With this Docker image, nvcc is installed and its path can be shown:

which nvcc 

jianwang-ntu commented 1 month ago

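The most frequently used system-level components are the CUDA Toolkit and Anaconda, plus some Python libraries installed with pip, such as PyTorch (the torch package), transformers, and datasets. The following command shows what is currently installed: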

python -m torch.utils.collect_env
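For a shorter check than the full `collect_env` report, something like the sketch below could be run inside the image. It is only an illustration: the function name `torch_cuda_summary` is hypothetical, and it falls back gracefully when PyTorch is not installed.

```python
import importlib.util


def torch_cuda_summary():
    """Summarize PyTorch's view of CUDA, or note that PyTorch is missing."""
    # Avoid an ImportError on images where PyTorch is not preinstalled.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_cuda_summary())
```

Note that `cuda_available` being true does not imply that `nvcc` is present: the pip wheels of PyTorch ship their own CUDA runtime libraries, while compiling extensions still needs the toolkit from a `devel` image.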



- If possible, please also include a tutorial script in the user's home folder showing how to fine-tune with LoRA on a toy dataset.