unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Errors with pip installation in Docker containers with torch 2.5 #1190

Closed SyedA5688 closed 2 weeks ago

SyedA5688 commented 3 weeks ago

Hi there, thank you for the great work on Unsloth! I am trying to install the package in a Docker container to fine-tune Gemma models, following the Colab notebook for Gemma: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing

I am working with a Dockerfile (below) where I am trying to emulate the package versions in the Colab runtime, which are torch==2.5 and unsloth==2024.10.7.

FROM pytorch/pytorch:2.5.0-cuda12.1-cudnn9-runtime

# Below command creates home dir for 1000 UID user if it is not present.
RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi

RUN mkdir /workdir
WORKDIR /workdir

ENV LANG=C.UTF-8
RUN apt-get update --allow-releaseinfo-change && apt-get install -y git python3 python3-pip netcat

RUN python3 -m pip install --upgrade pip

# Below command installs required libraries
COPY requirements.txt /workdir/pytorch_xcloud_training/requirements.txt
RUN python3 -m pip --no-cache-dir install -r pytorch_xcloud_training/requirements.txt

RUN python3 -m pip install peft==0.13.2

# Below command installs unsloth
RUN python3 -m pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
RUN python3 -m pip install unsloth-zoo

# Install flash attention library
RUN python3 -m pip install flash-attn --no-build-isolation

COPY train_C2S_unsloth_HF_torch_lora.py /workdir/pytorch_xcloud_training/train_C2S_unsloth_HF_torch_lora.py

# Below command makes the 1000 UID user and root the owners of the workdir.
RUN chown -R 1000:root /workdir && chmod -R 775 /workdir

ENTRYPOINT ["python3", "-m", "pytorch_xcloud_training.train_C2S_unsloth_HF_torch_lora"]

I have been running into errors with my pip installation for a while now, surfacing in the call to model = FastLanguageModel.get_peft_model():

TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)
 * (tuple of ints size, *, torch.memory_format memory_format = None, Tensor out = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)

I am unsure what is causing this error. I am able to create a working Anaconda environment following the Anaconda installation instructions in the README, and the Colab notebook runs fine, but every Docker environment I create runs into this error. Any advice on this would be greatly appreciated!
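
For context, the failing call mirrors the Gemma Colab notebook linked above; a minimal sketch (the model name and hyperparameters here are placeholders, not my exact script):

from unsloth import FastLanguageModel

# Load a 4-bit Gemma checkpoint (model name is a placeholder)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detected
    load_in_4bit=True,
)

# This is the call that raises the TypeError inside the Docker container
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)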

danielhanchen commented 2 weeks ago

@SyedA5688 Wait, we don't have a Torch 2.5 tag yet - probably best to just use unsloth without the tag.
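
i.e. drop the extras tag from the install line in your Dockerfile, something along the lines of:

RUN python3 -m pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"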

SyedA5688 commented 2 weeks ago

I see, thanks for your quick reply! I've gotten farther along with building a Docker container with torch 2.5 and unsloth, but I still run into the above error. Is there, by any chance, a Docker image available for unsloth that I could base my image on?

danielhanchen commented 2 weeks ago

@SyedA5688 One way is to use the vLLM Docker image and add Unsloth on top (although vLLM doesn't yet support torch 2.5).

Also, I just added an unsloth[cu121-torch250] tag!
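
A rough sketch of the vLLM route, if you want to try it (the image tag here is an assumption - pin it to whatever vLLM release matches your CUDA/torch stack):

# Hypothetical starting point: official vLLM image, with Unsloth installed on top
FROM vllm/vllm-openai:latest
RUN python3 -m pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"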

SyedA5688 commented 2 weeks ago

Thank you for updating to torch 2.5!

After some more experimenting with Dockerfiles, I have landed on the following, based on some other posts:

ARG CUDA_VERSION="12.2.2"
ARG UBUNTU_VERSION="22.04"
ARG DOCKER_FROM=nvidia/cuda:$CUDA_VERSION-devel-ubuntu$UBUNTU_VERSION
FROM $DOCKER_FROM AS base

# Below command creates home dir for 1000 UID user if it is not present.
RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi

RUN mkdir /workdir
WORKDIR /workdir

ENV LANG=C.UTF-8
RUN apt-get update --allow-releaseinfo-change && apt-get install -y git python3 python3-pip netcat

RUN apt-get update -y && \
    apt-get install -y python3 python3-pip && \
    apt-get install -y --no-install-recommends git && \
    python3 -m pip install --upgrade pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install "trl<=0.9.0" peft==0.10.0 bitsandbytes==0.43.3 transformers[sentencepiece]==4.43.4
RUN python3 -m pip install torch==2.2.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
RUN python3 -m pip install "unsloth @ git+https://github.com/unslothai/unsloth.git@d0ca3497eb5911483339be025e9924cf73280178"
RUN python3 -m pip install --no-deps "xformers<0.0.26" --force-reinstall
RUN python3 -m pip install flash_attn==2.6.3
RUN python3 -m pip install absl-py
RUN python3 -m pip install numpy==1.26.4

COPY train_C2S_unsloth_HF_torch_lora.py /workdir/pytorch_xcloud_training/train_C2S_unsloth_HF_torch_lora.py

# Below command makes the 1000 UID user and root the owners of the workdir.
RUN chown -R 1000:root /workdir && chmod -R 775 /workdir

ENTRYPOINT ["python3", "-m", "pytorch_xcloud_training.train_C2S_unsloth_HF_torch_lora"]

When I run with this Dockerfile, all libraries seem to be set up correctly (no errors with CUDA or flash attention), but I still run into the same error as above:

TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)
 * (tuple of ints size, *, torch.memory_format memory_format = None, Tensor out = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)

This error occurs in the call to FastLanguageModel.get_peft_model(). Full stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workdir/pytorch_xcloud_training/train_C2S_unsloth_HF_torch_lora.py", line 189, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workdir/pytorch_xcloud_training/train_C2S_unsloth_HF_torch_lora.py", line 97, in main
    model = FastLanguageModel.get_peft_model(
  File "/usr/local/lib/python3.10/dist-packages/unsloth/models/llama.py", line 2125, in get_peft_model
    model = _get_peft_model(model, lora_config)
  File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 136, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1094, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 129, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 136, in __init__
    super().__init__(model, config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 148, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 325, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 220, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 295, in _create_new_module
    new_module = dispatcher(target, adapter_name, lora_config=lora_config, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/bnb.py", line 506, in dispatch_bnb_4bit
    new_module = Linear4bit(target, adapter_name, **fourbit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/bnb.py", line 293, in __init__
    self.update_layer(
  File "<string>", line 17, in LoraLayer_update_layer
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)
 * (tuple of ints size, *, torch.memory_format memory_format = None, Tensor out = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)

This error does not come up when running Unsloth Gemma-2 LoRA training in Google Colab (torch 2.5.0+cu121 and unsloth 2024.10.7) or in the Anaconda environments I tried on other Linux machines (torch 2.4.1 and unsloth 2024.10.0). Any idea what might be causing this particular error?

SyedA5688 commented 2 weeks ago

Update: after some more debugging, I found the issue: I was passing lora_rank (r) = 16.0, a float, rather than an integer; the error message made it hard to spot that the argument had the wrong type. When I passed in 16, the error was resolved.
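
In case it helps anyone else hitting the same TypeError, a minimal sketch of the bad vs. fixed call (other arguments omitted for brevity):

# Fails: r is a float, which eventually reaches torch.empty() as a non-integer dimension
model = FastLanguageModel.get_peft_model(model, r=16.0, lora_alpha=16)

# Works: r must be an int
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)

# The confusing error arises when PEFT builds the LoRA weight, roughly:
#   torch.empty((16.0, in_features), dtype=None, device=None)  -> TypeError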

Closing this issue, thank you for your responsiveness!