ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Unable to train llama-7b on a machine with two Tesla T4 GPUs using Ray #3783

Open Ragul-Ramdass opened 8 months ago

Ragul-Ramdass commented 8 months ago

Hi, I'm trying to do distributed training of llama-7b on a VM with two Tesla T4 GPUs, using Ray with the deepspeed strategy. I'm running into the following error: "Could not pickle object as excessively deep recursion required."

(TrainTrainable pid=197290) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(TrainTrainable pid=197290) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(TrainTrainable pid=197290)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=197388) 2023-11-17 12:34:42,693 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=197290) 2023-11-17 12:34:42,794   INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[read->randomize_block_order]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 1:   0%|          | 0/1 [00:00<?, ?it/s]
(TorchTrainer pid=197290) 2023-11-17 12:34:44,518   INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[read->randomize_block_order]
(TorchTrainer pid=197290) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/pipelined_dataset_iterator.py:126: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(TorchTrainer pid=197290)   warnings.warn(
(TorchTrainer pid=197290) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/bulk_dataset_iterator.py:108: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(TorchTrainer pid=197290)   warnings.warn(
(RayTrainWorker pid=197388) [2023-11-17 12:34:50,915] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=197389) [2023-11-17 12:34:51,332] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=197389) Using DeepSpeed strategy
(RayTrainWorker pid=197389) [2023-11-17 12:34:52,968] [INFO] [comm.py:637:init_distributed] cdb=None
(RayTrainWorker pid=197388) Using DeepSpeed strategy
(RayTrainWorker pid=197388) [2023-11-17 12:34:52,968] [INFO] [comm.py:637:init_distributed] cdb=None
(RayTrainWorker pid=197389) Exception raised during training by one of the workers
(RayTrainWorker pid=197389) Traceback (most recent call last):
(RayTrainWorker pid=197389)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 206, in train_fn
(RayTrainWorker pid=197389)     model = distributed.replace_model_from_serialization(ray.get(model_ref))
(RayTrainWorker pid=197389)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/distributed/deepspeed.py", line 206, in replace_model_from_serialization
(RayTrainWorker pid=197389)     replace_tensors(model, model_weights, torch.device("cpu"))
(RayTrainWorker pid=197389)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/utils/model_utils.py", line 69, in replace_tensors
(RayTrainWorker pid=197389)     torch.nn.Parameter(torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype))),
(RayTrainWorker pid=197389)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/nn/parameter.py", line 39, in __new__
(RayTrainWorker pid=197389)     return torch.Tensor._make_subclass(cls, data, requires_grad)
(RayTrainWorker pid=197389) RuntimeError: Only Tensors of floating point and complex dtype can require gradients
(RayTrainWorker pid=197388) Exception raised during training by one of the workers
(RayTrainWorker pid=197388) Traceback (most recent call last):
(RayTrainWorker pid=197388)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 206, in train_fn
(RayTrainWorker pid=197388)     model = distributed.replace_model_from_serialization(ray.get(model_ref))
(RayTrainWorker pid=197388)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/distributed/deepspeed.py", line 206, in replace_model_from_serialization
(RayTrainWorker pid=197388)     replace_tensors(model, model_weights, torch.device("cpu"))
(RayTrainWorker pid=197388)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/utils/model_utils.py", line 69, in replace_tensors
(RayTrainWorker pid=197388)     torch.nn.Parameter(torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype))),
(RayTrainWorker pid=197388)   File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/nn/parameter.py", line 39, in __new__
(RayTrainWorker pid=197388)     return torch.Tensor._make_subclass(cls, data, requires_grad)
(RayTrainWorker pid=197388) RuntimeError: Only Tensors of floating point and complex dtype can require gradients
(RayTrainWorker pid=197388) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/bulk_dataset_iterator.py:108: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(RayTrainWorker pid=197388)   warnings.warn(
(RayTrainWorker pid=197389) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/bulk_dataset_iterator.py:108: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(RayTrainWorker pid=197389)   warnings.warn(
(RayTrainWorker pid=197388) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/utils/model_utils.py:75: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(RayTrainWorker pid=197388)   torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype)),
(RayTrainWorker pid=197389) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/utils/model_utils.py:75: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(RayTrainWorker pid=197389)   torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype)),
2023-11-17 12:34:53,276 WARNING worker.py:1866 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 501, in <lambda>
    lambda config: train_fn(**config),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 206, in train_fn
    model = distributed.replace_model_from_serialization(ray.get(model_ref))
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/distributed/deepspeed.py", line 206, in replace_model_from_serialization
    replace_tensors(model, model_weights, torch.device("cpu"))
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/utils/model_utils.py", line 69, in replace_tensors
    torch.nn.Parameter(torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype))),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/nn/parameter.py", line 39, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1166, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1072, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 805, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 972, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 611, in ray._raylet.store_task_errors
  File "python/ray/_raylet.pyx", line 2524, in ray._raylet.CoreWorker.store_task_outputs
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 450, in serialize
    return self._serialize_to_msgpack(value)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
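
For context, the root RuntimeError in the worker traceback is reproducible with a couple of lines of plain PyTorch - I assume the serialized 4-bit weights arrive as integer arrays, and nn.Parameter defaults to requires_grad=True, which autograd only allows for float/complex dtypes:

import torch

# Works: Parameters default to requires_grad=True, and autograd supports
# floating-point (and complex) dtypes.
torch.nn.Parameter(torch.zeros(3, dtype=torch.float16))

# Fails with the same error seen in the worker traceback, because integer
# tensors cannot require gradients.
torch.nn.Parameter(torch.zeros(3, dtype=torch.int8))
# RuntimeError: Only Tensors of floating point and complex dtype can require gradients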

My current OS is Ubuntu 20.04, my Python version is 3.10.13. model.yaml:

base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

preprocessing:
  sample_ratio: 1.0

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy: deepspeed
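
For reference, a minimal sketch of how a config like this is launched via the Ludwig Python API (assuming it is saved as model.yaml; "train.csv" is a placeholder for the actual dataset path):

from ludwig.api import LudwigModel

# "model.yaml" is the config shown above; the dataset path is a placeholder.
model = LudwigModel(config="model.yaml")
results = model.train(dataset="train.csv")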

Environment:

absl-py                       2.0.0
accelerate                    0.24.1
aiohttp                       3.8.6
aiohttp-cors                  0.7.0
aiorwlock                     1.3.0
aiosignal                     1.3.1
anyio                         3.7.1
asttokens                     2.4.1
async-timeout                 4.0.3
attrs                         23.1.0
backports.functools-lru-cache 1.6.5
bitsandbytes                  0.40.2
bitsandbytes-cuda116          0.26.0.post2
bleach                        6.1.0
blessed                       1.20.0
blis                          0.7.11
cachetools                    5.3.2
catalogue                     2.0.10
certifi                       2023.7.22
charset-normalizer            3.3.2
click                         8.1.7
cloudpathlib                  0.16.0
cloudpickle                   1.6.0
colorful                      0.5.5
comm                          0.1.4
commonmark                    0.9.1
confection                    0.1.3
cymem                         2.0.8
Cython                        3.0.5
dask                          2023.3.2
dataclasses-json              0.6.2
datasets                      2.15.0
debugpy                       1.6.7
decorator                     5.1.1
deepspeed                     0.12.3
dill                          0.3.7
distlib                       0.3.7
distributed                   2023.3.2
entrypoints                   0.4
et-xmlfile                    1.1.0
exceptiongroup                1.1.3
executing                     2.0.1
fastapi                       0.104.1
filelock                      3.13.1
frozenlist                    1.4.0
fsspec                        2023.9.2
getdaft                       0.1.20
google-api-core               2.14.0
google-auth                   2.23.4
google-auth-oauthlib          1.1.0
googleapis-common-protos      1.61.0
gpustat                       1.1.1
GPUtil                        1.4.0
grpcio                        1.59.2
h11                           0.14.0
h5py                          3.10.0
hjson                         3.1.0
html5lib                      1.1
httptools                     0.6.1
huggingface-hub               0.19.3
idna                          3.4
importlib-metadata            6.8.0
ipykernel                     6.26.0
ipython                       8.17.2
ipywidgets                    8.1.1
jedi                          0.19.1
Jinja2                        3.1.2
joblib                        1.3.2
jsonschema                    4.6.2
jupyter-client                7.3.4
jupyter_core                  5.5.0
jupyterlab-widgets            3.0.9
kaggle                        1.5.16
langcodes                     3.3.0
lightning-utilities           0.9.0
locket                        1.0.0
loguru                        0.7.2
ludwig                        0.8.6
lxml                          4.9.3
Markdown                      3.5.1
MarkupSafe                    2.1.3
marshmallow                   3.20.1
marshmallow-dataclass         8.5.4
marshmallow-jsonschema        0.13.0
matplotlib-inline             0.1.6
mpi4py                        3.1.4
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.4
multiprocess                  0.70.15
murmurhash                    1.0.10
mypy-extensions               1.0.0
nest-asyncio                  1.5.8
networkx                      3.2.1
ninja                         1.11.1.1
nltk                          3.8.1
numpy                         1.26.2
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  12.535.133
nvidia-nccl-cu12              2.18.1
nvidia-nvjitlink-cu12         12.3.101
nvidia-nvtx-cu12              12.1.105
oauthlib                      3.2.2
opencensus                    0.11.3
opencensus-context            0.1.3
openpyxl                      3.1.2
packaging                     23.2
pandas                        2.1.3
parso                         0.8.3
partd                         1.4.1
peft                          0.6.2
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.1.0
pip                           23.3
platformdirs                  3.11.0
preshed                       3.0.9
prometheus-client             0.18.0
prompt-toolkit                3.0.41
protobuf                      3.20.3
psutil                        5.9.4
ptyprocess                    0.7.0
pure-eval                     0.2.2
py                            1.11.0
py-cpuinfo                    9.0.0
py-spy                        0.3.14
pyarrow                       14.0.1
pyarrow-hotfix                0.5
pyasn1                        0.5.0
pyasn1-modules                0.3.0
pydantic                      1.10.13
Pygments                      2.16.1
pynvml                        11.5.0
pyrsistent                    0.20.0
python-dateutil               2.8.2
python-dotenv                 1.0.0
python-slugify                8.0.1
pytz                          2023.3.post1
pyxlsb                        1.0.10
PyYAML                        6.0
pyzmq                         25.1.0
ray                           2.3.1
regex                         2023.10.3
requests                      2.31.0
requests-oauthlib             1.3.1
retry                         0.9.2
rich                          12.4.4
rsa                           4.9
sacremoses                    0.1.1
safetensors                   0.4.0
scikit-learn                  1.3.2
scipy                         1.11.3
sentencepiece                 0.1.99
setuptools                    68.0.0
six                           1.16.0
smart-open                    6.4.0
sniffio                       1.3.0
sortedcontainers              2.4.0
spacy                         3.7.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
srsly                         2.4.8
stack-data                    0.6.2
starlette                     0.27.0
sympy                         1.12
tabulate                      0.9.0
tblib                         3.0.0
tensorboard                   2.15.1
tensorboard-data-server       0.7.2
tensorboardX                  2.6.2.2
text-unidecode                1.3
thinc                         8.2.1
threadpoolctl                 3.2.0
tokenizers                    0.15.0
toolz                         0.12.0
torch                         2.1.1
torchaudio                    2.1.1
torchdata                     0.7.1
torchinfo                     1.8.0
torchmetrics                  0.11.4
torchtext                     0.16.1
torchvision                   0.16.1
tornado                       6.1
tqdm                          4.66.1
traitlets                     5.13.0
transformers                  4.35.2
triton                        2.1.0
typer                         0.9.0
typing_extensions             4.8.0
typing-inspect                0.9.0
tzdata                        2023.3
urllib3                       2.1.0
uvicorn                       0.24.0.post1
uvloop                        0.19.0
virtualenv                    20.21.0
wasabi                        1.1.2
watchfiles                    0.21.0
wcwidth                       0.2.10
weasel                        0.3.4
webencodings                  0.5.1
websockets                    12.0
Werkzeug                      3.0.1
wheel                         0.41.2
widgetsnbextension            4.0.9
xlrd                          2.0.1
XlsxWriter                    3.1.9
xlwt                          1.3.0
xxhash                        3.4.1
yarl                          1.9.2
zict                          3.0.0
zipp                          3.17.0

Can you guide me in solving this? Thanks in advance!

alexsherstinsky commented 8 months ago

Hi @Ragul-Ramdass -- thank you for reporting this issue and the one in #3784 -- please give us a few business days to look into it and get back to you. Thank you.

SanjoySahaTigerAnalytics commented 8 months ago

I'm facing the exact same issue with both strategies - deepspeed and ddp. Below are the conda environment and model.yaml for reference:

requirements.txt

absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.9.0
aiohttp-cors==0.7.0
aiorwlock==1.3.0
aiosignal==1.3.1
anyio==3.7.1
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-timeout==4.0.3
attrs==23.1.0
awscli==1.30.3
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1687772187254/work
beautifulsoup4==4.12.2
bitsandbytes==0.40.2
bleach==6.1.0
blessed==1.20.0
blinker==1.7.0
blis==0.7.11
botocore==1.32.3
Brotli==1.1.0
cachetools==5.3.2
captum==0.6.0
catalogue==2.0.10
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.16.0
cloudpickle==3.0.0
colorama==0.4.4
colorful==0.5.5
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1691044910542/work
commonmark==0.9.1
confection==0.1.3
contourpy==1.2.0
cycler==0.12.1
cymem==2.0.8
Cython==3.0.5
dask==2023.3.2
dataclasses-json==0.6.2
datasets==2.15.0
debugpy @ file:///croot/debugpy_1690905042057/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
deepspeed==0.12.3
dill==0.3.7
distlib==0.3.7
docutils==0.16
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
et-xmlfile==1.1.0
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1692026125334/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
faiss-cpu==1.7.4
fastapi==0.104.1
filelock==3.13.1
Flask==3.0.0
Flask-Compress==1.14
fonttools==4.44.3
frozenlist==1.4.0
fsspec==2023.9.2
future==0.18.3
getdaft==0.1.20
google-api-core==2.14.0
google-auth==2.23.4
google-auth-oauthlib==1.1.0
googleapis-common-protos==1.61.0
gpustat==1.1.1
GPUtil==1.4.0
grpcio==1.51.3
h11==0.14.0
h5py==3.10.0
hiplot==0.1.33
hjson==3.1.0
html5lib==1.1
httpcore==1.0.2
httpx==0.25.1
huggingface-hub==0.19.4
hummingbird-ml==0.4.9
hyperopt==0.2.7
idna==3.4
importlib-metadata==6.8.0
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1698244021190/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1698846603011/work
itsdangerous==2.1.2
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.2
jsonschema==4.6.2
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1654730843242/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1698673647019/work
kaggle==1.5.16
kiwisolver==1.4.5
langcodes==3.3.0
lightgbm==4.1.0
lightgbm-ray==0.1.9
locket==1.0.0
loguru==0.7.2
loralib==0.1.2
ludwig @ git+https://github.com/ludwig-ai/ludwig.git@8c47c3cb16a972e0c27818a2124a3e0359142ca0
lxml==4.9.3
Markdown==3.5.1
MarkupSafe==2.1.3
marshmallow==3.20.1
marshmallow-dataclass==8.5.4
marshmallow-jsonschema==0.13.0
matplotlib==3.8.2
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
mpi4py @ file:///croot/mpi4py_1671223370575/work
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
multiprocess==0.70.15
murmurhash==1.0.10
mypy-extensions==1.0.0
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1697083700168/work
networkx==3.2.1
ninja==1.11.1.1
nltk==3.8.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.535.133
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnx==1.15.0
onnxconverter-common==1.13.0
opencensus==0.11.3
opencensus-context==0.1.3
openpyxl==3.1.2
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
pandas==2.1.3
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
partd==1.4.1
peft==0.6.2
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1667297516076/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
Pillow==10.1.0
platformdirs==3.11.0
preshed==3.0.9
prometheus-client==0.18.0
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1699963054032/work
protobuf==3.20.3
psutil==5.9.4
ptitprince==0.2.7
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
py==1.11.0
py-cpuinfo==9.0.0
py-spy==0.3.14
py4j==0.10.9.7
pyarrow==14.0.1
pyarrow-hotfix==0.5
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700320772037/work
pynvml==11.5.0
pyparsing==3.1.1
pyrsistent==0.20.0
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
python-multipart==0.0.6
python-slugify==8.0.1
pytz==2023.3.post1
pyxlsb==1.0.10
PyYAML==6.0
pyzmq @ file:///croot/pyzmq_1686601365461/work
ray==2.4.0
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
retry==0.9.2
rich==12.4.4
rsa==4.7.2
s3fs==0.4.2
s3transfer==0.7.0
sacremoses==0.1.1
safetensors==0.4.0
scikit-learn==1.3.2
scipy==1.11.4
seaborn==0.11.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
smart-open==6.4.0
sniffio==1.3.0
soupsieve==2.5
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tblib==3.0.0
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorboardX==2.2
text-unidecode==1.3
thinc==8.2.1
threadpoolctl==3.2.0
tokenizers==0.15.0
toolz==0.12.0
torch==2.1.1
torchaudio==2.1.1
torchdata==0.7.1
torchinfo==1.8.0
torchmetrics==0.11.4
torchtext==0.16.1
torchvision==0.16.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827254365/work
tqdm==4.66.1
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1698671135544/work
transformers==4.35.2
triton==2.1.0
typer==0.9.0
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1695040754690/work
tzdata==2023.3
urllib3==2.0.7
uvicorn==0.24.0.post1
virtualenv==20.21.0
wasabi==1.1.2
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1699959196938/work
weasel==0.3.4
webencodings==0.5.1
Werkzeug==3.0.1
wrapt==1.16.0
xgboost==2.0.2
xgboost-ray==0.1.18
xlrd==2.0.1
XlsxWriter==3.1.9
xlwt==1.3.0
xxhash==3.4.1
yarl==1.9.2
zipp==3.17.0

model.yaml

model_type: llm
base_model: /test/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:

output_features:

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

preprocessing:
  sample_ratio: 1.0

backend:
  type: ray
  trainer:
    use_gpu: true
    num_workers: 2
    resources_per_worker:
      CPU: 2
      GPU: 1
    strategy:
      type: ddp

SanjoySahaTigerAnalytics commented 8 months ago

Hello @alexsherstinsky - Kind follow-up on this thread. Is there any workaround to resolve this issue?

alexsherstinsky commented 8 months ago

@SanjoySahaTigerAnalytics Yes, there is! We discussed this as a team, and I received direction on how to troubleshoot it in our own environment (containing the required number of GPUs). I am planning to start on this tomorrow and continue into next week. I will post my findings for you here in the comments. Thank you very much for your patience.

SanjoySahaTigerAnalytics commented 8 months ago

Hello @alexsherstinsky - Thank you very much for prioritizing it. Will wait for your response.

SanjoySahaTigerAnalytics commented 7 months ago

Hello @alexsherstinsky - Kind follow-up on this thread. Please let us know if there has been any luck. Thank you in advance.

alexsherstinsky commented 7 months ago

@SanjoySahaTigerAnalytics -- sorry for the delay; this has been escalated to the team. Someone will investigate and respond soon. Thank you again for your patience.

arnavgarg1 commented 7 months ago

Hi @SanjoySahaTigerAnalytics! Apologies for the late response from our end. The reason you're running into issues is that 4-bit quantization isn't supported with DeepSpeed stage 3, which is what Ludwig defaults to when zero_optimization_stage isn't specified in your config.

To solve this issue, there are three options, each with its own tradeoffs - the right one depends on your goal:


1. Set backend to local instead of Ray

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

backend:
  type: local

This will perform naive model parallel training, where your 4-bit Llama-2 model is sharded across both of your GPUs, but it does not perform data parallel training. Training will likely be slower than training on just one of your two T4 GPUs, because there is overhead in passing intermediate states between GPU 1 and GPU 2 on every forward and backward pass. However, it will not run into any errors, and it is the path I recommend for now.
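
To make the mechanics concrete, here is a rough sketch of naive model parallelism in plain Hugging Face terms (illustrative only, not Ludwig's exact internals):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# device_map="auto" places the model's layers across all visible GPUs, so a
# single forward/backward pass hops from GPU 0 to GPU 1. There is no data
# parallelism: both GPUs cooperate on one copy of the model.
model = AutoModelForCausalLM.from_pretrained(
    "/root/CodeLlama-7b-Python-hf",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)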


2. Use DeepSpeed Stage 3 without quantization

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: true
      bf16:
        enabled: true

This will perform data parallel + model parallel training across both of your GPUs. Under the hood, it shards your model across both GPU devices and also shards the data across the total number of workers. During each forward pass, a few all-gather and all-reduce operations propagate model states to each of the GPUs, and similar collectives are used to compute gradients and update the weights during the backward pass. This can also be a bit slow, but it works nicely for larger models.

The drawback here is, as I said earlier, that DeepSpeed Stage 3 unfortunately doesn't work with quantized models such as 4-bit models. The reason is that Stage 3 shards the weights, but it assumes that all layers share the same data type, and it particularly doesn't like nf4/int8 formats mixed with fp16 LoRA layers. For that reason, you'll notice that I removed the quantization section from this config.


3. Use 4-bit quantization with DeepSpeed Stage 2

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2

DeepSpeed Stage 2 doesn't do any sharding of model weights - just the gradients and optimizer state. Since 4-bit quantized Llama-2-7b fits on a single T4 GPU, this essentially gives you Distributed Data Parallel (DDP) style training with the side benefit of sharding the gradients and optimizer states across GPUs (and the option to offload them to CPU if needed). This would be the ideal solution for fine-tuning Llama-7b on a machine with 2 T4 GPUs, but it is not currently supported by Ludwig.

We have an active PR (https://github.com/ludwig-ai/ludwig/pull/3728) that we're hoping to merge into Ludwig master by the end of this week, or early next week at the latest. Stay tuned!


Parting thoughts

For now, I would recommend going with approach 1 and setting CUDA_VISIBLE_DEVICES to either a single GPU or both GPUs, depending on what you'd like - I expect that a single GPU will actually train faster in this case, but it is worth checking.
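
A minimal sketch of what I mean, assuming the config is saved as model.yaml and "train.csv" is a placeholder dataset path:

import os

# Must be set before torch/Ludwig initialize CUDA; use "0,1" to expose both T4s.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from ludwig.api import LudwigModel

model = LudwigModel(config="model.yaml")
results = model.train(dataset="train.csv")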

The last thing I want to mention is that in your config, max_sequence_length is set to 2048 for both the input feature and the output feature, which means the model will do forward passes over a maximum of 4096 tokens total (the same as the Llama-2 context window in the base model). That may be fine when you use the local backend with both of your T4 GPUs, since you effectively get a lot more GPU VRAM, but typically a single T4 GPU can only fit a max sequence length of 2048 before it OOMs. That may be something to take a look at as well if you run into OOM errors.
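
If you do hit OOMs, one illustrative way to cap the token budget (the values below are examples, not tuned recommendations) is to halve each feature's max_sequence_length so the combined budget stays near 2048:

# Illustrative fragment of the two feature sections only;
# 1024 + 1024 keeps the total forward-pass token budget near 2048.
config_patch = {
    "input_features": [
        {"name": "prompt", "type": "text",
         "preprocessing": {"max_sequence_length": 1024}},
    ],
    "output_features": [
        {"name": "Response", "type": "text",
         "preprocessing": {"max_sequence_length": 1024}},
    ],
}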

Hope this helps unblock you!

SanjoySahaTigerAnalytics commented 7 months ago

Hello @arnavgarg1 - Thank you very much for looking into this. I will wait for Option 3 once you have the PR merged. For now I have tried Option 1 and Option 2, and both are failing with errors. Details below:

With Option 1, training failed with a CUDA OOM error.

I have changed the config as below (instead of using 2048 for the input and output features separately, I am using 4096 for both combined) and set PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:128:

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text

output_features:
  - name: Response
    type: text

preprocessing:
    max_sequence_length: 4096

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

preprocessing:
  sample_ratio: 1.0

backend:
  type: local

The training process failed at 13%:

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                               │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                          │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /root/train_llama_using_ludwig_exp2/results/api_experiment_run                               │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version   │ '0.9.dev'                                                                                    │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ command          │ ('/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ipykernel_launcher.py ' │
│                  │  '--f=/root/.local/share/jupyter/runtime/kernel-v2-17885gVhtM1z18wl.json')                   │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ random_seed      │ 42                                                                                           │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ data_format      │ "<class 'pandas.core.frame.DataFrame'>"                                                      │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ torch_version    │ '2.1.1+cu121'                                                                                │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ compute          │ {   'arch_list': [   'sm_50',                                                                │
│                  │                      'sm_60',                                                                │
│                  │                      'sm_70',                                                                │
│                  │                      'sm_75',                                                                │
│                  │                      'sm_80',                                                                │
│                  │                      'sm_86',                                                                │
│                  │                      'sm_90'],                                                               │
│                  │     'devices': {   0: {   'device_capability': (7, 5),                                       │
│                  │                           'device_properties': "_CudaDeviceProperties(name='Tesla "          │
│                  │                                                "T4', major=7, minor=5, "                     │
│                  │                                                'total_memory=14929MB, '                      │
│                  │                                                'multi_processor_count=40)',                  │
│                  │                           'gpu_type': 'Tesla T4'},                                           │
│                  │                    1: {   'device_capability': (7, 5),                                       │
│                  │                           'device_properties': "_CudaDeviceProperties(name='Tesla "          │
│                  │                                                "T4', major=7, minor=5, "                     │
│                  │                                                'total_memory=14929MB, '                      │
│                  │                                                'multi_processor_count=40)',                  │
│                  │                           'gpu_type': 'Tesla T4'}},                                          │
│                  │     'gencode_flags': '-gencode compute=compute_50,code=sm_50 -gencode '                      │
│                  │                      'compute=compute_60,code=sm_60 -gencode '                               │
│                  │                      'compute=compute_70,code=sm_70 -gencode '                               │
│                  │                      'compute=compute_75,code=sm_75 -gencode '                               │
│                  │                      'compute=compute_80,code=sm_80 -gencode '                               │
│                  │                      'compute=compute_86,code=sm_86 -gencode '                               │
│                  │                      'compute=compute_90,code=sm_90',                                        │
│                  │     'gpus_per_node': 2,                                                                      │
│                  │     'num_nodes': 1}                                                                          │
╘══════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛

User-specified config (with upgrades):

{   'adapter': {'type': 'lora'},
    'backend': {'type': 'local'},
    'base_model': '/root/CodeLlama-7b-Python-hf',
    'input_features': [{'name': 'prompt', 'type': 'text'}],
    'ludwig_version': '0.9.dev',
    'model_type': 'llm',
    'output_features': [{'name': 'Response', 'type': 'text'}],
    'preprocessing': {'sample_ratio': 1.0},
    'prompt': {   'template': '### Instruction:\n'
                              '{Instruction}\n'
                              '\n'
                              '### Context:\n'
                              '{Context}\n'
                              '\n'
                              '### Input:\n'
                              '{Input}\n'
                              '\n'
                              '### Response:\n'},
    'quantization': {'bits': 4},
    'trainer': {   'batch_size': 1,
                   'enable_gradient_checkpointing': True,
                   'epochs': 3,
                   'gradient_accumulation_steps': 1,
                   'learning_rate': 0.0001,
                   'learning_rate_scheduler': {'warmup_fraction': 0.01},
                   'max_batch_size': 1,
                   'type': 'finetune'}}

Full config saved to:
/root/train_llama_using_ludwig_exp2/results/api_experiment_run/api_experiment/model/model_hyperparameters.json

╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛

No cached dataset found at /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.training.hdf5. Preprocessing the dataset.
Using full dataframe
Building dataset (it may take a while)
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'None': 1034 (without start and stop symbols)
Max sequence length is 1034 for feature 'None'
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'Response': 3672 (without start and stop symbols)
Max sequence length is 3672 for feature 'Response'
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Building dataset: DONE
Writing preprocessed training set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.training.hdf5
Writing preprocessed validation set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.validation.hdf5
Writing preprocessed test set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.test.hdf5
Writing train set metadata to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.meta.json

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │          2016 │ 472.62 Kb          │
├────────────┼───────────────┼────────────────────┤
│ Validation │           288 │ 67.62 Kb           │
├────────────┼───────────────┼────────────────────┤
│ Test       │           576 │ 135.12 Kb          │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Warnings and other logs:
Loading large language model...
Loading checkpoint shards: 100%|██████████| 3/3 [01:12<00:00, 24.29s/it]
Done.
No padding token found. Using '[PAD]' as the pad token.

Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
==================================================
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
==================================================
Gradient checkpointing enabled for training.

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 6048 step(s), approximately 3 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 10080 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:   0%|          | 0/6048 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Training:  13%|█▎        | 771/6048 [1:00:15<3:21:40,  2.29s/it, loss=0.0453]
{
    "name": "OutOfMemoryError",
    "message": "CUDA out of memory. Tried to allocate 2.19 GiB. GPU 1 has a total capacty of 14.58 GiB of which 1.39 GiB is free. Including non-PyTorch memory, this process has 13.18 GiB memory in use. Of the allocated memory 12.22 GiB is allocated by PyTorch, and 846.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
    "stack": "---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[8], line 1
----> 1 results = model.train(dataset=df)

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/api.py:677, in LudwigModel.train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
    670     callback.on_train_start(
    671         model=self.model,
    672         config=self.config_obj.to_dict(),
    673         config_fp=self.config_fp,
    674     )
    676 try:
--> 677     train_stats = trainer.train(
    678         training_set,
    679         validation_set=validation_set,
    680         test_set=test_set,
    681         save_path=model_dir,
    682     )
    683     (self.model, train_trainset_stats, train_valiset_stats, train_testset_stats) = train_stats
    685     # Calibrates output feature probabilities on validation set if calibration is enabled.
    686     # Must be done after training, and before final model parameters are saved.

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:970, in Trainer.train(self, training_set, validation_set, test_set, save_path, return_state_dict, **kwargs)
    967 self.callback(lambda c: c.on_epoch_start(self, progress_tracker, save_path))
    969 # Trains over a full epoch of data or up to the last training step, whichever is sooner.
--> 970 should_break = self._train_loop(
    971     batcher,
    972     progress_tracker,
    973     save_path,
    974     train_summary_writer,
    975     progress_bar,
    976     training_set,
    977     validation_set,
    978     test_set,
    979     start_time,
    980     validation_summary_writer,
    981     test_summary_writer,
    982     model_hyperparameters_path,
    983     output_features,
    984     metrics_names,
    985     checkpoint_manager,
    986     final_steps_per_checkpoint,
    987     early_stopping_steps,
    988     profiler,
    989 )
    990 if self.is_coordinator():
    991     # ========== Save training progress ==========
    992     logger.debug(
    993         f\"Epoch {progress_tracker.epoch} took: \"
    994         f\"{time_utils.strdelta((time.time() - start_time) * 1000.0)}.\"
    995     )

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:1144, in Trainer._train_loop(self, batcher, progress_tracker, save_path, train_summary_writer, progress_bar, training_set, validation_set, test_set, start_time, validation_summary_writer, test_summary_writer, model_hyperparameters_path, output_features, metrics_names, checkpoint_manager, final_steps_per_checkpoint, early_stopping_steps, profiler)
   1135 inputs = {
   1136     i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(self.device)
   1137     for i_feat in self.model.input_features.values()
   1138 }
   1139 targets = {
   1140     o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(self.device)
   1141     for o_feat in self.model.output_features.values()
   1142 }
-> 1144 loss, all_losses = self.train_step(inputs, targets, should_step=should_step, profiler=profiler)
   1146 # Update LR schduler here instead of train loop to avoid updating during batch size tuning, etc.
   1147 self.scheduler.step()

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:315, in Trainer.train_step(self, inputs, targets, should_step, profiler)
    313     self.scaler.scale(loss).backward()
    314 else:
--> 315     self.distributed.backward(loss, self.dist_model)
    317 if not should_step:
    318     # Short-circuit the parameter updates if we are still accumulating gradients
    319     return loss, all_losses

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/distributed/base.py:58, in DistributedStrategy.backward(self, loss, model)
     57 def backward(self, loss: torch.Tensor, model: nn.Module):
---> 58     loss.backward()

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/function.py:288, in BackwardCFunction.apply(self, *args)
    282     raise RuntimeError(
    283         \"Implementing both 'backward' and 'vjp' for a custom \"
    284         \"Function is not allowed. You should only implement one \"
    285         \"of them.\"
    286     )
    287 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 288 return user_fn(self, *args)

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:288, in CheckpointFunction.backward(ctx, *args)
    283 if len(outputs_with_grad) == 0:
    284     raise RuntimeError(
    285         \"none of output has requires_grad=True,\"
    286         \" this checkpoint() is not necessary\"
    287     )
--> 288 torch.autograd.backward(outputs_with_grad, args_with_grad)
    289 grads = tuple(
    290     inp.grad if isinstance(inp, torch.Tensor) else None
    291     for inp in detached_inputs
    292 )
    294 return (None, None) + grads

File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.19 GiB. GPU 1 has a total capacity of 14.58 GiB of which 1.39 GiB is free. Including non-PyTorch memory, this process has 13.18 GiB memory in use. Of the allocated memory 12.22 GiB is allocated by PyTorch, and 846.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
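
The allocator hint at the end of that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch first initializes CUDA. A minimal sketch, assuming training is launched from a Python entrypoint; the 128 MB split size is an arbitrary starting point, not a tuned recommendation:

# Hedged sketch: apply the allocator hint from the OOM message above.
# max_split_size_mb:128 is an arbitrary starting value; the variable must be
# set before torch initializes CUDA or the allocator will ignore it.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable, deliberately

print(torch.cuda.is_available())

Fragmentation tuning alone cannot recover a 2.19 GiB allocation when only 1.39 GiB is free, though; reducing max_sequence_length or the batch size is the more likely remedy.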

With Option 2, training failed with a RecursionError: maximum recursion depth exceeded in comparison.

2023-12-12 09:27:39,257 WARNING util.py:244 -- The `start_trial` operation took 0.692 s, which may be a performance bottleneck.
(TrainTrainable pid=7142) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(TrainTrainable pid=7142)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(TrainTrainable pid=7142) /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(RayTrainWorker pid=7207) 2023-12-12 09:27:57,361  INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=7142) 2023-12-12 09:27:57,498    INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->RandomizeBlockOrder]
(TorchTrainer pid=7142) 2023-12-12 09:27:57,499    INFO streaming_executor.py:84 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(TorchTrainer pid=7142) 2023-12-12 09:27:59,040    INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->RandomizeBlockOrder]
(TorchTrainer pid=7142) 2023-12-12 09:27:59,041    INFO streaming_executor.py:84 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(RayTrainWorker pid=7207) [2023-12-12 09:28:08,709] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2023-12-12 09:28:11,658 WARNING worker.py:1986 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 870, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 921, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 877, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 881, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 821, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/function_manager.py", line 670, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 501, in <lambda>
    lambda config: train_fn(**config),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 193, in train_fn
    val_shard = RayDatasetShard(val_shard, features, training_set_metadata)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 244, in __init__
    self.create_epoch_iter()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 268, in create_epoch_iter
    self.epoch_iter = self.dataset_shard.repeat().iter_epochs()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/dataset_iterator/dataset_iterator_impl.py", line 48, in __getattr__
    raise DeprecationWarning(
DeprecationWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1197, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1100, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 823, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1001, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 623, in ray._raylet.store_task_errors
  File "python/ray/_raylet.pyx", line 2563, in ray._raylet.CoreWorker.store_task_outputs
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 466, in serialize
    return self._serialize_to_msgpack(value)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 421, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
An unexpected internal error occurred while the worker was executing a task.
2023-12-12 09:28:11,664 WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff3af2f4d99454008df9ee8ed701000000 Worker ID: 6b97c810d37673028c94be550f07d7d2ce73b6d2f867004f1d13eb3a Node ID: 6ea58fc55f420c9b016bd501b6eb841ea09722c353633d82fc917ba6 Worker IP address: 10.128.0.3 Worker port: 35699 Worker PID: 7208 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.
 Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 870, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 921, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 877, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 881, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 821, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/function_manager.py", line 670, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 501, in <lambda>
    lambda config: train_fn(**config),
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 193, in train_fn
    val_shard = RayDatasetShard(val_shard, features, training_set_metadata)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 244, in __init__
    self.create_epoch_iter()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 268, in create_epoch_iter
    self.epoch_iter = self.dataset_shard.repeat().iter_epochs()
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/dataset_iterator/dataset_iterator_impl.py", line 48, in __getattr__
    raise DeprecationWarning(
DeprecationWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
  File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison
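
The DeprecationWarning inside this traceback looks like the real trigger: Ray >= 2.3 hands each train worker a DatasetIterator, while ludwig/backend/ray.py still calls the removed repeat().iter_epochs() pipeline API, and the raised warning then fails to cloudpickle, which surfaces as the RecursionError. A minimal sketch of the iteration pattern the warning itself recommends, independent of Ludwig (the range dataset and batch size are placeholders, and Dataset.iterator() is assumed available on this Ray version):

# Hedged sketch of the Ray >= 2.3 iteration contract named in the warning;
# it illustrates the new API rather than patching Ludwig itself.
import ray

ds = ray.data.range(32)   # placeholder dataset
it = ds.iterator()        # DatasetIterator, the type train workers now receive

for epoch in range(2):    # one pass per epoch replaces repeat().iter_epochs()
    for batch in it.iter_batches(batch_size=8):
        pass              # training step would go here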

arnavgarg1 commented 7 months ago

Thanks for reporting the results back! I may know the cause of both, but just to check - which versions of Torch and Ray are you using?

SanjoySahaTigerAnalytics commented 7 months ago

Thanks @arnavgarg1 for the quick response. Torch 2.1.1, Ray 2.4.0.

Below is my Conda Environment:


# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.0.0                    pypi_0    pypi
accelerate                0.25.0                   pypi_0    pypi
aiohttp                   3.9.1                    pypi_0    pypi
aiohttp-cors              0.7.0                    pypi_0    pypi
aiorwlock                 1.3.0                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
anyio                     3.7.1                    pypi_0    pypi
asttokens                 2.4.1              pyhd8ed1ab_0    conda-forge
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
awscli                    1.31.12                  pypi_0    pypi
beautifulsoup4            4.12.2                   pypi_0    pypi
bitsandbytes              0.40.2                   pypi_0    pypi
bleach                    6.1.0                    pypi_0    pypi
blessed                   1.20.0                   pypi_0    pypi
blinker                   1.7.0                    pypi_0    pypi
blis                      0.7.11                   pypi_0    pypi
botocore                  1.33.12                  pypi_0    pypi
brotli                    1.1.0                    pypi_0    pypi
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.11.17           hbcca054_0    conda-forge
cachetools                5.3.2                    pypi_0    pypi
captum                    0.7.0                    pypi_0    pypi
catalogue                 2.0.10                   pypi_0    pypi
certifi                   2023.11.17               pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cloudpathlib              0.16.0                   pypi_0    pypi
cloudpickle               3.0.0                    pypi_0    pypi
colorama                  0.4.4                    pypi_0    pypi
colorful                  0.5.5                    pypi_0    pypi
comm                      0.1.4              pyhd8ed1ab_0    conda-forge
commonmark                0.9.1                    pypi_0    pypi
confection                0.1.4                    pypi_0    pypi
contourpy                 1.2.0                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
cymem                     2.0.8                    pypi_0    pypi
cython                    3.0.6                    pypi_0    pypi
dask                      2023.3.2                 pypi_0    pypi
dataclasses-json          0.6.3                    pypi_0    pypi
datasets                  2.15.0                   pypi_0    pypi
debugpy                   1.6.7           py310h6a678d5_0  
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
deepspeed                 0.12.4                   pypi_0    pypi
dill                      0.3.7                    pypi_0    pypi
distlib                   0.3.8                    pypi_0    pypi
docutils                  0.16                     pypi_0    pypi
entrypoints               0.4                pyhd8ed1ab_0    conda-forge
et-xmlfile                1.1.0                    pypi_0    pypi
exceptiongroup            1.2.0              pyhd8ed1ab_0    conda-forge
executing                 2.0.1              pyhd8ed1ab_0    conda-forge
faiss-cpu                 1.7.4                    pypi_0    pypi
fastapi                   0.105.0                  pypi_0    pypi
filelock                  3.13.1                   pypi_0    pypi
flask                     3.0.0                    pypi_0    pypi
flask-compress            1.14                     pypi_0    pypi
fonttools                 4.46.0                   pypi_0    pypi
frozenlist                1.4.0                    pypi_0    pypi
fsspec                    2023.10.0                pypi_0    pypi
future                    0.18.3                   pypi_0    pypi
getdaft                   0.1.20                   pypi_0    pypi
google-api-core           2.15.0                   pypi_0    pypi
google-auth               2.25.2                   pypi_0    pypi
google-auth-oauthlib      1.1.0                    pypi_0    pypi
googleapis-common-protos  1.62.0                   pypi_0    pypi
gpustat                   1.1.1                    pypi_0    pypi
gputil                    1.4.0                    pypi_0    pypi
grpcio                    1.51.3                   pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
h5py                      3.10.0                   pypi_0    pypi
hiplot                    0.1.33                   pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
html5lib                  1.1                      pypi_0    pypi
httpcore                  1.0.2                    pypi_0    pypi
httpx                     0.25.2                   pypi_0    pypi
huggingface-hub           0.19.4                   pypi_0    pypi
hummingbird-ml            0.4.9                    pypi_0    pypi
hyperopt                  0.2.7                    pypi_0    pypi
idna                      3.6                      pypi_0    pypi
imagecodecs               2023.9.18                pypi_0    pypi
importlib-metadata        7.0.0                    pypi_0    pypi
ipykernel                 6.26.0             pyhf8b6a83_0    conda-forge
ipython                   8.18.1             pyh707e725_3    conda-forge
itsdangerous              2.1.2                    pypi_0    pypi
jedi                      0.19.1             pyhd8ed1ab_0    conda-forge
jinja2                    3.1.2                    pypi_0    pypi
jmespath                  1.0.1                    pypi_0    pypi
joblib                    1.3.2                    pypi_0    pypi
jsonschema                4.6.2                    pypi_0    pypi
jupyter_client            7.3.4              pyhd8ed1ab_0    conda-forge
jupyter_core              5.5.0           py310hff52083_0    conda-forge
kaggle                    1.5.16                   pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
langcodes                 3.3.0                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   11.2.0               h1234567_1  
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.41.5               h5eee18b_0  
lightgbm                  4.1.0                    pypi_0    pypi
lightgbm-ray              0.1.9                    pypi_0    pypi
locket                    1.0.0                    pypi_0    pypi
loguru                    0.7.2                    pypi_0    pypi
loralib                   0.1.2                    pypi_0    pypi
ludwig                    0.9.dev0                 pypi_0    pypi
lxml                      4.9.3                    pypi_0    pypi
markdown                  3.5.1                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
marshmallow               3.20.1                   pypi_0    pypi
marshmallow-dataclass     8.5.4                    pypi_0    pypi
marshmallow-jsonschema    0.13.0                   pypi_0    pypi
matplotlib                3.8.2                    pypi_0    pypi
matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
mpi                       1.0                       mpich  
mpi4py                    3.1.4           py310hfc96bbd_0  
mpich                     3.3.2                hc856adb_0  
mpmath                    1.3.0                    pypi_0    pypi
msgpack                   1.0.7                    pypi_0    pypi
multidict                 6.0.4                    pypi_0    pypi
multiprocess              0.70.15                  pypi_0    pypi
murmurhash                1.0.10                   pypi_0    pypi
mypy-extensions           1.0.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.5.8              pyhd8ed1ab_0    conda-forge
networkx                  3.2.1                    pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
nltk                      3.8.1                    pypi_0    pypi
numpy                     1.26.2                   pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-ml-py              12.535.133               pypi_0    pypi
nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.3.101                 pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
onnx                      1.15.0                   pypi_0    pypi
onnxconverter-common      1.13.0                   pypi_0    pypi
opencensus                0.11.3                   pypi_0    pypi
opencensus-context        0.1.3                    pypi_0    pypi
openpyxl                  3.1.2                    pypi_0    pypi
openssl                   3.0.12               h7f8727e_0  
packaging                 23.2               pyhd8ed1ab_0    conda-forge
pandas                    2.1.4                    pypi_0    pypi
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
partd                     1.4.1                    pypi_0    pypi
peft                      0.7.0                    pypi_0    pypi
pexpect                   4.9.0                    pypi_0    pypi
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    10.1.0                   pypi_0    pypi
pip                       23.3.1          py310h06a4308_0  
platformdirs              3.11.0                   pypi_0    pypi
preshed                   3.0.9                    pypi_0    pypi
prometheus-client         0.19.0                   pypi_0    pypi
prompt-toolkit            3.0.41             pyha770c72_0    conda-forge
protobuf                  3.20.3                   pypi_0    pypi
psutil                    5.9.4                    pypi_0    pypi
ptitprince                0.2.7                    pypi_0    pypi
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
py                        1.11.0                   pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
py-spy                    0.3.14                   pypi_0    pypi
py4j                      0.10.9.7                 pypi_0    pypi
pyarrow                   14.0.1                   pypi_0    pypi
pyarrow-hotfix            0.6                      pypi_0    pypi
pyasn1                    0.5.1                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pydantic                  1.10.13                  pypi_0    pypi
pygments                  2.17.2             pyhd8ed1ab_0    conda-forge
pynvml                    11.5.0                   pypi_0    pypi
pyparsing                 3.1.1                    pypi_0    pypi
pyrsistent                0.20.0                   pypi_0    pypi
python                    3.10.13              h955ad1f_0  
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-multipart          0.0.6                    pypi_0    pypi
python-slugify            8.0.1                    pypi_0    pypi
python_abi                3.10                    2_cp310    conda-forge
pytz                      2023.3.post1             pypi_0    pypi
pyxlsb                    1.0.10                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     25.1.0          py310h6a678d5_0  
ray                       2.4.0                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
regex                     2023.10.3                pypi_0    pypi
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         1.3.1                    pypi_0    pypi
retry                     0.9.2                    pypi_0    pypi
rich                      12.4.4                   pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
s3fs                      0.4.2                    pypi_0    pypi
s3transfer                0.8.2                    pypi_0    pypi
sacremoses                0.1.1                    pypi_0    pypi
safetensors               0.4.1                    pypi_0    pypi
scikit-learn              1.3.2                    pypi_0    pypi
scipy                     1.11.4                   pypi_0    pypi
seaborn                   0.11.0                   pypi_0    pypi
sentence-transformers     2.2.2                    pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                68.0.0          py310h06a4308_0  
six                       1.16.0             pyh6c4a22f_0    conda-forge
smart-open                6.4.0                    pypi_0    pypi
sniffio                   1.3.0                    pypi_0    pypi
soupsieve                 2.5                      pypi_0    pypi
spacy                     3.7.2                    pypi_0    pypi
spacy-legacy              3.0.12                   pypi_0    pypi
spacy-loggers             1.0.5                    pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
srsly                     2.4.8                    pypi_0    pypi
stack-data                0.6.3                    pypi_0    pypi
stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
starlette                 0.27.0                   pypi_0    pypi
sympy                     1.12                     pypi_0    pypi
tabulate                  0.9.0                    pypi_0    pypi
tblib                     3.0.0                    pypi_0    pypi
tensorboard               2.15.1                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorboardx              2.2                      pypi_0    pypi
text-unidecode            1.3                      pypi_0    pypi
thinc                     8.2.1                    pypi_0    pypi
threadpoolctl             3.2.0                    pypi_0    pypi
tifffile                  2023.12.9                pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.15.0                   pypi_0    pypi
toolz                     0.12.0                   pypi_0    pypi
torch                     2.1.1                    pypi_0    pypi
torchaudio                2.1.1                    pypi_0    pypi
torchdata                 0.7.1                    pypi_0    pypi
torchinfo                 1.8.0                    pypi_0    pypi
torchmetrics              0.11.4                   pypi_0    pypi
torchtext                 0.16.1                   pypi_0    pypi
torchvision               0.16.1                   pypi_0    pypi
tornado                   6.1             py310h5764c6d_3    conda-forge
tqdm                      4.66.1                   pypi_0    pypi
traitlets                 5.14.0             pyhd8ed1ab_0    conda-forge
transformers              4.35.2                   pypi_0    pypi
triton                    2.1.0                    pypi_0    pypi
typer                     0.9.0                    pypi_0    pypi
typing-inspect            0.9.0                    pypi_0    pypi
typing_extensions         4.9.0              pyha770c72_0    conda-forge
tzdata                    2023.3                   pypi_0    pypi
urllib3                   2.0.7                    pypi_0    pypi
uvicorn                   0.24.0.post1             pypi_0    pypi
virtualenv                20.21.0                  pypi_0    pypi
wasabi                    1.1.2                    pypi_0    pypi
wcwidth                   0.2.12             pyhd8ed1ab_0    conda-forge
weasel                    0.3.4                    pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
werkzeug                  3.0.1                    pypi_0    pypi
wheel                     0.41.2          py310h06a4308_0  
wrapt                     1.16.0                   pypi_0    pypi
xgboost                   2.0.2                    pypi_0    pypi
xgboost-ray               0.1.18                   pypi_0    pypi
xlrd                      2.0.1                    pypi_0    pypi
xlsxwriter                3.1.9                    pypi_0    pypi
xlwt                      1.3.0                    pypi_0    pypi
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.5                h5eee18b_0  
yarl                      1.9.4                    pypi_0    pypi
zeromq                    4.3.4                h2531618_0  
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0  
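
One thing worth checking in this environment: ray 2.4.0 is installed next to ludwig 0.9.dev0, and the traceback above fails inside a Ray API that changed in 2.3, so the two may simply disagree on the dataset interface. A small hedged check, standard library only, to compare the installed Ray against whatever this Ludwig build declares:

# Hedged sketch: list the Ray requirement(s) this Ludwig installation declares
# and the Ray actually installed; a mismatch here would explain the failure
# inside ray/data rather than inside the model code.
from importlib.metadata import requires, version

ray_reqs = [r for r in (requires("ludwig") or []) if r.startswith("ray")]
print("installed ray:", version("ray"))
print("ludwig declares:", ray_reqs or "no explicit ray requirement found")
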
SanjoySahaTigerAnalytics commented 7 months ago

Hello @arnavgarg1 - a kind follow-up on this. In the meantime, when I ran with the config below, the process completed successfully on both infra configurations.

Is there any way to make training succeed with max_sequence_length set to 4096 (merging both input and output)? As you mentioned, a single GPU supports up to 2048; is a 4096 context length achievable with multiple GPUs? (See the sketch after the config below.)

Config

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048

trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

preprocessing:
  sample_ratio: 1.0

backend:
  type: local
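
On the 4096-token question: DeepSpeed can shard parameters, gradients, and optimizer state across the two T4s, but activation memory still grows with sequence length on every GPU, so a 4096-token budget on 2 x 16 GB is not guaranteed even with gradient checkpointing. Below is a hedged sketch of how the config above might be pointed at both GPUs, written as a Python dict for LudwigModel; the backend fields follow my reading of Ludwig's Ray/DeepSpeed schema, ZeRO stage 3 and num_workers=2 are assumptions, and quantization is omitted to keep the sketch to the distribution pieces:

# Hedged sketch, not a verified recipe: the two-GPU variant of the config
# above (prompt template omitted for brevity). Backend fields follow my
# reading of Ludwig's Ray/DeepSpeed schema; ZeRO stage 3 and num_workers=2
# (one per T4) are assumptions.
config = {
    "model_type": "llm",
    "base_model": "/root/CodeLlama-7b-Python-hf",
    "adapter": {"type": "lora"},
    "input_features": [
        {"name": "prompt", "type": "text",
         "preprocessing": {"max_sequence_length": 2048}},
    ],
    "output_features": [
        {"name": "Response", "type": "text",
         "preprocessing": {"max_sequence_length": 2048}},
    ],
    "trainer": {
        "type": "finetune",
        "batch_size": 1,
        "gradient_accumulation_steps": 1,
        "enable_gradient_checkpointing": True,
    },
    "backend": {
        "type": "ray",
        "trainer": {
            "use_gpu": True,
            "num_workers": 2,  # one per T4
            "strategy": {
                "type": "deepspeed",
                "zero_optimization": {"stage": 3},
            },
        },
    },
}

# from ludwig.api import LudwigModel
# LudwigModel(config=config).train(dataset="train.parquet")

Whether the combined 4096 tokens (2048 in plus 2048 out) then fit in practice would still need to be measured on the two-T4 machine.
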
rahulvramesh commented 3 months ago

Does quantization still not work with Ray? Any ideas?