Closed: @Ragul-Ramdass closed this issue 1 month ago.
Hi @Ragul-Ramdass -- thank you for reporting this issue and the one in #3784 -- please give us a few business days to look into it and get back to you. Thank you.
I'm facing the exact same issue with both strategies, deepspeed and ddp. Below are the conda environment and model.yaml for reference:
absl-py==2.0.0 accelerate==0.24.1 aiohttp==3.9.0 aiohttp-cors==0.7.0 aiorwlock==1.3.0 aiosignal==1.3.1 anyio==3.7.1 asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work async-timeout==4.0.3 attrs==23.1.0 awscli==1.30.3 backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1687772187254/work beautifulsoup4==4.12.2 bitsandbytes==0.40.2 bleach==6.1.0 blessed==1.20.0 blinker==1.7.0 blis==0.7.11 botocore==1.32.3 Brotli==1.1.0 cachetools==5.3.2 captum==0.6.0 catalogue==2.0.10 certifi==2023.11.17 charset-normalizer==3.3.2 click==8.1.7 cloudpathlib==0.16.0 cloudpickle==3.0.0 colorama==0.4.4 colorful==0.5.5 comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1691044910542/work commonmark==0.9.1 confection==0.1.3 contourpy==1.2.0 cycler==0.12.1 cymem==2.0.8 Cython==3.0.5 dask==2023.3.2 dataclasses-json==0.6.2 datasets==2.15.0 debugpy @ file:///croot/debugpy_1690905042057/work decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work deepspeed==0.12.3 dill==0.3.7 distlib==0.3.7 docutils==0.16 entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work et-xmlfile==1.1.0 exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1692026125334/work executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work faiss-cpu==1.7.4 fastapi==0.104.1 filelock==3.13.1 Flask==3.0.0 Flask-Compress==1.14 fonttools==4.44.3 frozenlist==1.4.0 fsspec==2023.9.2 future==0.18.3 getdaft==0.1.20 google-api-core==2.14.0 google-auth==2.23.4 google-auth-oauthlib==1.1.0 googleapis-common-protos==1.61.0 gpustat==1.1.1 GPUtil==1.4.0 grpcio==1.51.3 h11==0.14.0 h5py==3.10.0 hiplot==0.1.33 hjson==3.1.0 html5lib==1.1 httpcore==1.0.2 httpx==0.25.1 huggingface-hub==0.19.4 hummingbird-ml==0.4.9 hyperopt==0.2.7 idna==3.4 importlib-metadata==6.8.0 ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1698244021190/work ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1698846603011/work itsdangerous==2.1.2 jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 jsonschema==4.6.2 jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1654730843242/work jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1698673647019/work kaggle==1.5.16 kiwisolver==1.4.5 langcodes==3.3.0 lightgbm==4.1.0 lightgbm-ray==0.1.9 locket==1.0.0 loguru==0.7.2 loralib==0.1.2 ludwig @ git+https://github.com/ludwig-ai/ludwig.git@8c47c3cb16a972e0c27818a2124a3e0359142ca0 lxml==4.9.3 Markdown==3.5.1 MarkupSafe==2.1.3 marshmallow==3.20.1 marshmallow-dataclass==8.5.4 marshmallow-jsonschema==0.13.0 matplotlib==3.8.2 matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work mpi4py @ file:///croot/mpi4py_1671223370575/work mpmath==1.3.0 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 murmurhash==1.0.10 mypy-extensions==1.0.0 nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1697083700168/work networkx==3.2.1 ninja==1.11.1.1 nltk==3.8.1 numpy==1.26.2 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 
nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.535.133 nvidia-nccl-cu12==2.18.1 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 oauthlib==3.2.2 onnx==1.15.0 onnxconverter-common==1.13.0 opencensus==0.11.3 opencensus-context==0.1.3 openpyxl==3.1.2 packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work pandas==2.1.3 parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work partd==1.4.1 peft==0.6.2 pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1667297516076/work pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work Pillow==10.1.0 platformdirs==3.11.0 preshed==3.0.9 prometheus-client==0.18.0 prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1699963054032/work protobuf==3.20.3 psutil==5.9.4 ptitprince==0.2.7 ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work py==1.11.0 py-cpuinfo==9.0.0 py-spy==0.3.14 py4j==0.10.9.7 pyarrow==14.0.1 pyarrow-hotfix==0.5 pyasn1==0.5.0 pyasn1-modules==0.3.0 pydantic==1.10.13 Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700320772037/work pynvml==11.5.0 pyparsing==3.1.1 pyrsistent==0.20.0 python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work python-multipart==0.0.6 python-slugify==8.0.1 pytz==2023.3.post1 pyxlsb==1.0.10 PyYAML==6.0 pyzmq @ file:///croot/pyzmq_1686601365461/work ray==2.4.0 regex==2023.10.3 requests==2.31.0 requests-oauthlib==1.3.1 retry==0.9.2 rich==12.4.4 rsa==4.7.2 s3fs==0.4.2 s3transfer==0.7.0 sacremoses==0.1.1 safetensors==0.4.0 scikit-learn==1.3.2 scipy==1.11.4 seaborn==0.11.0 sentence-transformers==2.2.2 sentencepiece==0.1.99 six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work smart-open==6.4.0 sniffio==1.3.0 soupsieve==2.5 spacy==3.7.2 spacy-legacy==3.0.12 spacy-loggers==1.0.5 srsly==2.4.8 stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work starlette==0.27.0 sympy==1.12 tabulate==0.9.0 tblib==3.0.0 tensorboard==2.15.1 tensorboard-data-server==0.7.2 tensorboardX==2.2 text-unidecode==1.3 thinc==8.2.1 threadpoolctl==3.2.0 tokenizers==0.15.0 toolz==0.12.0 torch==2.1.1 torchaudio==2.1.1 torchdata==0.7.1 torchinfo==1.8.0 torchmetrics==0.11.4 torchtext==0.16.1 torchvision==0.16.1 tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827254365/work tqdm==4.66.1 traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1698671135544/work transformers==4.35.2 triton==2.1.0 typer==0.9.0 typing-inspect==0.9.0 typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1695040754690/work tzdata==2023.3 urllib3==2.0.7 uvicorn==0.24.0.post1 virtualenv==20.21.0 wasabi==1.1.2 wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1699959196938/work weasel==0.3.4 webencodings==0.5.1 Werkzeug==3.0.1 wrapt==1.16.0 xgboost==2.0.2 xgboost-ray==0.1.18 xlrd==2.0.1 XlsxWriter==3.1.9 xlwt==1.3.0 xxhash==3.4.1 yarl==1.9.2 zipp==3.17.0
model_type: llm
base_model: /test/CodeLlama-7b-Python-hf
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: |
    ### Instruction:
    {Instruction}
    ### Context:
    {Context}
    ### Input:
    {Input}
    ### Response:
input_features:
output_features:
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
preprocessing:
  sample_ratio: 1.0
backend:
  type: ray
  trainer:
    use_gpu: true
    num_workers: 2
    resources_per_worker:
      CPU: 2
      GPU: 1
    strategy:
      type: ddp
Hello @alexsherstinsky - Kind follow-up on this thread. Is there any workaround to resolve this issue?
@SanjoySahaTigerAnalytics Yes! We discussed this as a team, and I received direction on how to troubleshoot it in our own environment (containing the required number of GPUs). I am planning to start on this tomorrow and continue into next week. I will post my findings here in the comments. Thank you very much for your patience.
Hello @alexsherstinsky - Thank you very much for prioritizing it. Will wait for your response.
Hello @alexsherstinsky - Kind follow-up on this thread. Please let us know if there is any luck. Thank you in advance.
@SanjoySahaTigerAnalytics -- sorry for the delay; this has been escalated to the team. Someone will investigate and respond soon. Thank you again for your patience.
Hi @SanjoySahaTigerAnalytics! Apologies for the late response from our end. The reason you're running into issues is that 4-bit quantization isn't supported with DeepSpeed Stage 3, which is what Ludwig defaults to when the zero_optimization_stage isn't specified in your config.
To solve this, there are three options, each with its own tradeoffs; the right one will depend on your goal. Option 1 is to keep 4-bit quantization and switch to the local backend:
model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
bits: 4
adapter:
type: lora
prompt:
template: |
### Instruction:
{Instruction}
### Context:
{Context}
### Input:
{Input}
### Response:
input_features:
- name: prompt
type: text
preprocessing:
max_sequence_length: 2048
output_features:
- name: Response
type: text
preprocessing:
max_sequence_length: 2048
trainer:
type: finetune
learning_rate: 0.0001
batch_size: 1
max_batch_size: 1
gradient_accumulation_steps: 1
enable_gradient_checkpointing: true
epochs: 3
learning_rate_scheduler:
warmup_fraction: 0.01
backend:
type: local
This will perform naive model parallel training, where your 4-bit Llama-2 model will be sharded across both of your GPUs, but it will not perform data parallel training. Training will likely be slower than training on just one of your two T4 GPUs because there is overhead in passing intermediate states between GPU 1 and GPU 2 on every forward and backward pass. However, this will not run into any issues and is the path I recommend for now.
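For reference, here is a minimal sketch of launching this Option 1 fine-tune through Ludwig's Python API; the file names model.yaml and train.csv and the use of a pandas DataFrame are assumptions, so adjust them to your setup:

import pandas as pd
from ludwig.api import LudwigModel

# Assumed paths: the Option 1 config above saved as model.yaml, and a CSV containing the
# Instruction/Context/Input/Response columns referenced by the prompt template.
df = pd.read_csv("train.csv")
model = LudwigModel(config="model.yaml")
results = model.train(dataset=df)  # same entry point used elsewhere in this thread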
Option 2 is to remove quantization and use DeepSpeed Stage 3 (data parallel plus model parallel) through the Ray backend:
model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
adapter:
type: lora
prompt:
template: |
### Instruction:
{Instruction}
### Context:
{Context}
### Input:
{Input}
### Response:
input_features:
- name: prompt
type: text
preprocessing:
max_sequence_length: 2048
output_features:
- name: Response
type: text
preprocessing:
max_sequence_length: 2048
trainer:
type: finetune
learning_rate: 0.0001
batch_size: 1
max_batch_size: 1
gradient_accumulation_steps: 1
enable_gradient_checkpointing: true
epochs: 3
learning_rate_scheduler:
warmup_fraction: 0.01
backend:
type: ray
trainer:
use_gpu: true
strategy:
type: deepspeed
zero_optimization:
stage: 3
offload_optimizer:
device: cpu
pin_memory: true
bf16:
enabled: true
This will perform data parallel plus model parallel training across both of your GPUs. Under the surface, it shards your model across both GPU devices and also shards the data across the total number of workers. During each forward pass there are a few all-gather and all-reduce operations to propagate model states to each of the GPUs, and similar collectives are used to compute gradients and update the weights during the backward pass. This can also be a bit slow, but it works nicely for larger models.
The drawback, as mentioned earlier, is that DeepSpeed Stage 3 unfortunately doesn't work with quantized models such as 4-bit models. The reason is that Stage 3 shards the weights, but it is opinionated that the data type of all layers must be the same, and it particularly doesn't like the nf4/int8 formats mixed with fp16 LoRA layers. For that reason, you'll notice that I removed the quantization section from the config.
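As an aside (this check is not from the original reply, just a suggestion): since the Option 2 config enables bf16, it may be worth confirming that your GPUs actually support bfloat16 before launching DeepSpeed:

import torch

# Supplementary sanity check: bf16 requires hardware support on the current CUDA device.
print("CUDA available:", torch.cuda.is_available())
print("bf16 supported on current device:", torch.cuda.is_bf16_supported())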
Option 3 is to keep 4-bit quantization and use DeepSpeed Stage 2 through the Ray backend:
model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
bits: 4
adapter:
type: lora
prompt:
template: |
### Instruction:
{Instruction}
### Context:
{Context}
### Input:
{Input}
### Response:
input_features:
- name: prompt
type: text
preprocessing:
max_sequence_length: 2048
output_features:
- name: Response
type: text
preprocessing:
max_sequence_length: 2048
trainer:
type: finetune
learning_rate: 0.0001
batch_size: 1
max_batch_size: 1
gradient_accumulation_steps: 1
enable_gradient_checkpointing: true
epochs: 3
learning_rate_scheduler:
warmup_fraction: 0.01
backend:
type: ray
trainer:
use_gpu: true
strategy:
type: deepspeed
zero_optimization:
stage: 2
DeepSpeed Stage 2 doesn't do any sharding of model weights, just the gradients and optimizer state. Since 4-bit quantized Llama-2-7b fits on a single T4 GPU, this essentially gives you Distributed Data Parallel (DDP) style training with the added benefit of sharding the gradients and optimizer states across GPUs (and the option to offload them to CPU if needed). This would be the ideal solution for fine-tuning Llama-7b on a machine with two T4 GPUs, but it is not currently supported by Ludwig.
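For completeness, the optional CPU offload mentioned above would just extend the zero_optimization section of the backend. As a sketch (hypothetical until Ludwig supports Stage 2, and reusing the offload_optimizer keys shown in the Option 2 config), the strategy portion of the config could be expressed as a Python dict like this:

# Sketch only: DeepSpeed Stage 2 with optimizer-state offload to CPU, written as the
# "strategy" portion of a Ludwig backend config dict. Not usable until Stage 2 is supported.
strategy = {
    "type": "deepspeed",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}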
We have an active PR (https://github.com/ludwig-ai/ludwig/pull/3728) that we're hoping to merge into Ludwig master by the end of this week or early next week at the latest. Stay tuned!
For now, I would recommend going with Option 1 and setting CUDA_VISIBLE_DEVICES to either just a single GPU or both GPUs, depending on what you'd like. I expect that a single GPU will actually train faster in this case, but it is worth checking.
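For example, a minimal sketch of restricting a run to one GPU from Python; note that CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, so set it before importing torch or ludwig (or export it in the shell that launches the script):

import os

# Expose only GPU 0 to this process; use "0,1" to expose both T4s instead.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after setting the variable so CUDA sees the restricted device list

print("visible GPUs:", torch.cuda.device_count())  # should print 1 (or 2 with "0,1")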
The last thing I want to mention is that in your config, you have max_sequence_length set to 2048 for both the input feature and the output feature, which means the model will do forward passes on a maximum total of 4096 tokens (the same as the Llama-2 context window in the base model). That may be fine when you use the local backend with both of your T4 GPUs, since you effectively have a lot more GPU VRAM available, but typically a single T4 GPU can only fit a max sequence length of 2048 before it OOMs. That may be something to take a look at as well if you run into OOM errors.
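If it is unclear how long your examples actually are, one rough way to pick max_sequence_length is to tokenize the raw columns with the base model's tokenizer and inspect the longest rows. A sketch follows; the CSV path and the Instruction/Context/Input/Response column names are assumptions based on the prompt template above:

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/root/CodeLlama-7b-Python-hf")  # same base model path as the config
df = pd.read_csv("train.csv")  # assumed dataset file

# Render each row roughly the way the prompt template does, then count tokens.
prompt_text = (
    "### Instruction:\n" + df["Instruction"].astype(str)
    + "\n\n### Context:\n" + df["Context"].astype(str)
    + "\n\n### Input:\n" + df["Input"].astype(str)
    + "\n\n### Response:\n"
)
prompt_lens = prompt_text.map(lambda t: len(tokenizer(t)["input_ids"]))
response_lens = df["Response"].astype(str).map(lambda t: len(tokenizer(t)["input_ids"]))

print("max prompt tokens:", prompt_lens.max())
print("max response tokens:", response_lens.max())
print("max combined tokens:", (prompt_lens + response_lens).max())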
Hope this helps unblock you!
Hello @arnavgarg1 - Thank you very much for looking into this. I will wait for Option 3 once you have the PR merged. For now I have tried Option 1 and Option 2, and I am facing errors with both of them. Details below:
I changed the config as below (instead of using 2048 for the input and output features separately, allowing 4096 for them combined) and set PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:128:
model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
bits: 4
adapter:
type: lora
prompt:
template: |
### Instruction:
{Instruction}
### Context:
{Context}
### Input:
{Input}
### Response:
input_features:
- name: prompt
type: text
output_features:
- name: Response
type: text
preprocessing:
max_sequence_length: 4096
trainer:
type: finetune
learning_rate: 0.0001
batch_size: 1
max_batch_size: 1
gradient_accumulation_steps: 1
enable_gradient_checkpointing: true
epochs: 3
learning_rate_scheduler:
warmup_fraction: 0.01
preprocessing:
sample_ratio: 1.0
backend:
type: local
The training process failed at 13%
╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛
╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name │ api_experiment │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name │ run │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /root/train_llama_using_ludwig_exp2/results/api_experiment_run │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version │ '0.9.dev' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ command │ ('/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ipykernel_launcher.py ' │
│ │ '--f=/root/.local/share/jupyter/runtime/kernel-v2-17885gVhtM1z18wl.json') │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ random_seed │ 42 │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ data_format │ "<class 'pandas.core.frame.DataFrame'>" │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ torch_version │ '2.1.1+cu121' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ compute │ { 'arch_list': [ 'sm_50', │
│ │ 'sm_60', │
│ │ 'sm_70', │
│ │ 'sm_75', │
│ │ 'sm_80', │
│ │ 'sm_86', │
│ │ 'sm_90'], │
│ │ 'devices': { 0: { 'device_capability': (7, 5), │
│ │ 'device_properties': "_CudaDeviceProperties(name='Tesla " │
│ │ "T4', major=7, minor=5, " │
│ │ 'total_memory=14929MB, ' │
│ │ 'multi_processor_count=40)', │
│ │ 'gpu_type': 'Tesla T4'}, │
│ │ 1: { 'device_capability': (7, 5), │
│ │ 'device_properties': "_CudaDeviceProperties(name='Tesla " │
│ │ "T4', major=7, minor=5, " │
│ │ 'total_memory=14929MB, ' │
│ │ 'multi_processor_count=40)', │
│ │ 'gpu_type': 'Tesla T4'}}, │
│ │ 'gencode_flags': '-gencode compute=compute_50,code=sm_50 -gencode ' │
│ │ 'compute=compute_60,code=sm_60 -gencode ' │
│ │ 'compute=compute_70,code=sm_70 -gencode ' │
│ │ 'compute=compute_75,code=sm_75 -gencode ' │
│ │ 'compute=compute_80,code=sm_80 -gencode ' │
│ │ 'compute=compute_86,code=sm_86 -gencode ' │
│ │ 'compute=compute_90,code=sm_90', │
│ │ 'gpus_per_node': 2, │
│ │ 'num_nodes': 1} │
╘══════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛
User-specified config (with upgrades):
{ 'adapter': {'type': 'lora'},
'backend': {'type': 'local'},
'base_model': '/root/CodeLlama-7b-Python-hf',
'input_features': [{'name': 'prompt', 'type': 'text'}],
'ludwig_version': '0.9.dev',
'model_type': 'llm',
'output_features': [{'name': 'Response', 'type': 'text'}],
'preprocessing': {'sample_ratio': 1.0},
'prompt': { 'template': '### Instruction:\n'
'{Instruction}\n'
'\n'
'### Context:\n'
'{Context}\n'
'\n'
'### Input:\n'
'{Input}\n'
'\n'
'### Response:\n'},
'quantization': {'bits': 4},
'trainer': { 'batch_size': 1,
'enable_gradient_checkpointing': True,
'epochs': 3,
'gradient_accumulation_steps': 1,
'learning_rate': 0.0001,
'learning_rate_scheduler': {'warmup_fraction': 0.01},
'max_batch_size': 1,
'type': 'finetune'}}
Full config saved to:
/root/train_llama_using_ludwig_exp2/results/api_experiment_run/api_experiment/model/model_hyperparameters.json
╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛
No cached dataset found at /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.training.hdf5. Preprocessing the dataset.
Using full dataframe
Building dataset (it may take a while)
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'None': 1034 (without start and stop symbols)
Max sequence length is 1034 for feature 'None'
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'Response': 3672 (without start and stop symbols)
Max sequence length is 3672 for feature 'Response'
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Building dataset: DONE
Writing preprocessed training set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.training.hdf5
Writing preprocessed validation set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.validation.hdf5
Writing preprocessed test set cache to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.test.hdf5
Writing train set metadata to /root/train_llama_using_ludwig_exp2/50d5754098eb11ee995742010a800003.meta.json
Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset │ Size (Rows) │ Size (In Memory) │
╞════════════╪═══════════════╪════════════════════╡
│ Training │ 2016 │ 472.62 Kb │
├────────────┼───────────────┼────────────────────┤
│ Validation │ 288 │ 67.62 Kb │
├────────────┼───────────────┼────────────────────┤
│ Test │ 576 │ 135.12 Kb │
╘════════════╧═══════════════╧════════════════════╛
╒═══════╕
│ MODEL │
╘═══════╛
Warnings and other logs:
Loading large language model...
Loading checkpoint shards: 100%|██████████| 3/3 [01:12<00:00, 24.29s/it]Done.
No padding token found. Using '[PAD]' as the pad token.
Loaded HuggingFace implementation of /root/CodeLlama-7b-Python-hf tokenizer
No padding token found. Using '[PAD]' as the pad token.
==================================================
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
==================================================
Gradient checkpointing enabled for training.
╒══════════╕
│ TRAINING │
╘══════════╛
Creating fresh model training run.
Training for 6048 step(s), approximately 3 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 10080 step(s), approximately 5 epoch(s).
Starting with step 0, epoch: 0
Training: 0%| | 0/6048 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Training: 13%|█▎ | 771/6048 [1:00:15<3:21:40, 2.29s/it, loss=0.0453]
{
"name": "OutOfMemoryError",
"message": "CUDA out of memory. Tried to allocate 2.19 GiB. GPU 1 has a total capacty of 14.58 GiB of which 1.39 GiB is free. Including non-PyTorch memory, this process has 13.18 GiB memory in use. Of the allocated memory 12.22 GiB is allocated by PyTorch, and 846.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
"stack": "---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[8], line 1
----> 1 results = model.train(dataset=df)
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/api.py:677, in LudwigModel.train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
670 callback.on_train_start(
671 model=self.model,
672 config=self.config_obj.to_dict(),
673 config_fp=self.config_fp,
674 )
676 try:
--> 677 train_stats = trainer.train(
678 training_set,
679 validation_set=validation_set,
680 test_set=test_set,
681 save_path=model_dir,
682 )
683 (self.model, train_trainset_stats, train_valiset_stats, train_testset_stats) = train_stats
685 # Calibrates output feature probabilities on validation set if calibration is enabled.
686 # Must be done after training, and before final model parameters are saved.
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:970, in Trainer.train(self, training_set, validation_set, test_set, save_path, return_state_dict, **kwargs)
967 self.callback(lambda c: c.on_epoch_start(self, progress_tracker, save_path))
969 # Trains over a full epoch of data or up to the last training step, whichever is sooner.
--> 970 should_break = self._train_loop(
971 batcher,
972 progress_tracker,
973 save_path,
974 train_summary_writer,
975 progress_bar,
976 training_set,
977 validation_set,
978 test_set,
979 start_time,
980 validation_summary_writer,
981 test_summary_writer,
982 model_hyperparameters_path,
983 output_features,
984 metrics_names,
985 checkpoint_manager,
986 final_steps_per_checkpoint,
987 early_stopping_steps,
988 profiler,
989 )
990 if self.is_coordinator():
991 # ========== Save training progress ==========
992 logger.debug(
993 f\"Epoch {progress_tracker.epoch} took: \"
994 f\"{time_utils.strdelta((time.time() - start_time) * 1000.0)}.\"
995 )
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:1144, in Trainer._train_loop(self, batcher, progress_tracker, save_path, train_summary_writer, progress_bar, training_set, validation_set, test_set, start_time, validation_summary_writer, test_summary_writer, model_hyperparameters_path, output_features, metrics_names, checkpoint_manager, final_steps_per_checkpoint, early_stopping_steps, profiler)
1135 inputs = {
1136 i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(self.device)
1137 for i_feat in self.model.input_features.values()
1138 }
1139 targets = {
1140 o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(self.device)
1141 for o_feat in self.model.output_features.values()
1142 }
-> 1144 loss, all_losses = self.train_step(inputs, targets, should_step=should_step, profiler=profiler)
1146 # Update LR schduler here instead of train loop to avoid updating during batch size tuning, etc.
1147 self.scheduler.step()
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/trainers/trainer.py:315, in Trainer.train_step(self, inputs, targets, should_step, profiler)
313 self.scaler.scale(loss).backward()
314 else:
--> 315 self.distributed.backward(loss, self.dist_model)
317 if not should_step:
318 # Short-circuit the parameter updates if we are still accumulating gradients
319 return loss, all_losses
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/distributed/base.py:58, in DistributedStrategy.backward(self, loss, model)
57 def backward(self, loss: torch.Tensor, model: nn.Module):
---> 58 loss.backward()
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
482 if has_torch_function_unary(self):
483 return handle_torch_function(
484 Tensor.backward,
485 (self,),
(...)
490 inputs=inputs,
491 )
--> 492 torch.autograd.backward(
493 self, gradient, retain_graph, create_graph, inputs=inputs
494 )
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
246 retain_graph = create_graph
248 # The reason we repeat the same comment below is that
249 # some Python versions print out the first line of a multi-line function
250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
252 tensors,
253 grad_tensors_,
254 retain_graph,
255 create_graph,
256 inputs,
257 allow_unreachable=True,
258 accumulate_grad=True,
259 )
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/function.py:288, in BackwardCFunction.apply(self, *args)
282 raise RuntimeError(
283 \"Implementing both 'backward' and 'vjp' for a custom \"
284 \"Function is not allowed. You should only implement one \"
285 \"of them.\"
286 )
287 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 288 return user_fn(self, *args)
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/utils/checkpoint.py:288, in CheckpointFunction.backward(ctx, *args)
283 if len(outputs_with_grad) == 0:
284 raise RuntimeError(
285 \"none of output has requires_grad=True,\"
286 \" this checkpoint() is not necessary\"
287 )
--> 288 torch.autograd.backward(outputs_with_grad, args_with_grad)
289 grads = tuple(
290 inp.grad if isinstance(inp, torch.Tensor) else None
291 for inp in detached_inputs
292 )
294 return (None, None) + grads
File ~/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
246 retain_graph = create_graph
248 # The reason we repeat the same comment below is that
249 # some Python versions print out the first line of a multi-line function
250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
252 tensors,
253 grad_tensors_,
254 retain_graph,
255 create_graph,
256 inputs,
257 allow_unreachable=True,
258 accumulate_grad=True,
259 )
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.19 GiB. GPU 1 has a total capacty of 14.58 GiB of which 1.39 GiB is free. Including non-PyTorch memory, this process has 13.18 GiB memory in use. Of the allocated memory 12.22 GiB is allocated by PyTorch, and 846.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
}
For Option 2 (the Ray backend with DeepSpeed), the run fails as shown below:
2023-12-12 09:27:39,257 WARNING util.py:244 -- The `start_trial` operation took 0.692 s, which may be a performance bottleneck.
[2m[36m(TrainTrainable pid=7142)[0m /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2m[36m(TrainTrainable pid=7142)[0m warn("The installed version of bitsandbytes was compiled without GPU support. "
[2m[36m(TrainTrainable pid=7142)[0m /root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
[2m[36m(RayTrainWorker pid=7207)[0m 2023-12-12 09:27:57,361 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
[2m[36m(TorchTrainer pid=7142)[0m 2023-12-12 09:27:57,498 INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->RandomizeBlockOrder]
[2m[36m(TorchTrainer pid=7142)[0m 2023-12-12 09:27:57,499 INFO streaming_executor.py:84 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
[2m[36m(TorchTrainer pid=7142)[0m 2023-12-12 09:27:59,040 INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->RandomizeBlockOrder]
[2m[36m(TorchTrainer pid=7142)[0m 2023-12-12 09:27:59,041 INFO streaming_executor.py:84 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
[2m[36m(pid=7324) [0mStage 0 0: 0%| | 0/1 [00:09<?, ?it/s][2m[36m(RayTrainWorker pid=7207)[0m [2023-12-12 09:28:08,709] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2m[36m(pid=7324) [0mStage 0 0: 0%| | 0/1 [00:12<?, ?it/s]2023-12-12 09:28:11,658 WARNING worker.py:1986 -- Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 870, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 921, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 877, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 881, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 821, in ray._raylet.execute_task.function_executor
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/function_manager.py", line 670, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
return method(self, *_args, **_kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
raise skipped from exception_cause(skipped)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 501, in <lambda>
lambda config: train_fn(**config),
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 193, in train_fn
val_shard = RayDatasetShard(val_shard, features, training_set_metadata)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 244, in __init__
self.create_epoch_iter()
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 268, in create_epoch_iter
self.epoch_iter = self.dataset_shard.repeat().iter_epochs()
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/dataset_iterator/dataset_iterator_impl.py", line 48, in __getattr__
raise DeprecationWarning(
DeprecationWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
return Pickler.dump(self, obj)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1197, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1100, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 823, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1001, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 623, in ray._raylet.store_task_errors
File "python/ray/_raylet.pyx", line 2563, in ray._raylet.CoreWorker.store_task_outputs
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 466, in serialize
return self._serialize_to_msgpack(value)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/serialization.py", line 421, in _serialize_to_msgpack
value = value.to_bytes()
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/exceptions.py", line 32, in to_bytes
serialized_exception=pickle.dumps(self),
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
cp.dump(obj)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
An unexpected internal error occurred while the worker was executing a task.
2023-12-12 09:28:11,664 WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff3af2f4d99454008df9ee8ed701000000 Worker ID: 6b97c810d37673028c94be550f07d7d2ce73b6d2f867004f1d13eb3a Node ID: 6ea58fc55f420c9b016bd501b6eb841ea09722c353633d82fc917ba6 Worker IP address: 10.128.0.3 Worker port: 35699 Worker PID: 7208 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 870, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 921, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 877, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 881, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 821, in ray._raylet.execute_task.function_executor
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/_private/function_manager.py", line 670, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 460, in _resume_span
return method(self, *_args, **_kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
raise skipped from exception_cause(skipped)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 501, in <lambda>
lambda config: train_fn(**config),
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/backend/ray.py", line 193, in train_fn
val_shard = RayDatasetShard(val_shard, features, training_set_metadata)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 244, in __init__
self.create_epoch_iter()
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ludwig/data/dataset/ray.py", line 268, in create_epoch_iter
self.epoch_iter = self.dataset_shard.repeat().iter_epochs()
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/data/_internal/dataset_iterator/dataset_iterator_impl.py", line 48, in __getattr__
raise DeprecationWarning(
DeprecationWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
return Pickler.dump(self, obj)
File "/root/anaconda3/envs/ludwig_train_env/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison
Thanks for reporting results back! I may know the cause for both, but just want to check - may I ask what version of Torch and Ray you're using?
Thanks @arnavgarg1 for the quick response. Torch: 2.1.1, Ray: 2.4.0.
Below is my Conda Environment:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 2.0.0 pypi_0 pypi
accelerate 0.25.0 pypi_0 pypi
aiohttp 3.9.1 pypi_0 pypi
aiohttp-cors 0.7.0 pypi_0 pypi
aiorwlock 1.3.0 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
anyio 3.7.1 pypi_0 pypi
asttokens 2.4.1 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
awscli 1.31.12 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
bitsandbytes 0.40.2 pypi_0 pypi
bleach 6.1.0 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
blinker 1.7.0 pypi_0 pypi
blis 0.7.11 pypi_0 pypi
botocore 1.33.12 pypi_0 pypi
brotli 1.1.0 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.11.17 hbcca054_0 conda-forge
cachetools 5.3.2 pypi_0 pypi
captum 0.7.0 pypi_0 pypi
catalogue 2.0.10 pypi_0 pypi
certifi 2023.11.17 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cloudpathlib 0.16.0 pypi_0 pypi
cloudpickle 3.0.0 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
colorful 0.5.5 pypi_0 pypi
comm 0.1.4 pyhd8ed1ab_0 conda-forge
commonmark 0.9.1 pypi_0 pypi
confection 0.1.4 pypi_0 pypi
contourpy 1.2.0 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
cymem 2.0.8 pypi_0 pypi
cython 3.0.6 pypi_0 pypi
dask 2023.3.2 pypi_0 pypi
dataclasses-json 0.6.3 pypi_0 pypi
datasets 2.15.0 pypi_0 pypi
debugpy 1.6.7 py310h6a678d5_0
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
deepspeed 0.12.4 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
distlib 0.3.8 pypi_0 pypi
docutils 0.16 pypi_0 pypi
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
et-xmlfile 1.1.0 pypi_0 pypi
exceptiongroup 1.2.0 pyhd8ed1ab_0 conda-forge
executing 2.0.1 pyhd8ed1ab_0 conda-forge
faiss-cpu 1.7.4 pypi_0 pypi
fastapi 0.105.0 pypi_0 pypi
filelock 3.13.1 pypi_0 pypi
flask 3.0.0 pypi_0 pypi
flask-compress 1.14 pypi_0 pypi
fonttools 4.46.0 pypi_0 pypi
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.10.0 pypi_0 pypi
future 0.18.3 pypi_0 pypi
getdaft 0.1.20 pypi_0 pypi
google-api-core 2.15.0 pypi_0 pypi
google-auth 2.25.2 pypi_0 pypi
google-auth-oauthlib 1.1.0 pypi_0 pypi
googleapis-common-protos 1.62.0 pypi_0 pypi
gpustat 1.1.1 pypi_0 pypi
gputil 1.4.0 pypi_0 pypi
grpcio 1.51.3 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
h5py 3.10.0 pypi_0 pypi
hiplot 0.1.33 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
html5lib 1.1 pypi_0 pypi
httpcore 1.0.2 pypi_0 pypi
httpx 0.25.2 pypi_0 pypi
huggingface-hub 0.19.4 pypi_0 pypi
hummingbird-ml 0.4.9 pypi_0 pypi
hyperopt 0.2.7 pypi_0 pypi
idna 3.6 pypi_0 pypi
imagecodecs 2023.9.18 pypi_0 pypi
importlib-metadata 7.0.0 pypi_0 pypi
ipykernel 6.26.0 pyhf8b6a83_0 conda-forge
ipython 8.18.1 pyh707e725_3 conda-forge
itsdangerous 2.1.2 pypi_0 pypi
jedi 0.19.1 pyhd8ed1ab_0 conda-forge
jinja2 3.1.2 pypi_0 pypi
jmespath 1.0.1 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
jsonschema 4.6.2 pypi_0 pypi
jupyter_client 7.3.4 pyhd8ed1ab_0 conda-forge
jupyter_core 5.5.0 py310hff52083_0 conda-forge
kaggle 1.5.16 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
langcodes 3.3.0 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 7.5.0 ha8ba4b0_17
libgfortran4 7.5.0 ha8ba4b0_17
libgomp 11.2.0 h1234567_1
libsodium 1.0.18 h36c2ea0_1 conda-forge
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
lightgbm 4.1.0 pypi_0 pypi
lightgbm-ray 0.1.9 pypi_0 pypi
locket 1.0.0 pypi_0 pypi
loguru 0.7.2 pypi_0 pypi
loralib 0.1.2 pypi_0 pypi
ludwig 0.9.dev0 pypi_0 pypi
lxml 4.9.3 pypi_0 pypi
markdown 3.5.1 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
marshmallow 3.20.1 pypi_0 pypi
marshmallow-dataclass 8.5.4 pypi_0 pypi
marshmallow-jsonschema 0.13.0 pypi_0 pypi
matplotlib 3.8.2 pypi_0 pypi
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mpi 1.0 mpich
mpi4py 3.1.4 py310hfc96bbd_0
mpich 3.3.2 hc856adb_0
mpmath 1.3.0 pypi_0 pypi
msgpack 1.0.7 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
murmurhash 1.0.10 pypi_0 pypi
mypy-extensions 1.0.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nest-asyncio 1.5.8 pyhd8ed1ab_0 conda-forge
networkx 3.2.1 pypi_0 pypi
ninja 1.11.1.1 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
numpy 1.26.2 pypi_0 pypi
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-ml-py 12.535.133 pypi_0 pypi
nvidia-nccl-cu12 2.18.1 pypi_0 pypi
nvidia-nvjitlink-cu12 12.3.101 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
onnx 1.15.0 pypi_0 pypi
onnxconverter-common 1.13.0 pypi_0 pypi
opencensus 0.11.3 pypi_0 pypi
opencensus-context 0.1.3 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.0.12 h7f8727e_0
packaging 23.2 pyhd8ed1ab_0 conda-forge
pandas 2.1.4 pypi_0 pypi
parso 0.8.3 pyhd8ed1ab_0 conda-forge
partd 1.4.1 pypi_0 pypi
peft 0.7.0 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
pickleshare 0.7.5 py_1003 conda-forge
pillow 10.1.0 pypi_0 pypi
pip 23.3.1 py310h06a4308_0
platformdirs 3.11.0 pypi_0 pypi
preshed 3.0.9 pypi_0 pypi
prometheus-client 0.19.0 pypi_0 pypi
prompt-toolkit 3.0.41 pyha770c72_0 conda-forge
protobuf 3.20.3 pypi_0 pypi
psutil 5.9.4 pypi_0 pypi
ptitprince 0.2.7 pypi_0 pypi
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
py 1.11.0 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
py-spy 0.3.14 pypi_0 pypi
py4j 0.10.9.7 pypi_0 pypi
pyarrow 14.0.1 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pyasn1 0.5.1 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
pydantic 1.10.13 pypi_0 pypi
pygments 2.17.2 pyhd8ed1ab_0 conda-forge
pynvml 11.5.0 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pyrsistent 0.20.0 pypi_0 pypi
python 3.10.13 h955ad1f_0
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-multipart 0.0.6 pypi_0 pypi
python-slugify 8.0.1 pypi_0 pypi
python_abi 3.10 2_cp310 conda-forge
pytz 2023.3.post1 pypi_0 pypi
pyxlsb 1.0.10 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 25.1.0 py310h6a678d5_0
ray 2.4.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.10.3 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
retry 0.9.2 pypi_0 pypi
rich 12.4.4 pypi_0 pypi
rsa 4.7.2 pypi_0 pypi
s3fs 0.4.2 pypi_0 pypi
s3transfer 0.8.2 pypi_0 pypi
sacremoses 0.1.1 pypi_0 pypi
safetensors 0.4.1 pypi_0 pypi
scikit-learn 1.3.2 pypi_0 pypi
scipy 1.11.4 pypi_0 pypi
seaborn 0.11.0 pypi_0 pypi
sentence-transformers 2.2.2 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 68.0.0 py310h06a4308_0
six 1.16.0 pyh6c4a22f_0 conda-forge
smart-open 6.4.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
spacy 3.7.2 pypi_0 pypi
spacy-legacy 3.0.12 pypi_0 pypi
spacy-loggers 1.0.5 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
srsly 2.4.8 pypi_0 pypi
stack-data 0.6.3 pypi_0 pypi
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
starlette 0.27.0 pypi_0 pypi
sympy 1.12 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tblib 3.0.0 pypi_0 pypi
tensorboard 2.15.1 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorboardx 2.2 pypi_0 pypi
text-unidecode 1.3 pypi_0 pypi
thinc 8.2.1 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tifffile 2023.12.9 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.15.0 pypi_0 pypi
toolz 0.12.0 pypi_0 pypi
torch 2.1.1 pypi_0 pypi
torchaudio 2.1.1 pypi_0 pypi
torchdata 0.7.1 pypi_0 pypi
torchinfo 1.8.0 pypi_0 pypi
torchmetrics 0.11.4 pypi_0 pypi
torchtext 0.16.1 pypi_0 pypi
torchvision 0.16.1 pypi_0 pypi
tornado 6.1 py310h5764c6d_3 conda-forge
tqdm 4.66.1 pypi_0 pypi
traitlets 5.14.0 pyhd8ed1ab_0 conda-forge
transformers 4.35.2 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
typer 0.9.0 pypi_0 pypi
typing-inspect 0.9.0 pypi_0 pypi
typing_extensions 4.9.0 pyha770c72_0 conda-forge
tzdata 2023.3 pypi_0 pypi
urllib3 2.0.7 pypi_0 pypi
uvicorn 0.24.0.post1 pypi_0 pypi
virtualenv 20.21.0 pypi_0 pypi
wasabi 1.1.2 pypi_0 pypi
wcwidth 0.2.12 pyhd8ed1ab_0 conda-forge
weasel 0.3.4 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
werkzeug 3.0.1 pypi_0 pypi
wheel 0.41.2 py310h06a4308_0
wrapt 1.16.0 pypi_0 pypi
xgboost 2.0.2 pypi_0 pypi
xgboost-ray 0.1.18 pypi_0 pypi
xlrd 2.0.1 pypi_0 pypi
xlsxwriter 3.1.9 pypi_0 pypi
xlwt 1.3.0 pypi_0 pypi
xxhash 3.4.1 pypi_0 pypi
xz 5.4.5 h5eee18b_0
yarl 1.9.4 pypi_0 pypi
zeromq 4.3.4 h2531618_0
zipp 3.17.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0
Hello @arnavgarg1 - Kind follow-up on this. In the meantime, when I executed with the config below, the process completed successfully with both infra configurations.
Is there any way to make the training succeed with max_sequence_length set to 4096 (merging both input and output)? As you mentioned, a single GPU will support up to 2048, but is a 4096 context length achievable via multi-GPU?
model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
bits: 4
adapter:
type: lora
prompt:
template: |
### Instruction:
{Instruction}
### Context:
{Context}
### Input:
{Input}
### Response:
input_features:
- name: prompt
type: text
preprocessing:
max_sequence_length: 2048
output_features:
- name: Response
type: text
preprocessing:
max_sequence_length: 2048
trainer:
type: finetune
learning_rate: 0.0001
batch_size: 1
max_batch_size: 1
gradient_accumulation_steps: 1
enable_gradient_checkpointing: true
epochs: 3
learning_rate_scheduler:
warmup_fraction: 0.01
preprocessing:
sample_ratio: 1.0
backend:
type: local
Does Ray still not work with quantization? Any ideas?
Hi, I'm trying to do distributed training of llama-7b on a VM with two Tesla T4 GPUs using Ray with the deepspeed strategy. I'm facing the following error: "Could not pickle object as excessively deep recursion required."
My current OS is Ubuntu 20.04, Python version 3.10.13. model.yaml:
Environment:
Can you guide me in solving this? Thanks in advance!