Closed xkszltl closed 5 months ago
Thanks for the repro @xkszltl, I will try to reproduce and fix the issue and make another patch release.
Hi @xkszltl, it seems to be caused by remove_unused_columns=False. Can you meanwhile either revert to 0.7.7 or set remove_unused_columns=True? I'll try to provide the right fix in the meantime.
True won't work at all (regardless of the version), which is why it's False in the first place. My understanding is that it drops the "text" column because the model only has "input_ids" on its interface?
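The column-dropping behavior described above can be sketched as follows. This is a minimal stand-in illustration (not the actual Trainer code) of how transformers filters dataset columns against the model's forward signature, assuming a forward that only accepts the usual LM arguments:

```python
import inspect

# stand-in for a causal LM forward; the real model accepts input_ids etc.
def forward(input_ids=None, attention_mask=None, labels=None):
    pass

# columns are kept only if their name matches a forward() parameter
signature_columns = set(inspect.signature(forward).parameters)
dataset_columns = ["text"]  # an untokenized SFT dataset only has "text"

kept = [c for c in dataset_columns if c in signature_columns]
print(kept)  # -> [] : every column is dropped, leaving an empty table
```

With remove_unused_columns=True, "text" never matches the signature, so the raw dataset loses its only column before the SFTTrainer gets a chance to tokenize it.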
Traceback (most recent call last):
File "./test.py", line 58, in main
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 315, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 451, in __iter__
current_batch = next(dataloader_iter)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2804, in __getitems__
batch = self.__getitem__(keys)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2800, in __getitem__
return self._getitem(key)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2784, in _getitem
pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 583, in query_table
_check_valid_index_key(key, size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
_check_valid_index_key(int(max(key)), size=size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 23887 is out of bounds for size 0
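The "out of bounds for size 0" message suggests the training dataset ended up with zero readable rows once its only column was removed. A stand-in sketch (not the datasets/pyarrow implementation) of why a table with no columns reports length 0:

```python
class TinyTable:
    """Stand-in for an Arrow-backed table: rows are only countable through columns."""

    def __init__(self, columns):
        self.columns = columns  # name -> list of values

    def __len__(self):
        # with no columns left there is nothing to measure, so length is 0
        if not self.columns:
            return 0
        return len(next(iter(self.columns.values())))

    def __getitem__(self, i):
        if i >= len(self):
            raise IndexError(f"Invalid key: {i} is out of bounds for size {len(self)}")
        return {name: values[i] for name, values in self.columns.items()}

table = TinyTable({"text": ["a", "b", "c"]})
table.columns.pop("text")  # what dropping the last "unused" column effectively does
# table[0] now raises IndexError: Invalid key: 0 is out of bounds for size 0
```

The sampler, however, was built from the original row count, so it still emits indices like 23887 against a now-empty table.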
I'm currently pinning to trl<0.7.8.
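While pinning, a small guard can fail fast if one of the regressed releases is installed. This is a hypothetical helper (is_regressed and check_trl are not part of trl), and the bad-version set is taken from this thread:

```python
from importlib.metadata import PackageNotFoundError, version

# Releases this thread identifies as regressed; 0.7.7 is fine.
def is_regressed(v: str) -> bool:
    return v in {"0.7.8", "0.7.9"}

def check_trl() -> None:
    try:
        v = version("trl")
    except PackageNotFoundError:
        return  # trl not installed, nothing to check
    if is_regressed(v):
        raise RuntimeError(
            f"trl {v} has the remove_unused_columns regression; pin trl<0.7.8"
        )
```

Calling check_trl() at the top of a training script turns a confusing mid-training IndexError into an immediate, explicit failure.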
Hi @xkszltl, thanks for your patience! I had a deeper look at the issue and made https://github.com/huggingface/trl/pull/1229, which should resolve it. Regarding https://github.com/huggingface/trl/issues/1216#issuecomment-1886387405 - can you try updating datasets? I cannot repro with:
import datasets
import peft
import transformers
import trl

model_dir = "HuggingFaceM4/tiny-random-LlamaForCausalLM"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)

ds_train = datasets.load_dataset("imdb", split="train[:10]")

trainer = trl.SFTTrainer(
    model=model,
    args=transformers.TrainingArguments(
        output_dir="output",
        max_steps=1,
        remove_unused_columns=True,
    ),
    peft_config=peft.LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",  # must be "CAUSAL_LM"; "Causal_LM" is not a valid peft TaskType
    ),
    train_dataset=ds_train,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=8,
)
trainer.train()
Are you trying with master or a release?
@xkszltl on master currently
@xkszltl can you try and let me know how it goes?
I've only tried released wheels, so that may be the reason. I can give master a try after that PR is merged.
I see, ok! If you want you can build from that branch:
pip install -U git+https://github.com/huggingface/trl.git@fix-breaking-change
Still repros on the branch, and I'm using a different dataset this time, not just imdb.
In case version matters:
Name: accelerate
Version: 0.26.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: peft, trl
---
Name: datasets
Version: 2.16.1
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyarrow-hotfix, pyyaml, requests, tqdm, xxhash
Required-by: trl
---
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, peft, trl
---
Name: transformers
Version: 4.36.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, trl
---
Name: trl
Version: 0.7.10.dev0
Summary: Train transformer language models with reinforcement learning.
Home-page: https://github.com/huggingface/trl
Author: Leandro von Werra
Author-email: leandro.vonwerra@gmail.com
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, datasets, numpy, torch, transformers, tyro
Required-by:
@xkszltl I am using the same library versions as you and was not able to repro, did you run this script: https://github.com/huggingface/trl/issues/1216#issuecomment-1892207860 ?
# CUDA_VISIBLE_DEVICES=0 ./try.py
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 771/771 [00:00<00:00, 5.36MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 789kB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 4.61MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 552/552 [00:00<00:00, 4.44MB/s]
config.json: 100%|██████████████████████████████████████████████████████████████████████████| 466/466 [00:00<00:00, 3.74MB/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████| 2.07M/2.07M [00:00<00:00, 10.1MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████| 138/138 [00:00<00:00, 1.06MB/s]
Downloading readme: 100%|███████████████████████████████████████████████████████████████| 7.81k/7.81k [00:00<00:00, 38.4MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 21.0M/21.0M [00:03<00:00, 6.42MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 20.5M/20.5M [00:03<00:00, 6.45MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 42.0M/42.0M [00:05<00:00, 7.14MB/s]
Generating train split: 100%|███████████████████████████████████████████████| 25000/25000 [00:00<00:00, 192120.43 examples/s]
Generating test split: 100%|████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 205666.45 examples/s]
Generating unsupervised split: 100%|████████████████████████████████████████| 50000/50000 [00:00<00:00, 221609.66 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1023.63 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "./try.py", line 38, in <module>
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 330, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 451, in __iter__
current_batch = next(dataloader_iter)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2805, in __getitems__
batch = self.__getitem__(keys)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2801, in __getitem__
return self._getitem(key)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2785, in _getitem
pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 583, in query_table
_check_valid_index_key(key, size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
_check_valid_index_key(int(max(key)), size=size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 9 is out of bounds for size 0
0%| | 0/1 [00:00<?, ?it/s]
This is the output from that script. Everything is very fresh because it's running in Docker; you can see both the model and the dataset are pulled in this run, not even from cache.
And I've seen others talking about something similar:
This is a new regression introduced in trl 0.7.8 (and 0.7.9); 0.7.7 is fine.
We run into
ValueError: too many dimensions 'str'
when loading data into the trainer. Here's a simple LLAMA2+LoRA fine-tuning on the IMDB dataset as a minimal repro:
0.7.7 works:
0.7.8 failed: