Open · kk2491 opened this issue 7 months ago
Same problem.
Hi! You should be able to fix this by passing `remove_unused_columns=False` to the transformers `TrainingArguments`, as explained in https://github.com/huggingface/peft/issues/1299. (I'm not familiar with Vertex AI, but I'd assume `remove_unused_columns` can be passed as a flag to the Docker container.)
I had the same problem, but I spent a whole day trying different combinations of my own dataset and the example dataset, and found the reason: the example data is a multi-turn conversation between human and assistant, so `### Human` and `### Assistant` each appear at least twice. If your own custom data only has single-turn conversations, it can end up with the same error. What you can do is repeat your single-turn conversation twice in your training data (keeping the key 'text' the same), and it may work. My guess is that the data-processing step only keeps multi-turn conversations (single-turn samples are discarded, so you end up with no training data), but since I am using Google Vertex AI, I don't have direct access to the underlying code, so that is just a guess.
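If it helps anyone, here is a small sketch of the duplication workaround described above; the file names are placeholders, and it assumes each JSONL record has a single 'text' field containing one `### Human:` / `### Assistant:` exchange:

```python
import json

# Placeholder paths; point these at your own JSONL files.
src, dst = "train_single_turn.jsonl", "train_two_turn.jsonl"

with open(src) as fin, open(dst, "w") as fout:
    for line in fin:
        record = json.loads(line)
        single_turn = record["text"].strip()
        # Repeat the single Human/Assistant exchange so every sample looks
        # like a two-turn conversation, as suggested in the comment above.
        record["text"] = f"{single_turn} {single_turn}"
        fout.write(json.dumps(record) + "\n")
```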
@mariosasko Thanks for the response and suggestion.
When I set `remove_unused_columns` to `False`, I end up getting a different error (will post the error soon).
Either Vertex AI does not support `remove_unused_columns`, or my dataset is completely wrong.
Thank you,
KK
@cyberyu Thanks for your suggestions.
I tried the approach you suggested: I copied the same conversation into each JSONL element, so every item now has two `HUMAN` and `ASSISTANT` turns.
However, in my case the issue persists. I will give it a few more tries and post the results here.
You can find my dataset here.
Thank you,
KK
I think another reason is that your training samples are too short. I saw a relevant report (https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/16) stating that the processing code might have a bug that discards sequences shorter than `max_seq_length`, which is 512. I'm not sure whether the Vertex AI backend code has fixed that bug. So I added some filler content to your data to push a single turn past 512 tokens, and repeated it twice. You can copy the following line five times as a training-data JSONL file of five samples (no eval or test set needed; to speed things up, set the evaluation steps to 5 and the training steps to 10), and it will pass.
{"text":"### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment. ### Human: You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You are a helpful AI Assistant familiar with customer service. You will handle customers queries and provide effective help message. Please provide response to 'Can Interplai software optimize routes for minimizing package handling and transfer times in distribution centers'? ### Assistant: Yes, Interplai software can optimize routes for distribution centers by streamlining package handling processes, minimizing transfer times between loading docks and storage areas, and optimizing warehouse layouts for efficient order fulfillment."}
@cyberyu Thank you so much, you saved my day (plus many more days).
I tried the example you provided above, and the training completed successfully in Vertex AI (through the GUI).
I never thought there would be constraints on the length of the samples or on the number of turns.
I will update my complete dataset and post an update here once the training is completed.
Thank you,
KK
Describe the bug
I am trying to fine-tune the llama2-7b model on GCP Vertex AI. The notebook I am using for this can be found here.
When I use the dataset given in the example, the training completes successfully (the example dataset can be found here).
However, when I use my own dataset, which is in the same format as the example dataset, I get the error below (my dataset can be found here).
I can see from the logs that the files are being read correctly:
Steps to reproduce the bug
kk2491/finetune_dataset_002
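Outside Vertex AI, the dataset itself can be loaded and inspected with `datasets` to confirm the files are read correctly (this reproduces only the loading step, not the training job, and assumes the default train split with a 'text' column as in the example dataset):

```python
from datasets import load_dataset

# Load the dataset referenced in this issue and print its structure.
ds = load_dataset("kk2491/finetune_dataset_002")
print(ds)
print(ds["train"][0]["text"][:200])  # peek at the first sample
```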
Expected behavior
The training should complete successfully, and model gets deployed to an endpoint.
Environment info
Python version: 3.10.12
Dataset: https://huggingface.co/datasets/kk2491/finetune_dataset_002