Open jackswl opened 1 week ago
@abhishekkrthakur, for example, for mistral 7b instruct v0.3,
will each example in train.csv look something like this in the text column:
<s>[INST] hi this is user[/INST] this is assistant </s>
is this right? it will include the entire thing with the chat template already included, right?
this is when the chat_template is set to null in the .yml file
you don't need to format the dataset if you are using a chat template; just keep it in json format, like the no_robots dataset for example. if your dataset is plain text and you are training a chat model, you need to add the special tokens and tags yourself. there is a parameter that can add the end token, by the way.
@abhishekkrthakur, i am currently using plain text. However, my plain text already contains all the special tokens and tags from pre-applying the chat template myself. Is this approach ok?
so for example, one of my samples inside train.csv can be:
<s>[INST] hi this is user[/INST] this is assistant </s>
which already has the chat template applied. Just double confirming that this is ok for autotrain? So the text column of my train.csv will contain plain text with the chat template already applied.
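A minimal sketch of how such a train.csv could be produced is shown here. The template string is copied from the sample in this thread, and the helper names `format_sample` and `write_train_csv` are hypothetical; check the template against the chat template of the model you are actually fine-tuning.

```python
import csv

# Template copied from the sample in this thread (Mistral-style instruct format).
# This is an assumption: verify it against your model's own chat template.
MISTRAL_TEMPLATE = "<s>[INST] {user}[/INST] {assistant} </s>"


def format_sample(user: str, assistant: str) -> str:
    """Return one training string with the chat template already applied."""
    return MISTRAL_TEMPLATE.format(user=user, assistant=assistant)


def write_train_csv(pairs, path="train.csv"):
    """Write (user, assistant) pairs to a CSV file with a single `text` column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for user, assistant in pairs:
            writer.writerow({"text": format_sample(user, assistant)})


print(format_sample("hi this is user", "this is assistant"))
```

Calling `write_train_csv([("hi this is user", "this is assistant")])` would then write a file whose `text` column holds the pre-templated strings.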
yes. in that case, make sure chat_template is set to none.
ok thanks for the swift reply. was trying to re-confirm, so I don't get errors like having BOS token applying again on top of my train.csv during fine-tuning.
@abhishekkrthakur however, I realized that when the tokenizer encodes the plain text, I think it will automatically add the BOS token again. Is that right? In that case, do I need to remove the BOS token from train.csv and add it during inference?
how I know this --> when the tokenizer encodes the plain text and I then decode it back, it has automatically added a BOS token, which results in two BOS tokens.
@abhishekkrthakur for example:
messages = '''<s>
[INST] You are an expert Python programmer. Your task is to do this.[/INST]
assistant goes here
</s>
'''
tokenizer(messages) <-- this will add <s> again on top of the encoded messages string.
By doing this, the <s> will be applied twice. So I was wondering whether autotrain will do this for my plain text, or should my plain text include all special tags EXCEPT the BOS token? I cannot find any info on this in the source code. I would appreciate your advice.
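One option raised in the question above is to keep every tag in the text except the leading BOS marker. A minimal sketch of that option follows, assuming the tokenizer prepends its own BOS during encoding; `strip_leading_bos` is a hypothetical helper, and whether it is needed depends on how the trainer actually tokenizes the text.

```python
def strip_leading_bos(text: str, bos_token: str = "<s>") -> str:
    """Remove one literal BOS marker from the very start of `text`.

    Assumption: the tokenizer adds BOS itself when encoding, so a literal
    marker left in the text would show up as a second BOS token.
    """
    if text.startswith(bos_token):
        return text[len(bos_token):]
    return text


sample = "<s>[INST] hi this is user[/INST] this is assistant </s>"
print(strip_leading_bos(sample))
# prints: [INST] hi this is user[/INST] this is assistant </s>
```

Text that does not start with the marker is returned unchanged, so the helper can be applied to a whole column safely.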
choose llm generic and disable option to add end token and it will be fine.
@abhishekkrthakur Could you help on this? Just re-clarifying. This is my original .yml file below. Do I change the task in the first line from llm-sft to llm, then insert add_eos_token: false under the data section, then also add trainer: default under the data section? Did I leave out anything? I am using Lora too.
will this cause any problems, not using sft? since my task is actually sft.
task: llm-sft
base_model: /scratch/xxx
project_name: xxx
log: none
backend: local
data:
  path: /home/xxx
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text
params:
  block_size: 4096
  model_max_length: 4096
  epochs: 20
  batch_size: 4
  lr: 1e-4
  peft: true
  quantization: int4
  target_modules: "q_proj,v_proj,o_proj,k_proj,gate_proj,down_proj,up_proj"
  padding: right
  optimizer: adamw_torch
  scheduler: cosine
  gradient_accumulation: 16
  mixed_precision: bf16
  warmup_ratio: 0.1
  weight_decay: 0.1
  lora_r: 16
  lora_alpha: 16
  lora_dropout: 0
  merge_adapter: false
  use_flash_attention_2: true
  logging_steps: 1
  unsloth: false
  seed: 42
@abhishekkrthakur because there are no generic config templates, I am not 100% sure about this. It would be really helpful if you could clarify.
it's just llm. if you remove the sft and use only llm, or add trainer: default, it's generic training.
i'll add a config :) thanks for letting me know
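The change suggested in this thread (generic `llm` task, end-token option disabled) can be sketched as a patch on a config loaded into a dict. This is only an illustration of the discussion: `to_generic_llm` is a hypothetical helper, and the placement of `add_eos_token` under `params` is an assumption that the thread does not confirm.

```python
import copy


def to_generic_llm(config: dict) -> dict:
    """Return a copy of an AutoTrain-style config switched to generic training.

    Assumptions taken from the discussion, not verified against AutoTrain:
    - `task: llm` selects generic training instead of SFT
    - `add_eos_token` belongs in the `params` section
    """
    patched = copy.deepcopy(config)
    patched["task"] = "llm"
    patched.setdefault("params", {})["add_eos_token"] = False
    return patched


original = {
    "task": "llm-sft",
    "data": {"chat_template": None, "column_mapping": {"text_column": "text"}},
    "params": {"peft": True, "block_size": 4096},
}
patched = to_generic_llm(original)
print(patched["task"], patched["params"]["add_eos_token"])
```

The deep copy keeps the original config untouched, so both variants can be compared or written out side by side.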
sorry @abhishekkrthakur, you mean I can simply change the task from llm-sft to llm, and it should be ok? I believe I also need to add add_eos_token: false under the params section, is that right?
side note: does this mean that if I use llm-sft as I did originally, I do not have to add the BOS and EOS tokens, since they will be applied during fine-tuning? It's a bit contradictory, because earlier you mentioned that we need to add the special tokens to the plain text.
Let's say I want to do supervised fine-tuning via plain text. The plain text already has the full chat template applied, as mentioned in the comment above.
You are saying I can use llm generic for this task instead of llm sft? What's the difference between the two?
@abhishekkrthakur would greatly appreciate your reply on this, thanks a ton
ill add an example for your use case and update here asap :)
@abhishekkrthakur thanks, I hope my problem was clear.
I have plain text (just a string) with the chat template already applied, which means it includes all the special tokens and tags. The EOS and BOS tokens are already in this plain text as well.
I do not want a duplicate BOS token (or EOS token) to be added during fine-tuning with autotrain (llm-sft), because the tokenizer will automatically prepend a BOS token during fine-tuning, resulting in double BOS tokens.
Wondering if generic llm like you mentioned can tackle this problem.
Thanks for the help! Looking forward to the update 👍🏻🙏🏻
@abhishekkrthakur do you have any updates on this? I just want to ensure that I am doing the correct approach using your awesome package, for my use case (plain text in train.csv already has the full chat template applied). Thank you so much
unfortunately, i didnt get a chance to look deeper into it yet. but i will do it and update here as soon as possible. thank you for your patience.
@abhishekkrthakur any updates on this? i just need to know if your generic trainer will automatically add special tokens (i.e. BOS token) when tokenizing the dataset.
tokenizer(text, add_special_tokens=False).input_ids
for instance, setting add_special_tokens to False will not add the BOS token to the plain text. does your generic trainer leave add_special_tokens=True by default?
It seems it does not turn it off, because you have this in your utils:
def tokenize(examples, tokenizer, config):
    output = tokenizer(examples[config.text_column])
    return output
and you simply tokenize the dataset plain text. therefore, it will automatically add the BOS token.
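A rough way to check for the duplicate, and a hypothetical variant of the helper quoted above that passes `add_special_tokens=False`, could look like the sketch below. Neither function is part of autotrain; `count_leading_bos` and `tokenize_without_special_tokens` are illustrative names, and the variant assumes a Hugging Face tokenizer, whose call accepts the `add_special_tokens` argument.

```python
def count_leading_bos(input_ids, bos_token_id):
    """Count how many times `bos_token_id` repeats at the start of `input_ids`."""
    count = 0
    for token_id in input_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


def tokenize_without_special_tokens(examples, tokenizer, text_column="text"):
    """Hypothetical variant of the quoted helper (not autotrain's own code).

    add_special_tokens=False asks the tokenizer not to add BOS/EOS itself,
    so markers already written in the text should be the only ones present.
    """
    return tokenizer(examples[text_column], add_special_tokens=False)


# With a real tokenizer, the check would look roughly like this:
#   ids = tokenizer(text).input_ids
#   count_leading_bos(ids, tokenizer.bos_token_id)  # 2 would mean a duplicated BOS
```

Running the count on a few encoded samples before training is a cheap way to see which behaviour a given tokenizer and text combination actually produces.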
When fine-tuning LLM using train.csv, does the sample require the full template which includes the bos and eos?
For example, if the model bos_token is <s>, do I need to include it in the train.csv sample as well?