huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

Do we need to insert the BOS and EOS tokens fully in the train.csv? #753

Open jackswl opened 1 week ago

jackswl commented 1 week ago

When fine-tuning an LLM using train.csv, does each sample require the full template, including the BOS and EOS tokens?

For example, if the model's bos_token is <s>, do I need to include it in the train.csv samples as well?

jackswl commented 1 week ago

@abhishekkrthakur, for example, for Mistral 7B Instruct v0.3, will each example in train.csv look something like this in the text column: <s>[INST] hi this is user[/INST] this is assistant </s>

Is this right? Each row will contain the entire string with the chat template already applied, right?

This is when chat_template is set to null in the .yml file.

abhishekkrthakur commented 1 week ago

You don't need to format the dataset yourself if you keep it in JSON format and use a chat template, for example the no_robots dataset. If your dataset is plain text and you are training a chat model, you need to add the special tokens and tags yourself. There is a parameter that can add the end token, by the way.
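
To make the two layouts concrete, here is a minimal sketch (not taken from this thread; the messages/text field names are assumptions based on common HF chat datasets such as no_robots):

# Option A: chat-format record, used together with a chat_template;
# the template and special tokens are applied for you at training time.
chat_sample = {
    "messages": [
        {"role": "user", "content": "hi this is user"},
        {"role": "assistant", "content": "this is assistant"},
    ]
}

# Option B: plain-text record in the text column, with the template,
# special tokens, and tags already added by you.
text_sample = "<s>[INST] hi this is user[/INST] this is assistant </s>"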

jackswl commented 1 week ago

@abhishekkrthakur, I am currently using plain text. However, my plain text already contains all the special tokens and tags because I pre-applied the chat template myself. Is this approach OK?

So, for example, one of my samples inside train.csv could be: <s>[INST] hi this is user[/INST] this is assistant </s>, which already has the chat template applied. Just double-confirming that this is OK for AutoTrain? My train.csv will contain plain text in the text column that already has the chat template applied.

abhishekkrthakur commented 1 week ago

yes. in that case, make sure chat_template is set to none.

jackswl commented 1 week ago

OK, thanks for the swift reply. I was trying to re-confirm so that I don't run into issues like the BOS token being applied again on top of my train.csv during fine-tuning.

@abhishekkrthakur however, I realized that when the tokenizer encodes the plain text, I think it will automatically add the BOS token again. Is that right? In that case, do I need to remove the BOS token from train.csv and add it back during inference?

How I know this: when the tokenizer encodes the plain text and I decode it back, it automatically adds the BOS token, which results in two BOS tokens.

jackswl commented 1 week ago

@abhishekkrthakur for example:

messages = '''<s>[INST] You are an expert Python programmer. Your task is to do this.[/INST] assistant goes here </s>'''

tokenizer(messages) <-- this will prepend <s> again on top of the encoded messages string.

By doing this, the <s> is applied twice. So I was wondering: will AutoTrain do this to my plain text, or should my plain text include all special tags EXCEPT the BOS token? I cannot seem to find any info on this in the source code. I would kindly need your advice.
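
One way to check this behaviour directly (a minimal sketch, assuming access to the Mistral-7B-Instruct-v0.3 tokenizer; any Llama/Mistral-style tokenizer that adds a BOS token behaves similarly):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
text = "<s>[INST] hi this is user[/INST] this is assistant </s>"

# Default call: the tokenizer prepends its own BOS, so together with the
# literal <s> already in the string you typically end up with two BOS tokens.
with_special = tok(text).input_ids

# With add_special_tokens=False the string is encoded as written,
# so the single <s> from train.csv is the only BOS.
without_special = tok(text, add_special_tokens=False).input_ids

print(tok.decode(with_special))      # typically starts with two <s> tokens
print(tok.decode(without_special))   # keeps just the <s> from the string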

abhishekkrthakur commented 1 week ago

choose llm generic and disable option to add end token and it will be fine.

jackswl commented 1 week ago

@abhishekkrthakur Could you help with this? Just re-clarifying. This is my original .yml file below. Do I change the task in the first line from llm-sft to llm, then insert add_eos_token: false under the data section, and also add trainer: default under the data section? Did I leave out anything? I am using LoRA too.

Will not using sft cause any problems, since my task is actually SFT?

task: llm-sft
base_model: /scratch/xxx
project_name: xxx
log: none
backend: local

data:
  path: /home/xxx
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  block_size: 4096
  model_max_length: 4096
  epochs: 20
  batch_size: 4 
  lr: 1e-4
  peft: true
  quantization: int4
  target_modules: "q_proj,v_proj,o_proj,k_proj,gate_proj,down_proj,up_proj" 
  padding: right
  optimizer: adamw_torch
  scheduler: cosine
  gradient_accumulation: 16 
  mixed_precision: bf16         
  warmup_ratio: 0.1
  weight_decay: 0.1
  lora_r: 16
  lora_alpha: 16
  lora_dropout: 0
  merge_adapter: false
  use_flash_attention_2: true  
  logging_steps: 1
  unsloth: false
  seed: 42

jackswl commented 1 week ago

@abhishekkrthakur Because there are no generic trainer example configs, I am not 100% sure about this. It would be really helpful if you could clarify.

abhishekkrthakur commented 1 week ago

It's just llm. If you remove the sft and use only llm, or add trainer: default, it's generic training.
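
Based on that, the relevant changes to the config above would look roughly like this (a sketch, not an official example; whether add_eos_token is the exact parameter name for the "add end token" option, and whether params is the right section for it, are assumptions to verify against the current AutoTrain parameters):

task: llm                 # was llm-sft; plain llm (or trainer: default) selects the generic trainer

params:
  add_eos_token: false    # assumed name/placement of the "add end token" option
  # ...rest of the params block unchanged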

abhishekkrthakur commented 1 week ago

I'll add a config :) thanks for letting me know

jackswl commented 1 week ago

It's just llm. If you remove the sft and use only llm, or add trainer: default, it's generic training.

Sorry @abhishekkrthakur, you mean I can simply change the task from llm-sft to llm and it should be OK? I believe I also need to add add_eos_token: false under the params section, is that right?

Side note: does it mean that if I use llm-sft as I did originally, I do not have to add the BOS and EOS tokens, since they will be applied during fine-tuning? That seems a bit contradictory, because earlier you mentioned I need to add the special tokens to my plain text.

Let's say I want to do supervised fine-tuning on plain text. The plain text already has the chat template applied, as mentioned in the comment above.

You are saying I can use the generic llm trainer for this task, instead of llm-sft? What's the difference between the two?

jackswl commented 1 week ago

@abhishekkrthakur would greatly appreciate your reply on this, thanks a ton

abhishekkrthakur commented 1 week ago

I'll add an example for your use case and update here ASAP :)

jackswl commented 1 week ago

@abhishekkrthakur thanks, I hope my problem was clear.

I have plain text (just a string) that already has the chat template applied, which means it includes all the special tokens and tags. The BOS and EOS tokens are already present within this plain text.

I do not want a duplicate BOS token (or EOS token) to be added during the fine-tuning process using AutoTrain (llm-sft), because the tokenizer will automatically prepend a BOS token during fine-tuning, resulting in double BOS tokens.

I'm wondering if the generic llm trainer you mentioned can address this.

Thanks for the help! Looking forward to the update 👍🏻🙏🏻

jackswl commented 1 week ago

@abhishekkrthakur do you have any updates on this? I just want to make sure I am taking the correct approach with your awesome package for my use case (the plain text in train.csv already has the full chat template applied). Thank you so much

abhishekkrthakur commented 1 week ago

Unfortunately, I didn't get a chance to look deeper into it yet, but I will do so and update here as soon as possible. Thank you for your patience.

jackswl commented 4 days ago

@abhishekkrthakur any updates on this? I just need to know whether your generic trainer will automatically add special tokens (i.e. the BOS token) when tokenizing the dataset.

For instance, with tokenizer(text, add_special_tokens=False).input_ids, setting add_special_tokens to False will not add the BOS token to the plain text. Does your generic trainer pass add_special_tokens=False when tokenizing?

It seems like it doesn't, because you have this in your utils:

def tokenize(examples, tokenizer, config):
    output = tokenizer(examples[config.text_column])
    return output

and you simply tokenize the dataset's plain text, so the tokenizer will automatically add the BOS token on top of the one already in my data.
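
If that is the case, one pragmatic workaround in line with the earlier suggestion of keeping every special tag EXCEPT the BOS token (a sketch, not an official AutoTrain recommendation; assumes pandas >= 1.4 for str.removeprefix) is to strip the literal leading <s> from train.csv and let the tokenizer prepend its own:

import pandas as pd

BOS = "<s>"

df = pd.read_csv("train.csv")
# Drop a literal leading BOS so the tokenizer's automatic one is the only BOS;
# the [INST] tags and the trailing </s> stay in the pre-templated text.
df["text"] = df["text"].str.removeprefix(BOS)
df.to_csv("train.csv", index=False)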