pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

using Torchtune to teach LLMs a new language #1859

Open almugabo opened 1 month ago

almugabo commented 1 month ago

I am trying to do a full fine-tune of Llama3.2-1b to "teach" it another language (via continued pretraining). The idea is to have a model which, given a prompt in a language, continues the sentence in that language. I am using a dataset of about 25 million words.

When I use Unsloth for QLoRA fine-tuning of a 4-bit model, after 3 epochs the model performs as I would expect: given a prompt in that language, it responds with new text in that language that makes sense.

However, when using torchtune (with a text completion dataset), even after 5 epochs the results are not what I would expect: the model just continues in English or outputs nonsensical sentences. P.S.: the loss also behaves oddly; it goes down, then up, then down again, almost erratically (though trending downward overall).

My question is: what am I doing wrong?

Below is my configuration file:


# Model and checkpointing

xpath_model_chkp: './CPT_Llama/LLama_32_1b/_Torchtune/_model'
xpath_model_logs: './CPT_Llama/LLama_32_1b/_Torchtune/_model_logs'
xpath_dataset: './_datasets/monolingual_train.json'

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: ${xpath_model_chkp}/original/tokenizer.model
  max_seq_len: 2048

# Dataset
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: json
  data_files: ${xpath_dataset}
  column: text
  split: train

seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ${xpath_model_chkp}
  checkpoint_files: [
    model.safetensors
  ]
  recipe_checkpoint: null
  output_dir: ${xpath_model_chkp}
  model_type: LLAMA3_2
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 4
epochs: 5
optimizer:
  _component_: bitsandbytes.optim.PagedAdamW8bit
  lr: 2e-5
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null  # set to a small value (e.g. 10) for testing, otherwise null
gradient_accumulation_steps: 1
optimizer_in_bwd: True
compile: True  # set to True for better memory and performance

# Training environment
device: cuda

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${xpath_model_logs}
output_dir: ${xpath_model_logs}
log_every_n_steps: 1
log_peak_memory_stats: False
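
For context, a config like this would presumably be launched with the single-device full-finetune recipe, e.g. `tune run full_finetune_single_device --config <path to this yaml>`. Before hunting for hyperparameters, one cheap sanity check is to build the tokenizer and dataset exactly as the config names them and decode a sample back to text, to rule out a data-loading or tokenization problem. A minimal sketch, assuming the paths from the config above and the torchtune 0.3.x Python API:

# Sanity-check the data pipeline: instantiate the same components the config
# references, then decode one sample back to text.
from torchtune.models.llama3 import llama3_tokenizer
from torchtune.datasets import text_completion_dataset

tokenizer = llama3_tokenizer(
    path="./CPT_Llama/LLama_32_1b/_Torchtune/_model/original/tokenizer.model",
    max_seq_len=2048,
)
dataset = text_completion_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="./_datasets/monolingual_train.json",
    column="text",
    split="train",
)
sample = dataset[0]
print(len(sample["tokens"]), "tokens in the first sample")
print(tokenizer.decode(sample["tokens"]))  # should read as clean Kinyarwanda, not mojibake

If the decoded text looks right, the dataset wiring is fine and the problem lies elsewhere (hyperparameters, checkpoint export, or generation settings).
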
joecummings commented 1 month ago

This seems like a super cool use case @almugabo!

Lemme ask a couple of follow-up questions:

almugabo commented 1 month ago

Thank you for the reply. The language I am trying to "teach" Llama is Kinyarwanda (spoken in Rwanda, where it is one of the official languages alongside English and French).

As mentioned, it "works" with PEFT/QLoRA, but I was hoping to get better performance with full fine-tuning.

  1. Here is a sample of the dataset:
    {"text": "Yevgeny Prigozhin wayoboraga abarwanyi ba Wagner yapfuye. Itangazamakuru rya Leta y’u Burusiya riravuga ko iyo ndege yavaga mu Majyaruguru ya Moscow yerekeza mu mujyi wa St Petersburg, yari irimo abantu icumi barimo abagenzi barindwi n’abakozi b’iyo ndege batatu, bose bakaba bahasize ubuzima. Icyakora uruhande rw’abarwanyi Yevgeny Prigozhin yari ayoboye nta cyo rwahise rubitangazaho. Abo barwanyi bari baherutse kugaba ibitero byari bigamije guhirika ubutegetsi bw’u Burusiya, gusa hakaba andi makuru avuga ko gahunda yo guhirika ubutegetsi nta yari ihari, ahubwo ko Yevgeny Prigozhin n’abarwanyi be baba barakiriye amafaranga bahawe na Leta Zunze Ubumwe za Amerika nka ruswa, ibyo gushaka guhirika ubutegetsi bakabikora ari uburyo bwo kwiyerurutsa, ndetse bigategurwa ku bwumvikane na Perezida w’u Burusiya Vladimir Putin. Icyakora abandi baravuga ko nubwo Putin nta cyemezo gikomeye yafatiye abo bashatse guhirika ubutegetsi bwe, ashobora kuba yarakomeje kubagirira amakenga. Mu gihe bamwe bavuga ko Putin yashoboraga kumuhitana kubera ubwo bugambanyi bwe, abandi baravuga ko na Amerika yashoboraga kumugirira nabi kubera ko yayibeshye ndetse akabatwarira n’amafaranga ntakore ibyo bumvikanye. Abarwanyi ba Wagner yari ayoboye kandi, bakunze kubangamira inyungu za Amerika mu bihugu bakoreramo bya Afurika. Prigozhin ntiyakunze kugaragara mu ruhame nyuma y’uko muri Kamena 2023 ayoboye kudeta yamaze amasaha 24 ariko ntigire icyo igeraho. Yaherukaga kugaragara muri video mu ntangiriro z’iki cyumweru, iyo video bikavugwa ko yafatiwe muri Afurika ahantu hatatangajwe. Umunyamakuru @ h_malachie", "nwords": 219, "ntokens_llama32": 614}
    {"text": "Musanze:Abanyerondo baketse umusore ho ubujurura baramukubita bimuviramo urupfu. Abanyerondo bafashe umusore bakekaga ko ari umujura baramukubita ubundi bamunyuza mu muhanda hagati imodoka iramugonga ahita apfa.Mu kagari ka Gisesero, umurenge wa Busogo ho mu karere ka Musanze habereye impanuka yahitanye umusore wakekwagaho ibikorwa by’ubujura, Abaturage bavuga ko byatewe n’abanyerondo bagendaga bamukubita.Abatanze ubuhamya bavuga ko bari bahari bavuga ko uyu musore yakubiswe bikabije maze agata ubwenge cyangwa se inkoni ziramuhungabanya yerekera mu muhanda atabizi imodoka iramugongoNta makuru batanze avuga ko uyu musore yaba yarasanzwe yiba.icyakora bose bitsa ku kuba ngo abanyerondo bamukubise bamuketse nk’igisambo.Umuvugizi wa Polisi mu ntara y’Amajyaruguru Superitendent Jean Bosco Mwiseneza we yabwiye BTN ko uyu musore yazize umushofere utaringanije umuvuduko.Icyakora ntiyashimye gutanga amakuru ku bubasha abanyerondo bafite bwo kwambika umuntu amapingo bakamukubita byamuviriyemo urupfu", "nwords": 125, "ntokens_llama32": 376}
    {"text": "Minisitiri wa Siporo yakiriye Team Rwanda ivuye muri Shampiyona Nyafurika. Ni igikorwa cyabaye kuri uyu wa Mbere tariki ya 08 Werurwe 2021, nk’uko tubikesha urubuga rwa Twitter rwa Minisiteri ya Siporo, rwanditseho ko Minisitiri wa Siporo Aurore Mimosa Munyangaju yashimye umusaruro w’Ikipe y’Igihugu y’Amagare.\nYagize ati \"Ibyo mwakoze turabibashimira, mwahaye ibyishimo Abanyarwanda. Abanyarwanda bose bamaze kumva ko aho mugiye nta mpungenge, ko muzitwara neza\". Kapiteni wa Team Rwanda, Joseph Areruya, yavuze ko imidali 14 batwaye idahagije ugereranyije n’ibyo bifuzaga, bikaba byaratewe n’uko bakoze imyiteguro idahagije kubera Covid-19. Yongeraho ko hakenewe imyitozo myinshi kugira ngo barusheho kwitegura n’andi marushanwa ategerejwe arimo na Tour du Rwanda 2021. Shampiyona Nyafurika yebereye mu Mujyi wa Cairo mu Misiri kuva tariki ya 03 kugera ku ya 06 Werurwe 2021, aho ikipe y’u Rwanda yegukanye imidari 14 irimo umwe wa Zahabu wegukanywe na Tuyizere Etienne mu gusiganwa mu muhanda ( road race) mu cyiciro cy’ingimbi. Umunyamakuru wa Kigali Today/KT Radio @ KuradusengIsaac", "nwords": 155, "ntokens_llama32": 422}
    {"text": "Urubanza rwa Bandora rwasubitswe ku munsi warwo wa mbere. Mu rubanza rutamaze imitota igera kuri 20 rubera mu Rukiko rwisumbuye rwa Nyarugenge, ubushinjacyaha bwabanje kumenyesha Bandora ibyaha byose aregwa, ariko bumusabye kwisobanura ahita atangaza ko adashobora kuburana. Urukiko rwahise rutangaza ko rugomba gusuzuma icyo cyifuzo, rukazamusubiza kuri uyu wa gatatu tariki 20/03/2013, ku isaha ya saa munani. Bandora yagejejwe mu Rwanda tariki 10/03/2013 akuwe muri Norvege, kubera ibyaha akurikiranyweho byo kugira uruhare muri Jenoside yakorewe Abatutsi mu 1994 n’ibyaha byibasiye inyokomuntu. Bandora yafatiwe muri Malawi aho yakoraga ubucuruzi ariko aza kurekurwa. Yahavuye ajya mu Bubiligi aho yafungiwe ariko naho akaza kurekurwa mbere yo kwerekeza muri Norvege. Bandora wavutse mu 1953 mu cyahoze ari Perefegitura ya Gikongoro, akekwaho kuba yaragize uruhare mu gutoza Interahamwe mu Bugesera no guhagarikira ubwicanyi. Emmanuel N. Hitimana", "nwords": 131, "ntokens_llama32": 373}
  2. I am using the base model.
  3. I save the model with "save_pretrained" and then use transformers for generation (a minimal loading/generation sketch follows this list).
  4. I am not using nightlies; the torchtune version is "0.3.1+cpu".
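
Since generation happens outside torchtune, it is also worth confirming that the directory loaded in transformers is really the one written after training (and exported via `save_pretrained`), and that sampling settings aren't drowning out the fine-tuned behaviour. A minimal sketch, assuming a checkpoint directory already converted with `save_pretrained` (the path and the Kinyarwanda prompt below are placeholders):

# Load the converted checkpoint with transformers and generate a continuation
# for a Kinyarwanda prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "./CPT_Llama/LLama_32_1b/_Torchtune/_model"  # placeholder: directory written by save_pretrained
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16).to("cuda")

prompt = "Umujyi wa Kigali"  # placeholder prompt; any Kinyarwanda text works
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Greedy decoding (do_sample=False) is the quickest way to see whether continued pretraining actually changed the model's default continuation for an in-language prompt, before any sampling noise enters the picture.
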
pbontrager commented 1 month ago

One sanity check would be to try QLoRA on torchtune first and confirm whether the loss looks reasonable or not. If QLoRA doesn't give you the results you expect, then it's likely a torchtune configuration problem. If QLoRA works fine, then it's probably an issue with finding the right hyperparameters for full finetuning on your dataset.
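
As an aside (the exact config name is an assumption on my part and varies by release), torchtune ships ready-made LoRA/QLoRA single-device configs for Llama 3.2 1B, so the comparison run suggested above would look roughly like:

tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device

with the tokenizer/dataset/checkpointer sections swapped in from the config earlier in this thread. Running `tune ls` prints the recipes and configs actually available in the installed version, which is the safer way to pick the exact name.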