almugabo opened this issue 1 month ago
This seems like a super cool use case @almugabo!
Lemme ask a couple follow-up questions:
Thank you for the reply. The language I am trying to "teach Llama" is Kinyarwanda, one of the official languages of Rwanda alongside English and French.
As mentioned, it "works" with PEFT/QLoRA, but I was hoping to get better performance with full fine-tuning.
{"text": "Yevgeny Prigozhin wayoboraga abarwanyi ba Wagner yapfuye. Itangazamakuru rya Leta y’u Burusiya riravuga ko iyo ndege yavaga mu Majyaruguru ya Moscow yerekeza mu mujyi wa St Petersburg, yari irimo abantu icumi barimo abagenzi barindwi n’abakozi b’iyo ndege batatu, bose bakaba bahasize ubuzima. Icyakora uruhande rw’abarwanyi Yevgeny Prigozhin yari ayoboye nta cyo rwahise rubitangazaho. Abo barwanyi bari baherutse kugaba ibitero byari bigamije guhirika ubutegetsi bw’u Burusiya, gusa hakaba andi makuru avuga ko gahunda yo guhirika ubutegetsi nta yari ihari, ahubwo ko Yevgeny Prigozhin n’abarwanyi be baba barakiriye amafaranga bahawe na Leta Zunze Ubumwe za Amerika nka ruswa, ibyo gushaka guhirika ubutegetsi bakabikora ari uburyo bwo kwiyerurutsa, ndetse bigategurwa ku bwumvikane na Perezida w’u Burusiya Vladimir Putin. Icyakora abandi baravuga ko nubwo Putin nta cyemezo gikomeye yafatiye abo bashatse guhirika ubutegetsi bwe, ashobora kuba yarakomeje kubagirira amakenga. Mu gihe bamwe bavuga ko Putin yashoboraga kumuhitana kubera ubwo bugambanyi bwe, abandi baravuga ko na Amerika yashoboraga kumugirira nabi kubera ko yayibeshye ndetse akabatwarira n’amafaranga ntakore ibyo bumvikanye. Abarwanyi ba Wagner yari ayoboye kandi, bakunze kubangamira inyungu za Amerika mu bihugu bakoreramo bya Afurika. Prigozhin ntiyakunze kugaragara mu ruhame nyuma y’uko muri Kamena 2023 ayoboye kudeta yamaze amasaha 24 ariko ntigire icyo igeraho. Yaherukaga kugaragara muri video mu ntangiriro z’iki cyumweru, iyo video bikavugwa ko yafatiwe muri Afurika ahantu hatatangajwe. Umunyamakuru @ h_malachie", "nwords": 219, "ntokens_llama32": 614}
{"text": "Musanze:Abanyerondo baketse umusore ho ubujurura baramukubita bimuviramo urupfu. Abanyerondo bafashe umusore bakekaga ko ari umujura baramukubita ubundi bamunyuza mu muhanda hagati imodoka iramugonga ahita apfa.Mu kagari ka Gisesero, umurenge wa Busogo ho mu karere ka Musanze habereye impanuka yahitanye umusore wakekwagaho ibikorwa by’ubujura, Abaturage bavuga ko byatewe n’abanyerondo bagendaga bamukubita.Abatanze ubuhamya bavuga ko bari bahari bavuga ko uyu musore yakubiswe bikabije maze agata ubwenge cyangwa se inkoni ziramuhungabanya yerekera mu muhanda atabizi imodoka iramugongoNta makuru batanze avuga ko uyu musore yaba yarasanzwe yiba.icyakora bose bitsa ku kuba ngo abanyerondo bamukubise bamuketse nk’igisambo.Umuvugizi wa Polisi mu ntara y’Amajyaruguru Superitendent Jean Bosco Mwiseneza we yabwiye BTN ko uyu musore yazize umushofere utaringanije umuvuduko.Icyakora ntiyashimye gutanga amakuru ku bubasha abanyerondo bafite bwo kwambika umuntu amapingo bakamukubita byamuviriyemo urupfu", "nwords": 125, "ntokens_llama32": 376}
{"text": "Minisitiri wa Siporo yakiriye Team Rwanda ivuye muri Shampiyona Nyafurika. Ni igikorwa cyabaye kuri uyu wa Mbere tariki ya 08 Werurwe 2021, nk’uko tubikesha urubuga rwa Twitter rwa Minisiteri ya Siporo, rwanditseho ko Minisitiri wa Siporo Aurore Mimosa Munyangaju yashimye umusaruro w’Ikipe y’Igihugu y’Amagare.\nYagize ati \"Ibyo mwakoze turabibashimira, mwahaye ibyishimo Abanyarwanda. Abanyarwanda bose bamaze kumva ko aho mugiye nta mpungenge, ko muzitwara neza\". Kapiteni wa Team Rwanda, Joseph Areruya, yavuze ko imidali 14 batwaye idahagije ugereranyije n’ibyo bifuzaga, bikaba byaratewe n’uko bakoze imyiteguro idahagije kubera Covid-19. Yongeraho ko hakenewe imyitozo myinshi kugira ngo barusheho kwitegura n’andi marushanwa ategerejwe arimo na Tour du Rwanda 2021. Shampiyona Nyafurika yebereye mu Mujyi wa Cairo mu Misiri kuva tariki ya 03 kugera ku ya 06 Werurwe 2021, aho ikipe y’u Rwanda yegukanye imidari 14 irimo umwe wa Zahabu wegukanywe na Tuyizere Etienne mu gusiganwa mu muhanda ( road race) mu cyiciro cy’ingimbi. Umunyamakuru wa Kigali Today/KT Radio @ KuradusengIsaac", "nwords": 155, "ntokens_llama32": 422}
{"text": "Urubanza rwa Bandora rwasubitswe ku munsi warwo wa mbere. Mu rubanza rutamaze imitota igera kuri 20 rubera mu Rukiko rwisumbuye rwa Nyarugenge, ubushinjacyaha bwabanje kumenyesha Bandora ibyaha byose aregwa, ariko bumusabye kwisobanura ahita atangaza ko adashobora kuburana. Urukiko rwahise rutangaza ko rugomba gusuzuma icyo cyifuzo, rukazamusubiza kuri uyu wa gatatu tariki 20/03/2013, ku isaha ya saa munani. Bandora yagejejwe mu Rwanda tariki 10/03/2013 akuwe muri Norvege, kubera ibyaha akurikiranyweho byo kugira uruhare muri Jenoside yakorewe Abatutsi mu 1994 n’ibyaha byibasiye inyokomuntu. Bandora yafatiwe muri Malawi aho yakoraga ubucuruzi ariko aza kurekurwa. Yahavuye ajya mu Bubiligi aho yafungiwe ariko naho akaza kurekurwa mbere yo kwerekeza muri Norvege. Bandora wavutse mu 1953 mu cyahoze ari Perefegitura ya Gikongoro, akekwaho kuba yaragize uruhare mu gutoza Interahamwe mu Bugesera no guhagarikira ubwicanyi. Emmanuel N. Hitimana", "nwords": 131, "ntokens_llama32": 373}
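For reference, each training example above is one JSON object per line with `text`, `nwords`, and `ntokens_llama32` fields. A minimal sketch of producing records in that shape (the optional `tokenizer` argument is an assumption for illustration; `ntokens_llama32` really requires the Llama 3.2 tokenizer, so it is only filled in when a tokenizer with an `encode` method is supplied):

```python
import json

def make_record(text, tokenizer=None):
    """Build one JSONL record in the shape of the samples above.

    `nwords` is a simple whitespace word count; `ntokens_llama32` is
    only added when a tokenizer object is passed in (hypothetical
    argument, not part of the original post).
    """
    record = {"text": text, "nwords": len(text.split())}
    if tokenizer is not None:
        record["ntokens_llama32"] = len(tokenizer.encode(text))
    return record

sample = make_record("Urubanza rwa Bandora rwasubitswe ku munsi warwo wa mbere.")
# ensure_ascii=False keeps Kinyarwanda characters readable in the JSONL
print(json.dumps(sample, ensure_ascii=False))
```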
One sanity check would be to try qlora on torchtune first and confirm whether the loss looks reasonable or not. If qlora doesn't give you the results you expect, then it's likely a torchtune configuration problem. If qlora works fine, then it's probably an issue with finding the right hyperparameters for full finetuning on your dataset.
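One cheap way to make "the loss looks reasonable" concrete is to smooth the per-step loss with a moving average and check whether the trend is actually downward, since raw continual-pretraining loss is expected to be jagged across heterogeneous documents. A dependency-free sketch (the loss values here are synthetic, purely for illustration):

```python
def moving_average(values, window=50):
    """Return the trailing moving average of `values`."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value leaving the window
        out.append(total / min(i + 1, window))
    return out

# Synthetic noisy-but-decreasing losses, standing in for real training logs
losses = [2.0 - 0.01 * i + (0.3 if i % 7 == 0 else -0.05) for i in range(200)]
smoothed = moving_average(losses, window=50)
print(smoothed[0], smoothed[-1])  # smoothed tail sits well below the head
```

If the smoothed curve is flat or rising, the problem is likely hyperparameters (learning rate, schedule) rather than the dataset.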
I am trying to full fine-tune Llama3.2-1b to "teach" it another language (via continuous pretraining). The idea is to have a model which, given a prompt in a language, continues the sentence in that language. I am using a dataset of about 25 million words.
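A rough token-budget estimate for that corpus, using the `nwords`/`ntokens_llama32` metadata from the dataset samples shown earlier in the thread (the per-record numbers below are copied from those samples; the ratio over the full corpus may of course differ):

```python
# (nwords, ntokens_llama32) pairs taken from the sample records above
samples = [(219, 614), (125, 376), (155, 422), (131, 373)]

total_words = sum(w for w, _ in samples)
total_tokens = sum(t for _, t in samples)
tokens_per_word = total_tokens / total_words
print(f"tokens per word: {tokens_per_word:.2f}")

corpus_words = 25_000_000  # figure stated in the post
est_tokens = int(corpus_words * tokens_per_word)
print(f"estimated corpus size: ~{est_tokens / 1e6:.0f}M tokens")
```

At roughly 2.8 Llama 3.2 tokens per Kinyarwanda word, 25M words is on the order of 70M tokens, so multiple epochs over it is a fairly small continual-pretraining budget.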
When I use Unsloth for QLoRA fine-tuning of a 4-bit model, after 3 epochs the model performs as I would expect: I give a prompt in that language and get new text in that language which makes sense.
However, when using torchtune (with a text completion dataset), even after 5 epochs the results are not what I would expect: the model just continues in English or outputs nonsensical sentences. P.S.: the loss also behaves oddly; it goes down, then up, then down again, erratic but with an overall downtrend.
My question is: what am I doing wrong?
Below is my configuration file: