kyegomez / BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch
https://discord.gg/qUtxnK2NMf
MIT License

[BUG] Loss drops, model still produces gibberish? #23

Closed MichelNivard closed 5 months ago

MichelNivard commented 8 months ago

Describe the bug

After 5300 iterations the loss is near 2.7. Is the model still supposed to spit out near-gibberish?

To Reproduce

Running on CPU on a MacBook Air M2, omitting the model.cuda() line

Expected behaviour

Some kind of convergence on sentences that are at least English-ish?

Screenshots image

Additional context

Maybe my expectations are just off and I should train way way more?

kyegomez commented 8 months ago

@MichelNivard try training it now and see what happens, I've made many optimizations

MichelNivard commented 8 months ago

Okay digging into it later today, thanks!

xwin commented 8 months ago

Hi, I trained the model to completion using the train.py script, although I used a larger batch size and fewer epochs because I trained on a different GPU.

training loss: 2.462737798690796
validation loss: 2.5802037715911865

However, the model produces gibberish:

nlsl,slontpg -ytasetcratiioec m  eenu u- nol b m=&o eliets ao =e raersly rif  rc&ssp eaeteen se llr l vc o&roi eet e-e ialsl dsssenr-cffso&- clafsebnnnu&o&ld&&s l&t;spe &e&n g=cciobod& re broen b o&  geposc efi&lu& lcercudrondllailo&na&dnienhi it en h & f&k& e lo&&p  n t ilng,itptoe& &l &opc-pi   mr&& l-=o&l &eetnsc& rdhe&ctn&e air std lciedeimm=ap&&c&ttoyi&c&a;&  e aa aa&s&oelaabueaconksts&    e&glll r& orrhad    ecn etant&c &   te& nc t& m  ugoleetcic&&eadtryr&hl eelairfd &prnldsiectl&sar fnup c&ie a c&in

The validation line was

'ml]  === The Octave Harmonica ===  Octave harmonicas have two reeds per hole.  The two reeds are tuned to the same note a perfect octave apart.  Many share their basic design with the tremolo harmonica explained above and are built upon this "Weiner system" of construction.  Octave harmonicas also come in what is called the "Knittlinger system".  In this design the top and bottom reed-plates contain all of the blow and draw notes for either to lower or higher pitched set of reeds.  The comb is constructed so that the blow and draw reeds on each reed-plate are paired side-by-side in a single chamber in the same manner as on a standard diatonic but that the top and bottom pairs each have their own chamber.  Thus, in a C harmonica the higher pitched C blow and D draw found in the first "hole" would be placed side-by-side on the upper reed-plate and share a single chamber in the comb and the lower pitched C blow and D draw would be placed side-by-side on the bottom reed-plate and sha'

JohnnyOpcode commented 7 months ago

Could we add proper checkpointing to the training loop in train.py?

I've tried torch.save({}), but the model can't be opened with Netron for validation. I'm obviously missing something.
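
For reference, a minimal sketch of the usual PyTorch checkpoint pattern: save the model and optimizer `state_dict()` together with a step counter, then restore them later. The helper names `save_checkpoint` and `load_checkpoint` are hypothetical and are not part of train.py, and the `nn.Linear` model is only a stand-in for the BitNet model built there.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, step):
    """Write model and optimizer state plus the step counter to `path`."""
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore both states in place and return the saved step counter."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]


# Stand-in model (assumption); in train.py this would be the BitNet model.
model = nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
save_checkpoint(path, model, optimizer, step=100)

# Restore into freshly constructed objects, as a resumed run would.
restored = nn.Linear(8, 8)
restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
step = load_checkpoint(path, restored, restored_opt)
```

A file written this way holds tensors keyed by parameter name, not a model graph. Netron is aimed mainly at exported graph formats such as ONNX, so that may be why such a file does not open as a model there, though this is a guess about the problem described above.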

Dayun0925 commented 2 weeks ago

I trained the model using the parameters provided in the code; however, the loss seemed to remain at 5.4 or 5.3, and the model produced gibberish.
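
One way to read the loss values reported in this thread is to compare them with the loss of a model that guesses uniformly at random. The sketch below assumes that train.py uses byte-level tokens (256 symbols) and reports cross-entropy in nats; neither assumption is confirmed in this thread. Under those assumptions a uniform guess scores ln 256 ≈ 5.545, so a loss stuck around 5.3 to 5.4 would suggest the model has learned very little, while 2.5 to 2.7 is well below that baseline yet can still correspond to text that is not readable.

```python
import math

# Assumption: byte-level vocabulary, as is common for character-level
# training scripts. Adjust if the tokenizer differs.
VOCAB_SIZE = 256


def nats_to_bits(loss_nats):
    """Convert a cross-entropy loss from nats to bits per token."""
    return loss_nats / math.log(2)


def perplexity(loss_nats):
    """Effective number of equally likely choices per token."""
    return math.exp(loss_nats)


# Loss of a model that assigns equal probability to every symbol.
uniform_loss = math.log(VOCAB_SIZE)

# Loss values quoted in this thread.
reported = {
    "5300 iterations (CPU)": 2.7,
    "full run, validation": 2.5802037715911865,
    "latest report": 5.4,
}

print(f"uniform baseline: {uniform_loss:.3f} nats")
for label, loss in reported.items():
    print(
        f"{label}: {loss:.2f} nats, "
        f"{nats_to_bits(loss):.2f} bits/token, "
        f"perplexity {perplexity(loss):.1f}"
    )
```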