We used the RedPajama dataset directly and did not filter any data ourselves.
Are there any statistics on the multi-language distribution?
Does the vocabulary contain tokens for most languages, for instance Chinese characters?
We have not computed such statistics ourselves, but you can download the RedPajama dataset directly and check for yourself.
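If you just want a rough picture of the language mix, a small sampling script like the sketch below is one way to check. This is only an illustration: the dataset id "togethercomputer/RedPajama-Data-1T-Sample", the "text" field name, and the use of the langdetect package are assumptions on my part, not anything the release documents.

```python
# Rough sketch (assumptions flagged below) for estimating the language mix
# of a small RedPajama sample; this is not something the release includes.
from collections import Counter
from datasets import load_dataset   # pip install datasets
from langdetect import detect       # pip install langdetect

# "togethercomputer/RedPajama-Data-1T-Sample" and the "text" field are
# assumptions about the HF mirror; adjust to whatever copy you download.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                  split="train", streaming=True)

counts = Counter()
for i, row in enumerate(ds):
    if i >= 10_000:                      # small sample, just for a rough picture
        break
    try:
        counts[detect(row["text"][:1000])] += 1   # detect on a prefix for speed
    except Exception:                    # langdetect raises on empty/odd text
        counts["unknown"] += 1

total = sum(counts.values())
for lang, n in counts.most_common(15):
    print(f"{lang}: {n / total:.2%}")
```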
I am closing the issue; I may reopen it when we get a checkpoint at 1T. For now (at 400K), when I fine-tune with languages EN+DE+FR I am getting a higher loss compared to the original LLaMA. The difference is not very big, but it is still enough to make a point.
openllama7B
[2023-05-17 15:45:03,597 INFO] Step 10/10000; acc: 48.8; ppl: 12.2; xent: 2.5; lr: 0.00002; sents: 1133; bsz: 671/ 671/ 4; 1549/1549 tok/s; 139 sec;
[2023-05-17 15:46:30,200 INFO] Step 20/10000; acc: 50.9; ppl: 10.6; xent: 2.4; lr: 0.00002; sents: 870; bsz: 658/ 658/ 3; 2431/2431 tok/s; 225 sec;
[2023-05-17 15:47:55,939 INFO] Step 30/10000; acc: 50.3; ppl: 10.9; xent: 2.4; lr: 0.00002; sents: 888; bsz: 643/ 643/ 3; 2399/2399 tok/s; 311 sec;
[2023-05-17 15:49:21,891 INFO] Step 40/10000; acc: 50.7; ppl: 10.7; xent: 2.4; lr: 0.00002; sents: 896; bsz: 648/ 648/ 3; 2411/2411 tok/s; 397 sec;
[2023-05-17 15:50:52,345 INFO] Step 50/10000; acc: 50.1; ppl: 11.0; xent: 2.4; lr: 0.00002; sents: 908; bsz: 663/ 663/ 3; 2347/2347 tok/s; 487 sec;
[2023-05-17 15:52:21,651 INFO] Step 60/10000; acc: 51.1; ppl: 10.5; xent: 2.4; lr: 0.00002; sents: 883; bsz: 664/ 664/ 3; 2379/2379 tok/s; 577 sec;
[2023-05-17 15:53:51,786 INFO] Step 70/10000; acc: 51.3; ppl: 10.3; xent: 2.3; lr: 0.00002; sents: 881; bsz: 666/ 666/ 3; 2365/2365 tok/s; 667 sec;
[2023-05-17 15:55:21,288 INFO] Step 80/10000; acc: 50.7; ppl: 10.8; xent: 2.4; lr: 0.00002; sents: 1042; bsz: 661/ 661/ 3; 2365/2365 tok/s; 756 sec;
[2023-05-17 15:56:51,380 INFO] Step 90/10000; acc: 51.4; ppl: 10.1; xent: 2.3; lr: 0.00002; sents: 913; bsz: 649/ 649/ 3; 2306/2306 tok/s; 846 sec;
[2023-05-17 15:58:25,369 INFO] Step 100/10000; acc: 51.2; ppl: 10.1; xent: 2.3; lr: 0.00002; sents: 865; bsz: 641/ 641/ 3; 2184/2184 tok/s; 940 sec;
llama7B
[2023-05-17 16:49:28,748 INFO] Step 10/10000; acc: 51.4; ppl: 10.5; xent: 2.4; lr: 0.00002; sents: 1181; bsz: 701/ 701/ 4; 1593/1593 tok/s; 141 sec;
[2023-05-17 16:50:58,589 INFO] Step 20/10000; acc: 52.8; ppl: 9.5; xent: 2.2; lr: 0.00002; sents: 937; bsz: 686/ 686/ 3; 2444/2444 tok/s; 231 sec;
[2023-05-17 16:52:28,744 INFO] Step 30/10000; acc: 53.4; ppl: 9.1; xent: 2.2; lr: 0.00002; sents: 942; bsz: 683/ 683/ 3; 2423/2423 tok/s; 321 sec;
[2023-05-17 16:53:57,111 INFO] Step 40/10000; acc: 53.5; ppl: 9.0; xent: 2.2; lr: 0.00002; sents: 946; bsz: 669/ 669/ 3; 2423/2423 tok/s; 409 sec;
[2023-05-17 16:55:25,692 INFO] Step 50/10000; acc: 53.4; ppl: 8.9; xent: 2.2; lr: 0.00002; sents: 940; bsz: 670/ 670/ 3; 2421/2421 tok/s; 498 sec;
[2023-05-17 16:56:55,035 INFO] Step 60/10000; acc: 53.9; ppl: 8.6; xent: 2.2; lr: 0.00002; sents: 917; bsz: 673/ 673/ 3; 2411/2411 tok/s; 587 sec;
[2023-05-17 16:58:23,985 INFO] Step 70/10000; acc: 53.5; ppl: 8.7; xent: 2.2; lr: 0.00002; sents: 904; bsz: 669/ 669/ 3; 2405/2405 tok/s; 676 sec;
[2023-05-17 16:59:52,961 INFO] Step 80/10000; acc: 53.4; ppl: 9.0; xent: 2.2; lr: 0.00002; sents: 1080; bsz: 680/ 680/ 3; 2444/2444 tok/s; 765 sec;
[2023-05-17 17:01:20,781 INFO] Step 90/10000; acc: 53.4; ppl: 8.8; xent: 2.2; lr: 0.00002; sents: 964; bsz: 662/ 662/ 3; 2413/2413 tok/s; 853 sec;
[2023-05-17 17:02:49,235 INFO] Step 100/10000; acc: 54.2; ppl: 8.3; xent: 2.1; lr: 0.00002; sents: 903; bsz: 660/ 660/ 3; 2389/2389 tok/s; 941 sec;
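For comparing the two runs above, a tiny parser over the OpenNMT-py log lines makes the per-step ppl gap easier to eyeball. This is just a helper I am sketching here; the file names "openllama7b.log" and "llama7b.log" are placeholders for wherever you paste the logs.

```python
# Minimal helper (not part of the thread) to pull ppl per step out of
# OpenNMT-py training logs so the two runs can be compared side by side.
import re

LOG_RE = re.compile(r"Step (\d+)/\d+; acc: [\d.]+; ppl:\s*([\d.]+)")

def ppl_by_step(log_text):
    """Map step number -> perplexity for every matching log line."""
    return {int(step): float(ppl) for step, ppl in LOG_RE.findall(log_text)}

# Placeholder file names; paste each log into its own file.
openllama = ppl_by_step(open("openllama7b.log").read())
llama = ppl_by_step(open("llama7b.log").read())

for step in sorted(openllama):
    print(f"step {step:>4}: openllama ppl {openllama[step]:5.1f}"
          f" vs llama ppl {llama.get(step, float('nan')):5.1f}")
```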
@vince62s Did you use a vocab size of 32k or 40k?
There is only a 32k vocab. BTW, the loss difference above may not come from multilinguality issues, because I am seeing the same (maybe even bigger) gap when fine-tuning Vicuna on OpenLLaMA vs LLaMA.
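One quick way to see whether the 32k vocab actually covers Chinese (or any other script) is to tokenize a short sample and look for byte-fallback pieces. The model ids below ("openlm-research/open_llama_7b", "huggyllama/llama-7b") are my assumptions about suitable checkpoints; substitute whichever tokenizers you are actually comparing.

```python
# Hypothetical check: does the 32k vocab have dedicated pieces for Chinese,
# or does it fall back to byte tokens like "<0xE4>"? More pieces per
# character generally means weaker coverage of that script.
from transformers import AutoTokenizer

SAMPLE = "今天天气很好"  # "The weather is nice today"

# Placeholder model ids; swap in the tokenizers you actually want to compare.
for name in ("openlm-research/open_llama_7b", "huggyllama/llama-7b"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(SAMPLE, add_special_tokens=False)["input_ids"]
    print(name, len(ids), tok.convert_ids_to_tokens(ids))
```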
Hi,
Can you confirm you kept (hence did not filter) the 20 languages of Wikipedia (same as the original LLaMA)?
Thanks.