We used the RedPajama dataset directly and did not filter any data ourselves.
Are there any statistics on the multi-language distribution?
Does the vocabulary contain tokens for most languages, for instance Chinese characters?
We have not computed such statistics ourselves, but you can download the RedPajama dataset directly and check for yourself.
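If you just want a rough picture of the language mix, a small sampling script like the sketch below is one way to check. This is only an illustration: the dataset id "togethercomputer/RedPajama-Data-1T-Sample", the "text" field name, and the use of the langdetect package are assumptions on my part, not anything the release documents.

```python
# Rough sketch (assumptions flagged below) for estimating the language mix
# of a small RedPajama sample; this is not something the release includes.
from collections import Counter
from datasets import load_dataset   # pip install datasets
from langdetect import detect       # pip install langdetect

# "togethercomputer/RedPajama-Data-1T-Sample" and the "text" field are
# assumptions about the HF mirror; adjust to whatever copy you download.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                  split="train", streaming=True)

counts = Counter()
for i, row in enumerate(ds):
    if i >= 10_000:                      # small sample, just for a rough picture
        break
    try:
        counts[detect(row["text"][:1000])] += 1   # detect on a prefix for speed
    except Exception:                    # langdetect raises on empty/odd text
        counts["unknown"] += 1

total = sum(counts.values())
for lang, n in counts.most_common(15):
    print(f"{lang}: {n / total:.2%}")
```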
I am closing the issue; I may reopen it when we get a checkpoint at 1T. For now (at 400K), when I fine-tune with languages EN+DE+FR I am getting a higher loss compared to the original LLaMA. The difference is not very big, but it is still enough to make a point.
openllama7B
[2023-05-17 15:45:03,597 INFO] Step 10/10000; acc: 48.8; ppl: 12.2; xent: 2.5; lr: 0.00002; sents: 1133; bsz: 671/ 671/ 4; 1549/1549 tok/s; 139 sec;
[2023-05-17 15:46:30,200 INFO] Step 20/10000; acc: 50.9; ppl: 10.6; xent: 2.4; lr: 0.00002; sents: 870; bsz: 658/ 658/ 3; 2431/2431 tok/s; 225 sec;
[2023-05-17 15:47:55,939 INFO] Step 30/10000; acc: 50.3; ppl: 10.9; xent: 2.4; lr: 0.00002; sents: 888; bsz: 643/ 643/ 3; 2399/2399 tok/s; 311 sec;
[2023-05-17 15:49:21,891 INFO] Step 40/10000; acc: 50.7; ppl: 10.7; xent: 2.4; lr: 0.00002; sents: 896; bsz: 648/ 648/ 3; 2411/2411 tok/s; 397 sec;
[2023-05-17 15:50:52,345 INFO] Step 50/10000; acc: 50.1; ppl: 11.0; xent: 2.4; lr: 0.00002; sents: 908; bsz: 663/ 663/ 3; 2347/2347 tok/s; 487 sec;
[2023-05-17 15:52:21,651 INFO] Step 60/10000; acc: 51.1; ppl: 10.5; xent: 2.4; lr: 0.00002; sents: 883; bsz: 664/ 664/ 3; 2379/2379 tok/s; 577 sec;
[2023-05-17 15:53:51,786 INFO] Step 70/10000; acc: 51.3; ppl: 10.3; xent: 2.3; lr: 0.00002; sents: 881; bsz: 666/ 666/ 3; 2365/2365 tok/s; 667 sec;
[2023-05-17 15:55:21,288 INFO] Step 80/10000; acc: 50.7; ppl: 10.8; xent: 2.4; lr: 0.00002; sents: 1042; bsz: 661/ 661/ 3; 2365/2365 tok/s; 756 sec;
[2023-05-17 15:56:51,380 INFO] Step 90/10000; acc: 51.4; ppl: 10.1; xent: 2.3; lr: 0.00002; sents: 913; bsz: 649/ 649/ 3; 2306/2306 tok/s; 846 sec;
[2023-05-17 15:58:25,369 INFO] Step 100/10000; acc: 51.2; ppl: 10.1; xent: 2.3; lr: 0.00002; sents: 865; bsz: 641/ 641/ 3; 2184/2184 tok/s; 940 sec;
llama7B
[2023-05-17 16:49:28,748 INFO] Step 10/10000; acc: 51.4; ppl: 10.5; xent: 2.4; lr: 0.00002; sents: 1181; bsz: 701/ 701/ 4; 1593/1593 tok/s; 141 sec;
[2023-05-17 16:50:58,589 INFO] Step 20/10000; acc: 52.8; ppl: 9.5; xent: 2.2; lr: 0.00002; sents: 937; bsz: 686/ 686/ 3; 2444/2444 tok/s; 231 sec;
[2023-05-17 16:52:28,744 INFO] Step 30/10000; acc: 53.4; ppl: 9.1; xent: 2.2; lr: 0.00002; sents: 942; bsz: 683/ 683/ 3; 2423/2423 tok/s; 321 sec;
[2023-05-17 16:53:57,111 INFO] Step 40/10000; acc: 53.5; ppl: 9.0; xent: 2.2; lr: 0.00002; sents: 946; bsz: 669/ 669/ 3; 2423/2423 tok/s; 409 sec;
[2023-05-17 16:55:25,692 INFO] Step 50/10000; acc: 53.4; ppl: 8.9; xent: 2.2; lr: 0.00002; sents: 940; bsz: 670/ 670/ 3; 2421/2421 tok/s; 498 sec;
[2023-05-17 16:56:55,035 INFO] Step 60/10000; acc: 53.9; ppl: 8.6; xent: 2.2; lr: 0.00002; sents: 917; bsz: 673/ 673/ 3; 2411/2411 tok/s; 587 sec;
[2023-05-17 16:58:23,985 INFO] Step 70/10000; acc: 53.5; ppl: 8.7; xent: 2.2; lr: 0.00002; sents: 904; bsz: 669/ 669/ 3; 2405/2405 tok/s; 676 sec;
[2023-05-17 16:59:52,961 INFO] Step 80/10000; acc: 53.4; ppl: 9.0; xent: 2.2; lr: 0.00002; sents: 1080; bsz: 680/ 680/ 3; 2444/2444 tok/s; 765 sec;
[2023-05-17 17:01:20,781 INFO] Step 90/10000; acc: 53.4; ppl: 8.8; xent: 2.2; lr: 0.00002; sents: 964; bsz: 662/ 662/ 3; 2413/2413 tok/s; 853 sec;
[2023-05-17 17:02:49,235 INFO] Step 100/10000; acc: 54.2; ppl: 8.3; xent: 2.1; lr: 0.00002; sents: 903; bsz: 660/ 660/ 3; 2389/2389 tok/s; 941 sec;
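For comparing the two runs above, a tiny parser over the OpenNMT-py log lines makes the per-step ppl gap easier to eyeball. This is just a helper I am sketching here; the file names "openllama7b.log" and "llama7b.log" are placeholders for wherever you paste the logs.

```python
# Minimal helper (not part of the thread) to pull ppl per step out of
# OpenNMT-py training logs so the two runs can be compared side by side.
import re

LOG_RE = re.compile(r"Step (\d+)/\d+; acc: [\d.]+; ppl:\s*([\d.]+)")

def ppl_by_step(log_text):
    """Map step number -> perplexity for every matching log line."""
    return {int(step): float(ppl) for step, ppl in LOG_RE.findall(log_text)}

# Placeholder file names; paste each log into its own file.
openllama = ppl_by_step(open("openllama7b.log").read())
llama = ppl_by_step(open("llama7b.log").read())

for step in sorted(openllama):
    print(f"step {step:>4}: openllama ppl {openllama[step]:5.1f}"
          f" vs llama ppl {llama.get(step, float('nan')):5.1f}")
```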
@vince62s Did you use a vocab size of 32k or 40k?
There is only a 32k vocab. BTW, the loss difference above may not come from multilinguality issues, because I am seeing the same (maybe even bigger) gap when fine-tuning Vicuna on OpenLLaMA vs LLaMA.
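One quick way to see whether the 32k vocab actually covers Chinese (or any other script) is to tokenize a short sample and look for byte-fallback pieces. The model ids below ("openlm-research/open_llama_7b", "huggyllama/llama-7b") are my assumptions about suitable checkpoints; substitute whichever tokenizers you are actually comparing.

```python
# Hypothetical check: does the 32k vocab have dedicated pieces for Chinese,
# or does it fall back to byte tokens like "<0xE4>"? More pieces per
# character generally means weaker coverage of that script.
from transformers import AutoTokenizer

SAMPLE = "今天天气很好"  # "The weather is nice today"

# Placeholder model ids; swap in the tokenizers you actually want to compare.
for name in ("openlm-research/open_llama_7b", "huggyllama/llama-7b"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(SAMPLE, add_special_tokens=False)["input_ids"]
    print(name, len(ids), tok.convert_ids_to_tokens(ids))
```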
Hi,
Can you confirm you kept (hence did not filter) the 20 languages of Wikipedia (same as the original LLaMA)?
Thanks.