Does this project support Chinese dataset selection?

tianyi-lab / Superfiltering

[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning


Does this project support Chinese dataset selection? #5

Closed: lihongxiacream closed this issue 3 months ago

lihongxiacream commented 3 months ago

Can I use a different model, such as a Chinese GPT2 model, and will the performance be affected?

lihongxiacream commented 3 months ago

The IFD scores from the Chinese GPT2 model look strange: there are many zeros in the result. [Screenshot: 20240730-103639]

MingLiiii commented 3 months ago

I think it is probably caused by my exception handling. You can delete the

try:
except:

wrapper and then see what errors you actually get.
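To illustrate why a broad exception handler can produce a result full of zeros, here is a minimal sketch. It assumes a hypothetical `score_sample` function and a wrapper that falls back to 0 on any error; neither name comes from the repository, and the real scoring code may differ.

```python
def score_sample(text: str) -> float:
    """Hypothetical scorer: raises on empty input, standing in for a real bug."""
    if not text:
        raise ValueError("empty input")
    return float(len(text))


def score_with_fallback(texts):
    """Swallows every error and records 0.0, so failures look like real scores."""
    scores = []
    for text in texts:
        try:
            scores.append(score_sample(text))
        except Exception:
            scores.append(0.0)
    return scores


def score_strict(texts):
    """No handler: the first failure surfaces with its own traceback."""
    return [score_sample(text) for text in texts]
```

With the fallback, a failing sample is indistinguishable from a sample that genuinely scored zero. Removing the handler, as suggested above, makes the underlying error visible.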

lihongxiacream commented 3 months ago

Thanks!! I solved this problem and have another question. gpt2_chinese filters out many samples whose questions and answers are not in the same language. I think this kind of data does make the IFD score high, because the pre-trained model favors outputs in the same language as the question. gpt2 does not have this issue, but it is only good at generating English. [Screenshot: 20240730-141621]
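For reference, a minimal sketch of how the question can shift the score. It assumes the Instruction-Following Difficulty (IFD) score is the ratio of the answer's perplexity given the question to the answer's perplexity on its own, as the Superfiltering paper describes. The function names and the per-token log-probabilities below are made up for illustration; the repository computes these values from a language model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the answer."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ifd_score(cond_log_probs, direct_log_probs):
    """Assumed IFD: PPL(answer | question) / PPL(answer)."""
    return perplexity(cond_log_probs) / perplexity(direct_log_probs)


# Illustrative numbers only: the question makes the answer easier to predict.
helpful = ifd_score([-1.0, -1.0], [-2.0, -2.0])

# Illustrative numbers only: the question gives no help, e.g. because the
# model cannot relate a question in one language to an answer in another.
unhelpful = ifd_score([-2.0, -2.0], [-2.0, -2.0])
```

Under this assumption, a pair where the question does not help the model predict the answer ends up with a ratio near or above 1, which is consistent with the observation that mixed-language pairs score high.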

lihongxiacream commented 3 months ago

The training data selected by GPT2-Chinese performs worse than the data selected by GPT2.

MingLiiii commented 3 months ago

I guess that is because GPT2-Chinese is a weaker model than GPT2 to begin with.