tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

Chinese SFT data cannot be displayed. #14

Closed: JieDengsc closed this issue 11 months ago

JieDengsc commented 11 months ago

I'm running the code on Chinese SFT data. After executing "pre_experience_selection.sh", I get the "alpaca_data_pre.json" file, but every Chinese character in it has been replaced with a \uxxxx escape sequence. As a result, the "Train Pre-Experienced Model" step cannot be run.

Could you check whether "data_by_cluster" and "data_analysis" support Chinese?

Thank you.

MingLiiii commented 11 months ago

We only read the JSON and save the given samples back to JSON; no new sentences are generated in the process. So I think you should check the saving command.
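For reference, a minimal sketch of how this kind of saving problem typically arises in Python, assuming the file is written with the standard `json.dump` (the script's actual saving code is not shown in this thread). By default `json.dump` uses `ensure_ascii=True`, which writes non-ASCII characters as `\uXXXX` escapes; passing `ensure_ascii=False` and opening the file as UTF-8 keeps the Chinese text readable. The sample record, the `save_json` and `read_text` helpers, and the file names below are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample in Alpaca-style format (not taken from the repository).
samples = [
    {
        "instruction": "把下面的句子翻译成英文。",
        "input": "你好,世界",
        "output": "Hello, world",
    }
]


def save_json(data, path, ensure_ascii):
    """Write data as JSON; ensure_ascii controls \\uXXXX escaping."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=ensure_ascii, indent=4)


def read_text(path):
    """Return the raw text of a UTF-8 file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


out_dir = tempfile.mkdtemp()
escaped_path = os.path.join(out_dir, "escaped.json")    # hypothetical name
readable_path = os.path.join(out_dir, "readable.json")  # hypothetical name

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
save_json(samples, escaped_path, ensure_ascii=True)

# With ensure_ascii=False the Chinese characters are written as-is.
save_json(samples, readable_path, ensure_ascii=False)

escaped_text = read_text(escaped_path)
readable_text = read_text(readable_path)

# Both files decode to the same Python objects; only the on-disk text differs.
same_content = json.loads(escaped_text) == json.loads(readable_text)
```

Note that the escaped file is still valid JSON and loads back to the same strings, so the escapes only matter when the file is inspected as raw text or read by something other than a JSON parser.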

JieDengsc commented 11 months ago

Thanks for your reply. It was indeed a saving problem.

Also, can I refer to your sh file configuration for “Train Pre-Experienced Model”?

MingLiiii commented 11 months ago

Sure. I think training on 1000 samples for 1 epoch should be a good starting configuration~