tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

Questions related to training #11

Closed JieDengsc closed 12 months ago

JieDengsc commented 12 months ago

Thank you for sharing!

I'm trying to train models on my own Chinese SFT data, and I have a few questions:

1) My first step is to run "pre_experience_analysis.sh", but it seems to process all of my JSON data, which takes a long time. Is that expected? The "start_idx" and "end_idx" arguments of "data_analysis.py" are not set in your code.

2) Do I need to modify the code for my own Chinese SFT data, or can I use it as-is?

MingLiiii commented 12 months ago

Thanks for your interest in our work!

The direct answer to your Q1 is YES. We found that the best way to train a pre-experienced model is to account for diversity, so we compute embeddings for all the data and select by diversity. However:

1. If your base model is already very powerful, you can skip the pre-experienced model and run the cherry_analysis directly on the base model.
2. You can also randomly choose some data for training the pre-experienced model. Though not as good as selecting for diversity, it still works.
3. You can also use other quick methods to account for diversity, e.g., Sentence-BERT embeddings + k-means clustering.
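For option 3, here is a minimal sketch of diversity selection with k-means, assuming the embeddings (e.g., from Sentence-BERT) are already computed; this is an illustration in plain NumPy, not the repo's code:

```python
import numpy as np

def kmeans_select(embeddings, k, iters=20, seed=0):
    """Plain k-means on the embeddings, then keep the sample nearest
    each final centroid -- one representative per cluster, for diversity."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(embeddings[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its cluster
        for c in range(k):
            pts = embeddings[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    # pick the index of the sample closest to each final centroid
    d = np.linalg.norm(embeddings[:, None] - centers[None], axis=2)
    return sorted(set(int(i) for i in d.argmin(axis=0)))

# stand-in random embeddings; in practice these would be Sentence-BERT vectors
rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 16))
subset = kmeans_select(emb, k=5)
```

The selected indices then point at the small, diverse subset used to train the pre-experienced model.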

As for the second question, I don't know which base model and SFT data you are using, so I can't give a definite answer. But in most situations, you shouldn't need to modify the code.

JieDengsc commented 12 months ago

Thank you for your reply!

I ask because the paper describes "Learning from Brief Experience" as selecting a small amount of data, so I'm not sure it's right to put all the data into training. In addition, training on the full data takes a long time.

I'll try it. Thank you.

MingLiiii commented 12 months ago

Ah, I'm not sure if there is still a misunderstanding.

The pre-experienced model indeed only needs a small amount of data. The "pre_experience_analysis.sh" script you were asking about does not "put all the data into training"; it only selects a suitable small subset of the data for training the pre-experienced model.

JieDengsc commented 12 months ago

Thank you for your reply.

Maybe I didn't phrase my question accurately. The "pre_experience_analysis.sh" script does not perform training itself: it embeds all the SFT data (via "get_perplexity_and_embedding_whole_text"), and then the "pre_experience_selection.sh" script performs the clustering.

Is my understanding correct?

Thank you again
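As an aside, the perplexity that "get_perplexity_and_embedding_whole_text" refers to is just the exponential of the mean negative log-likelihood over the tokens. A toy illustration with made-up per-token log-probabilities (not the repo's implementation, which gets these from the model's logits):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood over the tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# toy per-token log-probabilities; in practice they come from the LLM
lp = [-0.5, -1.2, -0.3, -0.9]
ppl = perplexity(lp)  # exp(0.725)
```

Lower perplexity means the model finds the text easier to predict, which is what the difficulty score builds on.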

MingLiiii commented 12 months ago

Yes, I think you are correct~