JieDengsc closed this issue 12 months ago
Thanks for your interest in our work!
The direct answer to your Q1 is yes. We found that the best way to train a pre-experienced model is to take diversity into account, so we compute embeddings for all the data and select samples by diversity. However:
1. If your base model is already very powerful, you can skip the pre-experienced model and run the cherry_analysis directly on the base model.
2. You can also randomly choose some data for training the pre-experienced model. It is not as good as considering diversity, but it still works.
3. You can also use other quick methods to account for diversity, for example Sentence-BERT + K-means.
For the second question, I don't know which base model and SFT data you are using, so I cannot give a definite answer. In most situations, though, I think you don't need to modify it.
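Option 3 above (sentence embeddings + K-means) could be sketched roughly as follows. This is not the repository's clustering script, only a minimal, dependency-free illustration: the function names (`farthest_first_init`, `kmeans`, `select_diverse`) are hypothetical, the toy 2-D vectors stand in for real sentence embeddings (which you would obtain from a sentence encoder such as Sentence-BERT), and the deterministic farthest-first initialization is a simplification chosen so the example is reproducible.

```python
import math


def farthest_first_init(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means on lists of floats; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def select_diverse(embeddings, k, per_cluster=1):
    """Pick the `per_cluster` samples closest to each cluster centre, so
    the selected subset covers every region of the embedding space."""
    centroids, labels = kmeans(embeddings, k)
    picked = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(embeddings[i], centroids[c]))
        picked.extend(members[:per_cluster])
    return sorted(picked)


# Toy "embeddings": three well-separated groups of three points each.
toy = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],
]
picked = select_diverse(toy, k=3)
print(picked)  # indices of the samples to use for the pre-experienced model
```

On real data, `embeddings` would be one vector per SFT sample, and `k` and `per_cluster` together set the size of the small subset used for the pre-experienced model. For large datasets a library implementation of K-means would be far faster than this pure-Python loop.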
Thank you for your reply!
Because the earlier text describes "Learning from Brief Experience" as selecting a small amount of data, I'm not sure it is right to put all the data in for training. In addition, training on the full data takes a long time.
I'll try it. Thank you.
Ah, I am not sure if there is still a misunderstanding.
The pre-experienced model indeed only needs a small amount of data. The "pre_experience_analysis.sh" script you were asking about does not "put all the data into it for training"; it only tries to select a suitable small subset of the data for training the pre-experienced model.
Thank you for your reply.
Maybe I didn't ask the question accurately. The "pre_experience_analysis.sh" script does not perform training. It embeds all the SFT data (that is, "get_perplexity_and_embedding_whole_text"), and then the "pre_experience_selection.sh" script is used to perform clustering.
Is my understanding correct?
Thank you again
Yes, I think you are correct~
Thank you for sharing
I'm trying to train models using my Chinese SFT data. I have some questions as follows:
1) My first step is to run "pre_experience_analysis.sh", but it seems to process all of my JSON data. Is that expected? It takes a long time. The "start_idx" and "end_idx" of "data_analysis.py" are not set in your code.
2) Do I need to modify the code for my own Chinese SFT data, or can I just use it as is?