tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

Could the Pre-Experienced Model be used on other datasets? #12

Closed CNXDZS closed 12 months ago

CNXDZS commented 12 months ago

Hi authors, this project is great! I have a question and would appreciate your help. Can a Pre-Experienced Model (stage 3) that was fine-tuned on one dataset be used to filter other datasets? For example, I used the pre-experienced samples selected in stage 2 from alpaca_data to fine-tune my pretrained model, obtained a Pre-Experienced Model, and then used that model to select cherry data from alpaca_data. Could I use the same Pre-Experienced Model to select cherry data from other datasets (such as firefly)? Or do I have to select pre-experienced samples from the other dataset (such as firefly) and fine-tune my pretrained model on them to obtain a new Pre-Experienced Model? My English is poor, so I'm not sure whether my description is clear. Thanks a lot!

MingLiiii commented 12 months ago

Thanks for your interest. I understand your question. I recommend that the pre-experienced data and the data you select cherry samples from share a similar distribution.

If you really don't want to train a pre-experienced model on the dataset you plan to select from, such as firefly, you can try calculating the IFD scores directly with the base model you are going to fine-tune. It might work just fine.
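
As a rough illustration of that suggestion (a sketch, not this repository's actual scripts): the IFD score of a sample can be taken as the ratio between the average per-token loss of the answer given the instruction and the average per-token loss of the answer alone. In the code below, `token_log_probs` is a hypothetical callback that you would back with the base model; `select_cherry`, `keep_ratio`, and the `"instruction"`/`"output"` keys are also assumptions made for the example. Dropping scores of 1 or more and keeping the highest remaining ones is one reasonable selection rule, stated here as an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical callback: given (context, answer), return the log-probability
# the model assigns to each answer token. Back it with your own base model.
TokenLogProbFn = Callable[[str, str], Sequence[float]]


def mean_nll(log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood (cross-entropy) over the answer tokens."""
    if len(log_probs) == 0:
        raise ValueError("answer must contain at least one token")
    return -sum(log_probs) / len(log_probs)


def ifd_score(instruction: str, answer: str, token_log_probs: TokenLogProbFn) -> float:
    """Ratio of the answer loss with the instruction to the answer loss without it."""
    conditioned = mean_nll(token_log_probs(instruction, answer))
    direct = mean_nll(token_log_probs("", answer))
    return conditioned / direct


def select_cherry(
    samples: List[Dict[str, str]],
    token_log_probs: TokenLogProbFn,
    keep_ratio: float = 0.1,  # assumed fraction of the dataset to keep
) -> List[Dict[str, str]]:
    """Keep the highest-IFD samples, skipping those the instruction does not help."""
    scored = []
    for sample in samples:
        score = ifd_score(sample["instruction"], sample["output"], token_log_probs)
        if score < 1.0:  # assumption: a score >= 1 means the instruction adds no useful context
            scored.append((score, sample))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(samples) * keep_ratio))
    return [sample for _, sample in scored[:keep]]
```

To score firefly with the base model, `token_log_probs` would wrap a forward pass of that model and return the log-probabilities of the answer tokens only; the rest of the selection logic stays the same whichever model produces the scores.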