tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

How to filter code SFT data? #10

Closed wyjksyjs closed 10 months ago

wyjksyjs commented 10 months ago

Very impressive work. I would like to ask a question. The paper says that IFD is ineffective for code SFT data. Is there an improved method specifically for screening code SFT data? Thanks

MingLiiii commented 10 months ago

Thank you very much for your interest! In the paper, we claim that IFD is ineffective for code data because very little code-related data has a relatively high IFD score. So if you specifically want code-related data, you can increase the amount of code data chosen, for example, by calculating the IFD scores directly on the code cluster to guarantee the number of code samples selected.
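For readers who want to try this route, here is a minimal sketch of scoring a code-only cluster by IFD with a Hugging Face causal LM, assuming the ratio-of-losses form of the score (loss of the response conditioned on the instruction divided by loss of the response alone). The model name, data fields, and helper functions are illustrative assumptions, not the repository's actual scripts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base (or pre-experienced) model; swap in whichever checkpoint you use.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Average cross-entropy over the answer tokens, optionally conditioned on a prompt."""
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if prompt:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # mask prompt tokens out of the loss
    else:
        input_ids = answer_ids
        labels = input_ids.clone()
    out = model(input_ids.to(model.device), labels=labels.to(model.device))
    return out.loss.item()

def ifd_score(instruction: str, response: str) -> float:
    """IFD = loss of the response given the instruction / loss of the response alone."""
    return answer_loss(instruction, response) / answer_loss("", response)

# Score only the code cluster, then keep the highest-IFD examples for SFT.
code_subset = [
    {"instruction": "Write a Python function that reverses a string.",
     "output": "def reverse(s):\n    return s[::-1]"},
]
ranked = sorted(
    code_subset,
    key=lambda ex: ifd_score(ex["instruction"], ex["output"]),
    reverse=True,
)
```

Because the ranking is done within the code cluster only, the selected budget for code data is fixed up front rather than competing with non-code samples whose IFD scores tend to be higher.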

wyjksyjs commented 10 months ago

> Thank you very much for your interest! In the paper, we claim that IFD is ineffective for code data because very little code-related data has a relatively high IFD score. So if you specifically want code-related data, you can increase the amount of code data chosen, for example, by calculating the IFD scores directly on the code cluster to guarantee the number of code samples selected.

Thank you for your reply. Following your suggestion, my understanding is to train an initial model on a subset of the code data separately, and then evaluate the IFD scores of the full code data.