Closed: wyjksyjs closed this issue 10 months ago
Thank you very much for your interest! In the paper, we claim that IFD is ineffective for code data because very few code-related samples have a relatively high IFD score, so few of them end up being selected. If you specifically want code-related data, you can increase the number of code samples chosen. For example, you can calculate the IFD scores directly on the code cluster and select from it, which guarantees a certain number of code samples.
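To make the suggestion above concrete, here is a minimal sketch in Python of selecting samples by IFD score within the code cluster only, rather than across the whole dataset. It assumes the two per-sample losses (the answer loss conditioned on the instruction, and the answer loss alone) have already been computed with some initial model, and that IFD is their ratio. The `Sample` class, the `cluster` label, and the `select_from_cluster` helper are hypothetical names for illustration, not part of the released code, and dropping samples with a score of 1 or more is an assumption about the filtering step.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    answer: str
    cluster: str             # hypothetical cluster label, e.g. "code"
    loss_conditioned: float  # avg. answer loss given the instruction (precomputed)
    loss_direct: float       # avg. answer loss without the instruction (precomputed)


def ifd_score(sample: Sample) -> float:
    """IFD as the ratio of conditioned answer loss to direct answer loss."""
    if sample.loss_direct <= 0:
        raise ValueError("loss_direct must be positive")
    return sample.loss_conditioned / sample.loss_direct


def select_from_cluster(samples: list[Sample], cluster: str, k: int) -> list[Sample]:
    """Rank samples inside one cluster by IFD and keep the top k."""
    scored = [(ifd_score(s), s) for s in samples if s.cluster == cluster]
    # Assumption: a score >= 1 means the instruction did not help, so drop it.
    valid = [pair for pair in scored if pair[0] < 1.0]
    valid.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in valid[:k]]


# Toy data with made-up loss values, for illustration only.
samples = [
    Sample("Write a sort function", "def sort(xs): return sorted(xs)", "code", 0.60, 1.00),
    Sample("Reverse a string", "s[::-1]", "code", 0.90, 1.00),
    Sample("Fix this bug", "use == instead of =", "code", 1.20, 1.00),
    Sample("Summarize the text", "A short summary.", "general", 0.95, 1.00),
]

selected = select_from_cluster(samples, "code", k=1)
print([s.instruction for s in selected])  # ['Reverse a string']
```

Because the ranking happens inside the code cluster, the "general" sample with a higher score cannot crowd out the code samples, and `k` directly controls how many code samples are kept.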
Thank you for your reply. Following your suggestion, my understanding is that I should train an initial model on a subset of the code data separately, and then use it to compute the IFD scores of the full code data.
Very impressive work. I would like to ask a question. The paper says that IFD is ineffective for code SFT data. Is there an improved method specifically for filtering code SFT data? Thanks.