mlfoundations / dclm

DataComp for Language Models
MIT License
Using Evaluation Prompts to Inform Data Selection #86

Open arnavmdas opened 1 week ago

arnavmdas commented 1 week ago

In the original DataComp paper (the one for VLMs), some of the filtering heuristics were based on features derived from the datasets used for evaluation. Is this permitted in the filtering track for DCLM? For example, are we allowed to use featurized MMLU prompts in our selection algorithm?

afang-story commented 1 week ago

This is allowed. If you choose to do so, though, we recommend running additional decontamination checks similar to those in the paper.
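For illustration only, here is a minimal sketch of one common kind of decontamination check: measuring word-level n-gram overlap between a selected document and the evaluation prompts. This is an assumption about what such a check could look like, not the procedure from the paper. The function names (`ngrams`, `build_eval_index`, `overlap_fraction`, `is_contaminated`), the default window of 13 words, and the threshold are all hypothetical choices.

```python
import re
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Return the set of lowercase word-level n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_prompts: Iterable[str], n: int = 13) -> Set[NGram]:
    """Collect every n-gram that appears in any evaluation prompt."""
    index: Set[NGram] = set()
    for prompt in eval_prompts:
        index |= ngrams(prompt, n)
    return index


def overlap_fraction(document: str, eval_index: Set[NGram], n: int = 13) -> float:
    """Fraction of the document's n-grams that also occur in the eval index."""
    doc_ngrams = ngrams(document, n)
    if not doc_ngrams:
        return 0.0
    return len(doc_ngrams & eval_index) / len(doc_ngrams)


def is_contaminated(
    document: str,
    eval_index: Set[NGram],
    n: int = 13,
    threshold: float = 0.0,  # hypothetical: flag on any shared n-gram
) -> bool:
    """Flag a document whose overlap with the eval prompts exceeds `threshold`."""
    return overlap_fraction(document, eval_index, n) > threshold


if __name__ == "__main__":
    # Toy data with a short window (n=3) so the overlap is visible.
    prompts = ["What is the capital of France? Answer: Paris"]
    index = build_eval_index(prompts, n=3)
    docs = [
        "Quiz time: what is the capital of France? I think so.",
        "The weather in Berlin was sunny all week.",
    ]
    for doc in docs:
        print(is_contaminated(doc, index, n=3), doc)
```

Running a check like this on the documents your selection algorithm keeps, and comparing the flagged fraction against a baseline that does not use evaluation prompts, gives a rough signal of whether the selection is pulling in test content.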