hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
Apache License 2.0

Could you please publish the original data pool? #27

Closed: ShadowTinker closed this issue 6 months ago

ShadowTinker commented 6 months ago

Hi. First of all, thank you for your work and the great repo!

As stated in the title, could you please provide the original data pool used in your paper, especially $X_{sota}$? I tried to obtain the data by following the references in the paper, but I cannot find versions of the ShareGPT and UltraChat datasets on Hugging Face that match the statistics reported in the paper. I would greatly appreciate it if you could either provide the dataset or explain how to derive the two subsets from the existing Hugging Face datasets.

Best regards

VPeterV commented 6 months ago

Hi. Thanks for your interest! We have released the original SOTA data pool at: https://huggingface.co/datasets/AndrewZeng/deita_sota_pool
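
For readers who want to compare the released pool against the statistics in the paper, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that the pool has a `train` split. The column names `conversations` and `source` are hypothetical placeholders, not confirmed by this thread, so check the dataset card and adjust them.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_pool(repo_id: str = "AndrewZeng/deita_sota_pool", split: str = "train"):
    """Download the pool from the Hugging Face Hub.

    Requires `pip install datasets`. The split name "train" is an
    assumption; check the dataset card for the real split names.
    """
    # Imported lazily so pool_stats() below can be used without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def pool_stats(
    records: Iterable[Mapping],
    turns_key: str = "conversations",  # hypothetical column name
    source_key: str = "source",        # hypothetical column name
) -> dict:
    """Return sample count, average messages per sample, and per-source counts."""
    num_samples = 0
    total_messages = 0
    per_source = Counter()
    for record in records:
        num_samples += 1
        # A missing or empty conversation field counts as zero messages.
        total_messages += len(record.get(turns_key) or [])
        per_source[record.get(source_key, "unknown")] += 1
    avg = total_messages / num_samples if num_samples else 0.0
    return {
        "num_samples": num_samples,
        "avg_messages": avg,
        "per_source": dict(per_source),
    }
```

With network access, `pool_stats(load_pool())` would summarize the full pool. If the real columns differ from the assumed ones, pass the correct names through `turns_key` and `source_key`.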

ShadowTinker commented 6 months ago

Thanks a ton for your help with the dataset issue I raised! I greatly appreciate the time you took to address my problem. Your work on the repository is amazing, and you're clearly committed to helping the community.