hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
Apache License 2.0

why all multi-turn data? #30

Open lulia0228 opened 1 week ago

lulia0228 commented 1 week ago

Hello, this work is very detailed and excellent. I have a question: why does the 10k dataset consist entirely of multi-turn data? Why not include single-turn dialogue data?

VPeterV commented 1 week ago

Hi, thanks for your interest! The majority of the data in our original SOTA data pool, which primarily consists of datasets like ShareGPT, is multi-turn conversations, so multi-turn data are more likely to be selected. However, as noted in our paper, the redundant pool, which includes datasets like Alpaca, contains many single-turn examples, so the data selected from that pool does not contain many multi-turn conversations.
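
If you want to check this skew in a pool yourself, here is a minimal sketch that counts single- vs. multi-turn conversations. It assumes the common ShareGPT record layout (a `conversations` list of `{"from", "value"}` messages); the file name and field names are placeholders, not Deita's exact schema.

```python
import json
from collections import Counter

# "sharegpt_pool.json" is a hypothetical file name; the record layout
# (a "conversations" list of {"from", "value"} messages) follows the
# common ShareGPT convention and may differ from your pool's schema.
with open("sharegpt_pool.json") as f:
    pool = json.load(f)

def num_user_turns(record):
    """Count user ("human") messages in one conversation record."""
    return sum(1 for msg in record["conversations"] if msg["from"] == "human")

counts = Counter(
    "multi-turn" if num_user_turns(r) > 1 else "single-turn" for r in pool
)
print(counts)  # illustrative output: Counter({'multi-turn': ..., 'single-turn': ...})
```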