Open lulia0228 opened 1 week ago
Hi, thanks for your interest! Most of the data in our original SOTA data pool, which primarily includes datasets like ShareGPT, consists of multi-turn conversations, so multi-turn data are more likely to be selected. However, as noted in our paper, the data selected from the redundant pool, which includes datasets like Alpaca, contains many single-turn samples, so in that setting the selected data does not contain much multi-turn data.
Hello, the work is very detailed and excellent. I have a question: why does the 10k selected data consist entirely of multi-turn conversations? Why not add single-turn dialogue data?