Multi-round conversation data set

tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models


Multi-round conversation data set #8

Closed. wuQi-666 closed this issue 10 months ago

wuQi-666 commented 10 months ago

Hello, I noticed that the alpaca_data.json dataset used in this project contains only single-round dialogues. Have you considered applying IFD filtering to datasets with multi-round dialogues?

MingLiiii commented 10 months ago

At least in this project, we are not going to explore the multi-round situation.

However, we don't think it would be hard to extend the IFD score to the multi-round setting. Another option is to check out OpenChat. You can see if their method works for you~
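For readers wondering what such an extension might look like, here is a minimal sketch. It is not part of this project. It assumes the IFD score of one answer is the model's average loss on the answer given its context, divided by the average loss on the answer alone, and that in a multi-round dialogue the context of each round is the full preceding history plus the current question. The names `avg_loss`, `turn_ifd`, `multi_round_ifd`, and `toy_loss` are hypothetical, and `toy_loss` is only a stand-in for a real language model's cross-entropy.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: avg_loss(context, answer) returns the mean per-token
# loss of `answer` given `context`; an empty context means "answer alone".
LossFn = Callable[[str, str], float]


def turn_ifd(avg_loss: LossFn, context: str, answer: str) -> float:
    """IFD-style ratio for one round: conditioned loss / unconditioned loss."""
    conditioned = avg_loss(context, answer)
    direct = avg_loss("", answer)
    return conditioned / direct


def multi_round_ifd(avg_loss: LossFn, turns: List[Tuple[str, str]]) -> List[float]:
    """Score every round, conditioning on all earlier rounds (an assumption)."""
    scores = []
    history = ""
    for question, answer in turns:
        context = history + question + "\n"
        scores.append(turn_ifd(avg_loss, context, answer))
        # The answer becomes part of the history for the next round.
        history = context + answer + "\n"
    return scores


def toy_loss(context: str, answer: str) -> float:
    """Stand-in for a real LM loss: lower when answer words appear in context."""
    seen = set(context.lower().split())
    words = answer.lower().split()
    unseen = sum(w not in seen for w in words)
    return 0.5 + unseen / max(len(words), 1)


conversation = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("And of Italy?", "The capital of Italy is Rome."),
]
scores = multi_round_ifd(toy_loss, conversation)
print(scores)
```

To filter a dataset, the per-round scores would still need to be reduced to one value per conversation, for example by taking their mean. That choice, like the way history is concatenated, is an assumption of this sketch.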

wuQi-666 commented 10 months ago

Could you share the project link for this method? Thanks.

MingLiiii commented 10 months ago

https://github.com/imoneoi/openchat