imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data
https://openchat.team
Apache License 2.0
5.23k stars, 399 forks

about data #206

Open: Luoqiu76 opened 6 months ago

Luoqiu76 commented 6 months ago

Could you explain how the sharegpt_clean.json file was converted into openchat_v3.2_super.train.parquet? I noticed that the two differ considerably: some conversations were truncated for being too long, and some garbled entries were discarded. However, many entries in sharegpt_clean.json have a Model field that is not marked as GPT3.5 or GPT4. How was it decided whether those entries belong to GPT3.5 or GPT4? Or were they all treated as GPT3.5?