gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0
1.46k stars 146 forks

Idea about better cleaning #18

Closed xor2003 closed 1 year ago

xor2003 commented 1 year ago
  1. Cleaned entries should probably be moved from one file to another so they don't need to be checked again and again.
  2. For another model, people set up a Telegram bot so that users could read a random question/answer pair and choose a button:
    • Everything correct
    • Wrong question
    • Wrong answer
    • Skip it

It may make sense to require duplicate confirmations...

gururise commented 1 year ago
  1. Cleaned entries should probably be moved from one file to another so they don't need to be checked again and again.
  2. For another model, people set up a Telegram bot so that users could read a random question/answer pair and choose a button:
    • Everything correct
    • Wrong question
    • Wrong answer
    • Skip it

It may make sense to require duplicate confirmations...

Not a bad idea. Maybe a web interface where people could view the dataset and suggest changes.
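
For illustration, the review flow discussed above (one vote per reviewer, duplicate confirmations, confirmed entries moved to a separate file) could look roughly like the sketch below. Nothing here comes from the repository: the vote labels, the `votes` mapping, the `min_confirmations` threshold, and both function names are assumptions made up for this example.

```python
import json
from collections import Counter

# Vote labels mirror the four buttons proposed in the thread (names are hypothetical).
CORRECT, WRONG_QUESTION, WRONG_ANSWER, SKIP = (
    "correct", "wrong_question", "wrong_answer", "skip",
)


def split_by_votes(entries, votes, min_confirmations=2):
    """Split entries into (verified, pending).

    entries: list of dataset records (dicts).
    votes: dict mapping an entry index to a list of vote labels
           (a hypothetical format, e.g. collected by a bot or web UI).
    An entry counts as verified only if at least `min_confirmations`
    reviewers voted CORRECT and nobody flagged the question or answer.
    """
    verified, pending = [], []
    for idx, entry in enumerate(entries):
        counts = Counter(votes.get(idx, []))
        flagged = counts[WRONG_QUESTION] + counts[WRONG_ANSWER] > 0
        if counts[CORRECT] >= min_confirmations and not flagged:
            verified.append(entry)
        else:
            pending.append(entry)
    return verified, pending


def write_split(entries, votes, verified_path, pending_path, min_confirmations=2):
    """Write verified and pending entries to two separate JSON files."""
    verified, pending = split_by_votes(entries, votes, min_confirmations)
    for path, data in ((verified_path, verified), (pending_path, pending)):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
    return len(verified), len(pending)
```

Entries that land in the verified file would not need to be shown to reviewers again; "Skip it" votes neither confirm nor flag an entry, so skipped entries simply stay pending.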

xor2003 commented 1 year ago
  • CI CD check for correct json
gururise commented 1 year ago
  • CI/CD check for correct JSON

The CI/CD check for valid JSON is done.
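
The thread does not show how the check was implemented. As a rough idea of what a CI step for valid JSON might run, here is a minimal sketch; the function names and the command-line usage are assumptions, and the per-record key check (`instruction`, `input`, `output`, the standard Alpaca fields) goes beyond the plain "valid JSON" check mentioned above.

```python
import json
import sys

# Standard Alpaca record fields; checking them is an extra assumed for this sketch.
REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_dataset_text(text):
    """Return a list of error messages; an empty list means the data passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be a list of records"]
    errors = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            errors.append(f"record {i}: not an object")
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            errors.append(f"record {i}: missing keys {sorted(missing)}")
    return errors


def main(path):
    """Validate one file and return a process exit code (0 = ok, 1 = errors)."""
    with open(path, encoding="utf-8") as f:
        errors = validate_dataset_text(f.read())
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # The dataset path is supplied by the CI job; no filename is assumed here.
    sys.exit(main(sys.argv[1]))
```

A non-zero exit code makes the CI job fail, so a malformed file would be caught before it is merged.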