gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Contributing to the dataset curation with Argilla and the Alpaca Garbage collector #53

Closed: dvsrepo closed this issue 1 year ago

dvsrepo commented 1 year ago

Hi @gururise, first of all, thanks and congrats on this important effort.

I'm Dani from Argilla. We've spent some time looking at data quality issues in the Alpaca dataset and its translations. We're helping the teams behind the Spanish and German efforts use Argilla to flag bad or problematic instructions so that they can be fixed later (either manually or with post-processing).

Along the way, we've spent some time labeling AlpacaDataCleaned. Its quality is already good, but there are still examples to improve, so we'd like to contribute.

Today we released this model to help teams clean up Alpaca translations, and it can be used to contribute to this repo too: https://huggingface.co/argilla/alpaca-garbage-collector-multilingual

We've also deployed this space for browsing and validating the records. This is what it shows for last night's version of AlpacaDataCleaned (log in with argilla/1234).

We plan to spend some more time labeling and contributing back to this project. My question is whether it would be possible to share a set of flagged records (with positional ids as in the original JSON) with you, to make sure we edit them in the right way. For example, what to do with requests related to attached photos, paintings, and so on.
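To make the idea of sharing flagged records by positional id concrete, here is a minimal sketch in Python. It assumes the dataset is a JSON list of objects with `instruction`, `input`, and `output` keys (the usual Alpaca layout) and that a record's id is simply its index in that list. The keyword list and the `flag_records` helper are hypothetical and only stand in for real labeling in Argilla or the garbage-collector model.

```python
import json

# Hypothetical keyword hints for "attachment" style requests.
# This list is an assumption for illustration, not the actual flagging logic.
ATTACHMENT_HINTS = ("photo", "painting", "image", "picture")


def flag_records(records, hints=ATTACHMENT_HINTS):
    """Return the positional id and instruction of each record that mentions a hint."""
    flagged = []
    for idx, rec in enumerate(records):
        # Search the instruction and input fields, case-insensitively.
        text = f"{rec.get('instruction', '')} {rec.get('input', '')}".lower()
        if any(hint in text for hint in hints):
            flagged.append({"id": idx, "instruction": rec.get("instruction", "")})
    return flagged


# Small in-memory sample standing in for the real JSON file.
sample = [
    {"instruction": "Give three tips for staying healthy.", "input": "", "output": "..."},
    {"instruction": "Describe the painting attached below.", "input": "", "output": "..."},
    {"instruction": "Summarize the text.", "input": "A photo of a cat was shown.", "output": "..."},
]

flagged = flag_records(sample)
print(json.dumps(flagged, indent=2))
```

Because the ids are list positions, they only stay valid for the exact version of the file they were computed from, so a shared list of flagged records would need to name that version.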

gururise commented 1 year ago

Hello @dvsrepo

What you've done so far is very interesting. I would definitely be open to sharing a set of flagged records to improve coordination.

gururise commented 1 year ago

I've incorporated the alpaca-garbage-collector-multilingual model into the Gradio GUI.