r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License
23 stars 7 forks

Update to Ubuntu Chat processing #80

Closed blester125 closed 11 months ago

blester125 commented 11 months ago

The full data lives here: https://huggingface.co/datasets/blester125/ubuntu-chat-dolma

This PR updates the Ubuntu chat processing to filter out empty chats. There were a lot of empty ones, which happens when a channel has no messages on a given day.
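
For illustration, the filter could look roughly like the sketch below. This is not the code from the PR: it assumes each chat is a dict with its content in a `text` field (as in Dolma-style records), and the function name `drop_empty_chats` and the sample ids are made up for this example.

```python
from typing import Iterable, Iterator


def drop_empty_chats(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only the records whose text contains at least one non-whitespace character.

    Assumption: each record stores the chat log for one channel and day
    under a "text" key; a missing or None value is treated as empty.
    """
    for record in records:
        text = record.get("text") or ""
        if text.strip():
            yield record


# Hypothetical sample records: one real chat and two empty channel-days.
sample = [
    {"id": "channel-a/day-1", "text": "<alice> hello\n<bob> hi"},
    {"id": "channel-a/day-2", "text": ""},
    {"id": "channel-b/day-1", "text": "   \n"},
]
kept = list(drop_empty_chats(sample))
```

Writing this as a generator means the records can be streamed from disk and filtered without holding the whole dataset in memory.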