r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Changes made while scraping the whole Project Gutenberg data. #78

Closed: blester125 closed this 8 months ago

blester125 commented 8 months ago

The full scrape is available here: https://huggingface.co/datasets/blester125/project-gutenberg-dolma

This PR includes some updates I made while processing all the Project Gutenberg data.

Changes include: