r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Library of Congress Public Domain Books #74

Open storytracer opened 4 months ago

storytracer commented 4 months ago

The Library of Congress Selected Digitized Books collection contains 135,500+ English public domain books with 47.6 billion tokens.

storytracer commented 4 months ago

I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

craffel commented 4 months ago

> I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

They look pretty noisy. Is there some really basic heuristic filtering we can do? Say, filter out lines where the majority of characters are not alphanumeric?
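A minimal sketch of the line-level filter suggested here, in plain Python. The function names, the 0.5 threshold, and the choices to ignore whitespace when computing the ratio and to keep blank lines are all assumptions for illustration, not something already implemented in this repo.

```python
def is_mostly_alphanumeric(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters are alphanumeric."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        # Assumption: keep blank lines so paragraph breaks survive filtering.
        return True
    alnum = sum(c.isalnum() for c in chars)
    return alnum / len(chars) >= threshold


def filter_noisy_lines(text: str, threshold: float = 0.5) -> str:
    """Drop lines where the majority of non-space characters are not alphanumeric."""
    kept = [line for line in text.splitlines() if is_mostly_alphanumeric(line, threshold)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Chapter One\n~~ ^^ .. ;; ::\n\nIt was a dark night."
    print(filter_noisy_lines(sample))
```

A filter like this works per line, so it would also drop legitimate lines that are mostly punctuation, such as dotted leaders in a table of contents.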

storytracer commented 4 months ago

In my experience, the OCR noise in digitized books is concentrated in the front matter, because books start with several blank pages containing library stamps or speckles which the OCR engine misinterprets as characters. The noise level after roughly the first 1-10 pages of each book should be fine. Since every book has a different number of pages in the front matter though, I couldn't think of a good heuristic yet.

craffel commented 4 months ago

Remove everything before the first N pure alphanumeric lines?
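One possible reading of this idea, sketched in Python. Several details are assumptions made for illustration: "pure alphanumeric" is relaxed to letters, digits, whitespace, and a small set of common punctuation; "first N" is taken to mean the first run of N consecutive clean lines; blank lines neither extend nor break a run; and text with no such run is returned unchanged. The names `is_clean_line` and `strip_front_matter` are hypothetical.

```python
# Assumption: ordinary prose punctuation does not count as noise.
ALLOWED_PUNCT = set(".,;:!?'\"-()")


def is_clean_line(line: str) -> bool:
    """True if the line has at least one alphanumeric character and nothing unusual."""
    stripped = line.strip()
    if not any(c.isalnum() for c in stripped):
        return False
    return all(c.isalnum() or c.isspace() or c in ALLOWED_PUNCT for c in stripped)


def strip_front_matter(text: str, n: int = 3) -> str:
    """Drop everything before the first run of `n` consecutive clean lines."""
    lines = text.splitlines()
    run_start = 0
    run_len = 0
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # blank lines neither extend nor break a run
        if is_clean_line(line):
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= n:
                return "\n".join(lines[run_start:])
        else:
            run_len = 0
    return text  # no qualifying run found: leave the text untouched


if __name__ == "__main__":
    book = "|| ~ ^^\n#* 3 @@\nChapter one begins here.\nIt was a quiet morning.\nNobody was awake yet."
    print(strip_front_matter(book, n=3))
```

The value of `n` and the allowed punctuation set would need tuning against real books, since a cleanly scanned title page could trigger the cut too early.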

storytracer commented 4 months ago

That could go wrong with unusual front matter; the range is really quite diverse. I can work with PleiAs to develop a heuristic or even a model, since they deal with a lot of OCR text as well and have developed a library for OCR metrics and a promising post-OCR correction model. But they also question whether a little bit of noise in the front matter actually makes any difference in training, so I would like to leave the OCR text untouched for now until we have more insight into that. It would be great to create a general post-OCR dolma tagger based on their research, which we could easily apply to many different datasets.

craffel commented 4 months ago

Noise is always bad!