r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

American Stories #27

Open StellaAthena opened 10 months ago

StellaAthena commented 10 months ago

Name: American Stories
Size: The paper reports 65.6 billion tokens total, and according to the authors only 75% of the documents are "Legible." This may project to ~50 billion tokens of usable text.
License: Public Domain
Description:

The American Stories dataset is a collection of full article texts extracted from historical U.S. newspaper images. It includes nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress. The dataset is designed to address the challenges posed by complex layouts and low OCR quality in existing newspaper datasets. It was created using a deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes, using efficient architectures originally designed for mobile devices to keep the pipeline scalable.

The resulting text can be used for several purposes: pre-training large language models to improve their understanding of historical English and world knowledge, and powering retrieval-augmented language models that make historical information more accessible, including interpretations of political events and details about people's ancestors. The structured article texts also enable transformer-based methods for applications such as detecting reproduced content, which is significantly more accurate than relying on the existing OCR alone. Beyond text, the dataset is a useful resource for developing multimodal layout analysis models and other multimodal applications; its vast size and silver-quality data make it well suited to research in this area.

Thoughts: This dataset seems pretty rough. Old text is pretty dubious all the time, but this probably needs substantial cleaning before we can use it.
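As a rough sanity check on the size estimate above, here is the arithmetic spelled out (the 65.6B and 75% figures are the ones quoted from the paper; note that 75% of documents is not exactly 75% of tokens, so this is only a projection):

```python
# Back-of-the-envelope projection of usable tokens in American Stories.
total_tokens = 65.6e9        # total tokens reported in the paper
legible_doc_fraction = 0.75  # fraction of documents classified as "Legible"

# Caveat: the legible fraction is over documents, not tokens, so this is
# only a rough projection of how much usable text remains.
usable_tokens = total_tokens * legible_doc_fraction
print(f"~{usable_tokens / 1e9:.1f}B usable tokens")  # ~49.2B
```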

soldni commented 10 months ago

I think they have about 15B cleaned? We should look more into it, for sure.

conceptofmind commented 10 months ago

Is the raw dataset available? There are quite a few restoration models that perform quite decently, or we could develop one ourselves.
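For context on what a cheap, classical restoration pass could look like (purely a stand-in for the learned restoration models mentioned above), here is a minimal sketch using OpenCV; the file names and parameter values are illustrative, not from any existing pipeline:

```python
import cv2

def clean_scan(path):
    """Denoise and binarize a newspaper scan before OCR.

    Classical stand-in for a learned restoration model; the denoising
    strength and threshold parameters are illustrative and would need
    tuning on real scans.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    # Non-local means denoising smooths paper texture and film grain.
    denoised = cv2.fastNlMeansDenoising(img, h=30)
    # Adaptive thresholding handles uneven illumination across the page.
    return cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
    )

if __name__ == "__main__":
    cv2.imwrite("scan_cleaned.png", clean_scan("scan.png"))  # hypothetical files
```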

StellaAthena commented 9 months ago

I'll email them and ask.

StellaAthena commented 9 months ago

@conceptofmind here's what they said:

Stella,

Thanks for writing. We just pulled the raw scans from Library of Congress using their Chronicling America API (and deleted them after processing). Their tech support is very helpful and can tell you the rate at which you can pull scans without getting blocked. With the lccn, you can also link the output back to a specific scan.

Melissa

Given how low-quality their OCR is, I think we should just redo it.
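A minimal sketch of what redoing the OCR from the raw scans might look like, following the workflow Melissa describes: pull page images from the Chronicling America API, keyed by lccn, and run them through an OCR engine. The URL pattern, the example lccn/date/sequence values, and the use of pytesseract are assumptions for illustration, not part of any agreed pipeline; actual request rates should follow whatever Library of Congress tech support recommends.

```python
import io
import time

import requests
from PIL import Image  # needs Pillow built with JPEG 2000 support for .jp2 files
import pytesseract

BASE = "https://chroniclingamerica.loc.gov"

def fetch_page_image(lccn, date, edition, seq, pause=5.0):
    """Download one newspaper page scan from Chronicling America.

    The URL pattern below mirrors the public page URLs
    (/lccn/<lccn>/<YYYY-MM-DD>/ed-<n>/seq-<n>.jp2); treat it as an
    assumption and verify against the API docs before a real crawl.
    """
    url = f"{BASE}/lccn/{lccn}/{date}/ed-{edition}/seq-{seq}.jp2"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    time.sleep(pause)  # be polite; LoC can advise on safe request rates
    return Image.open(io.BytesIO(resp.content))

def reocr_page(image):
    """Re-run OCR on a page image; Tesseract here is just a stand-in."""
    return pytesseract.image_to_string(image)

if __name__ == "__main__":
    # Hypothetical identifiers, purely for illustration.
    page = fetch_page_image(lccn="sn00000000", date="1900-01-01", edition=1, seq=1)
    print(reocr_page(page)[:500])
```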

conceptofmind commented 9 months ago

Will look into extending what Aran and I have worked on to include the super-resolution experiments I discussed with Luca.

This may end up being a unique thing as well lol. Will have to write up a doc on it.
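For the write-up, here is roughly where a super-resolution preprocessing step would slot in: upscale the scan before OCR so small newsprint glyphs cover more pixels. Plain Lanczos resampling stands in for a learned super-resolution model here, and the scale factor and file names are illustrative only.

```python
from PIL import Image

def upscale_for_ocr(path, scale=2):
    """Upscale a scan before OCR.

    Lanczos resampling is only a stand-in for a learned super-resolution
    model; the scale factor is illustrative.
    """
    img = Image.open(path).convert("L")
    return img.resize((img.width * scale, img.height * scale), Image.LANCZOS)

if __name__ == "__main__":
    upscale_for_ocr("scan.png").save("scan_x2.png")  # hypothetical files
```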