StellaAthena opened 1 year ago
I think they have about 15B tokens cleaned? We should look more into it, for sure.
Is the raw dataset available? There are quite a few restoration models that perform decently, or we have the ability to develop one ourselves.
I'll email them and ask.
@conceptofmind here's what they said:
Stella,

Thanks for writing. We just pulled the raw scans from Library of Congress using their Chronicling America API (and deleted them after processing). Their tech support is very helpful and can tell you the rate at which you can pull scans without getting blocked. With the lccn, you can also link the output back to a specific scan.

Melissa
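For anyone wanting to reproduce the linkage Melissa describes, here is a minimal sketch of how the lccn ties output back to a specific scan, assuming the public URL scheme at chroniclingamerica.loc.gov (the lccn, date, edition, and sequence values below are hypothetical placeholders, not real records):

```python
# Build Chronicling America URLs from an LCCN so processed OCR output
# can be linked back to the specific scanned newspaper page it came from.
# The lccn/date/edition/sequence values used in the example are hypothetical.

BASE = "https://chroniclingamerica.loc.gov"

def title_metadata_url(lccn: str) -> str:
    """JSON metadata for a newspaper title identified by its LCCN."""
    return f"{BASE}/lccn/{lccn}.json"

def page_url(lccn: str, date: str, edition: int, sequence: int) -> str:
    """JSON record for a single scanned page (date is YYYY-MM-DD)."""
    return f"{BASE}/lccn/{lccn}/{date}/ed-{edition}/seq-{sequence}.json"

if __name__ == "__main__":
    lccn = "sn00000001"  # hypothetical LCCN
    print(title_metadata_url(lccn))
    print(page_url(lccn, "1905-01-01", 1, 1))
```

Fetching these URLs (e.g. with `urllib.request`) returns JSON that includes links to the page images themselves; as Melissa notes, check with LoC tech support on a safe request rate before bulk pulling.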
Given how low quality their OCR is, I think we should just re-do it.
Will look into extending what Aran and I have worked on to include the super-resolution experiments I discussed with Luca.
This may end up being a unique thing as well lol. Will have to write up a doc on it.
Name: American Stories
Size: The paper reports 65.6 billion tokens total, and according to the authors only 75% of the documents are "Legible." This may project to ~50 billion tokens of usable text.
License: Public Domain
Description: