raduangelescu / gutenbergpy

Gutenberg cache and query library
MIT License

PG now has clean txt downloads #23

Open eshellman opened 3 months ago

eshellman commented 3 months ago

Much of the code in this repo is no longer needed because PG now supplies clean text files, 100% UTF-8, with uniform headers, at URLs like https://gutenberg.org/cache/epub/65869/pg65869.txt. Also available are CSV metadata files and a zipped tar archive of the txt files at https://gutenberg.org/cache/epub/feeds/ (these are regularly updated).
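For reference, a minimal sketch of fetching one of these files using only the standard library. It assumes the URL pattern generalizes from the single example above to other book IDs; the function names `clean_text_url` and `fetch_clean_text` are hypothetical and are not part of gutenbergpy.

```python
from urllib.request import urlopen

# Assumption: the pattern is inferred from the one example URL in this thread.
URL_TEMPLATE = "https://gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def clean_text_url(book_id: int) -> str:
    """Build the URL of the clean UTF-8 text file for a PG book ID."""
    return URL_TEMPLATE.format(book_id=book_id)


def fetch_clean_text(book_id: int, timeout: float = 30.0) -> str:
    """Download the clean text file and decode it as UTF-8."""
    with urlopen(clean_text_url(book_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_clean_text(65869)` would request the example file above. For bulk use, the tar archive in the feeds directory is probably the better choice than one request per book.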

raduangelescu commented 3 months ago

I looked a little at the CSV metadata, and it seems it does not contain some info such as publisher, rights, and number of downloads. I am not sure that all books have 100% UTF-8 files with uniform headers, but where they do, I think I already use them in the textget function, if I remember correctly. They still need some minor cleanup if people plan to use them in ML; otherwise the uniform header fields will bias their data. Thanks for the info, though. I will need to investigate more, as I would gladly ditch RDF.
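A rough sketch of the kind of cleanup meant here: cutting everything outside the start and end marker lines so the header and license text do not end up in training data. It assumes the files use the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` lines; `strip_pg_boilerplate` is a hypothetical helper, not an existing gutenbergpy function.

```python
import re

# Assumption: the body sits between these two marker lines.
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$",
    re.MULTILINE | re.IGNORECASE,
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_pg_boilerplate(text: str) -> str:
    """Return the text between the start and end markers.

    If a marker is missing, the corresponding end of the text is kept.
    """
    start = _START.search(text)
    body_start = start.end() if start else 0
    end = _END.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()
```

Falling back to the full text when a marker is missing is a deliberate choice, so that a file with an unexpected header is passed through rather than silently emptied.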

eshellman commented 3 months ago

The source txt files that gutenbergpy uses do indeed need cleanup and are not uniformly UTF-8, which is why we now generate clean utf8 text files.

Neither the RDF files nor the CSV files contain original publisher information. We're working on it, but we'll need help to make it reasonably complete.

Rights info is contained in the uniform header.
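As an illustration, the header could be read into a dictionary along these lines. This is only a sketch: it assumes the header consists of `Key: value` lines placed before the `*** START OF` marker, and this thread does not say which field or paragraph carries the rights statement, so the hypothetical `parse_header_fields` simply returns every field it finds.

```python
import re

# Assumption: header fields look like "Title: Some Title", one per line.
_FIELD = re.compile(r"^([A-Z][A-Za-z ]{0,30}):\s+(.+)$")


def parse_header_fields(text: str) -> dict[str, str]:
    """Collect 'Key: value' lines that appear before the start marker."""
    marker = text.find("*** START OF")
    header = text[:marker] if marker != -1 else ""
    fields: dict[str, str] = {}
    for line in header.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            # Keep the first occurrence if a key is repeated.
            fields.setdefault(match.group(1), match.group(2))
    return fields
```

If no start marker is found, the function returns an empty dictionary rather than scanning the book body for lines that happen to contain a colon.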

Let me know if I can answer any questions!

Eric
