openzim / gutenberg

Scraper for downloading the entire ebook repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Support Wikisource EPUB import #93

Open kelson42 opened 4 years ago

kelson42 commented 4 years ago

... through OPDS feed https://tools.wmflabs.org/wsexport/wikisource-fr-good.atom
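
For illustration, a minimal sketch of how EPUB links could be pulled out of an OPDS (Atom) feed like this one. The sample feed below is made up and not copied from the real wsexport feed, and `epub_links` is a hypothetical helper name. It assumes entries expose their EPUB through an OPDS acquisition link, as the OPDS spec describes; a real importer would fetch the feed over HTTP and handle pagination.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
OPDS_ACQUISITION = "http://opds-spec.org/acquisition"
EPUB_MIME = "application/epub+zip"

# Made-up sample feed for illustration only (not the real Wikisource feed).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample catalog</title>
  <entry>
    <title>Example Book</title>
    <id>urn:example:book-1</id>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="https://example.org/books/example-book.epub"/>
    <link rel="alternate" type="text/html"
          href="https://example.org/books/example-book"/>
  </entry>
</feed>
"""


def epub_links(feed_xml):
    """Return (title, href) pairs for every EPUB acquisition link in the feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.iter(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title", default="")
        for link in entry.findall(ATOM_NS + "link"):
            rel = link.get("rel", "")
            # Acquisition rels may carry a suffix such as "/open-access".
            if rel.startswith(OPDS_ACQUISITION) and link.get("type") == EPUB_MIME:
                results.append((title, link.get("href")))
    return results


links = epub_links(SAMPLE_FEED)
```

Here `links` holds one `(title, href)` pair per EPUB found, which is the minimum a scraper would need to download each book.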

kelson42 commented 4 years ago

@Tpt Maybe you can help here?

kelson42 commented 4 years ago

@eshellman If Project Gutenberg provided an OPDS stream as well, that would make things much easier and quicker to run.

eshellman commented 4 years ago

Gutenberg's OPDS feed dates from the early days of OPDS and it shows - we'll probably jump to v2 instead of changing it. http://www.gutenberg.org/ebooks.opds/

kelson42 commented 4 years ago

@rgaudin Hmmm... any reason you can remember why we didn't use it 5 years ago, when we created gutenberg2zim?

rgaudin commented 4 years ago

I don't recall. Did it exist back then? Source says:

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.

This catalog file (272MiB) looks like a good base for metadata but it only contains IDs, not links to the contents. I think that's why we had to rsync stuff.
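
One way to check what the dump actually lists, without loading all of it into memory, would be to stream it and count element types. This is only a sketch: `count_tags` is a hypothetical helper, and the tiny sample below uses made-up element names, not the real catalog schema.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(bz2_source):
    """Count element tags in a bz2-compressed XML file or file object."""
    counts = Counter()
    # bz2.open accepts a path or an existing binary file object.
    with bz2.open(bz2_source, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            counts[elem.tag] += 1
            elem.clear()  # drop content of elements we are done with
    return counts


# Tiny made-up stand-in for the real dump; element names are hypothetical.
SAMPLE = b"<catalog><book id='1'><file href='a.epub'/></book><book id='2'/></catalog>"
tag_counts = count_tags(io.BytesIO(bz2.compress(SAMPLE)))
```

Running `count_tags` on a local copy of `catalog.rdf.bz2` would show whether file-level elements are present alongside the book records.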

kelson42 commented 4 years ago

@rgaudin Thank you very much for this quick but insightful analysis. @eshellman Any chance we can (1) use it for scraping and (2) get the important information (links) within the OPDS stream?

eshellman commented 4 years ago

The nasty note is an artifact of the templating system; it can be ignored. Is (2) referring to the RDF dump? Because every file should be listed there. Maybe not the easiest format, but I have scripts to do the conversion. Based on our conversation, I had assumed that adding this would be a relatively easy way to improve the scraper. I'll ask the students today if they want to tackle it; otherwise I'll put it on my own list.

kelson42 commented 4 years ago

@eshellman Everything is feasible, and probably easy. I'm just trying to figure out what the best approach would be. I will move the discussion about simplifying Gutenberg scraping to another ticket (this ticket is primarily about Wikisource). If you have other sources of ebooks (which you do), it would be great if you could open one ticket per source and give a few details about each new source, in particular the format of its catalog.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.