Open kelson42 opened 4 years ago
@Tpt Maybe you can help here?
@eshellman If the Gutenberg project provided an OPDS stream as well, that would make things much easier and quicker to run.
Gutenberg's OPDS feed dates from the early days of OPDS and it shows - we'll probably jump to v2 instead of changing it. http://www.gutenberg.org/ebooks.opds/
@rgaudin Hmmm... any reason you can remember why we didn't use it 5 years ago, when we created gutenberg2zim?
I don't recall. Did it exist back then? Source says:
DON'T USE THIS PAGE FOR SCRAPING.
Seriously. You'll only get your IP blocked.
Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.
This catalog file (272MiB) looks like a good base for metadata but it only contains IDs, not links to the contents. I think that's why we had to rsync stuff.
@rgaudin Thank you very much for this quick but insightful analysis. @eshellman Any chance we can (1) use it for scraping and (2) get the important information (links) within the OPDS stream?
The nasty note is an artifact of the templating system; it can be ignored. Is (2) referring to the RDF dump? Because every file should be listed there. Maybe not the easiest format, but I have scripts to do the conversion. Based on our conversation, I had assumed that adding this would be a relatively easy way to improve the scraper. I'll ask the students today if they want to tackle it, otherwise I'll put it on my own list.
@eshellman Everything is feasible, and probably easy. I am just trying to figure out the best approach. I will move the discussion about simplifying Gutenberg scraping to another ticket (this ticket is primarily about Wikisource). If you have other sources of ebooks (which you do), it would be great if you could open one ticket per source and give a few details about each new source, in particular the format of its catalog.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
... through OPDS feed https://tools.wmflabs.org/wsexport/wikisource-fr-good.atom
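For anyone picking this up, here is a minimal sketch of how download links could be pulled out of an OPDS (Atom) feed with the Python standard library. It assumes the feed marks downloadable files with a `rel` value starting with `http://opds-spec.org/acquisition`, as the OPDS spec describes; the sample feed, its URLs, and the `acquisition_links` helper are made up for illustration and are not taken from the Gutenberg or wsexport feeds.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by OPDS 1.x feeds.
ATOM = "{http://www.w3.org/2005/Atom}"
# OPDS marks downloadable files with rel values that start with this prefix.
ACQUISITION_PREFIX = "http://opds-spec.org/acquisition"

# Hypothetical feed, made up for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample catalog</title>
  <entry>
    <title>Example Book</title>
    <id>urn:example:book:1</id>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="https://example.org/books/1.epub"/>
    <link rel="alternate" type="text/html"
          href="https://example.org/books/1"/>
  </entry>
  <entry>
    <title>Entry Without Download</title>
    <id>urn:example:book:2</id>
  </entry>
</feed>"""


def acquisition_links(feed_xml):
    """Return one dict per entry with its title and (type, href) download links."""
    root = ET.fromstring(feed_xml)
    books = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        links = [
            (link.get("type"), link.get("href"))
            for link in entry.findall(ATOM + "link")
            if (link.get("rel") or "").startswith(ACQUISITION_PREFIX)
        ]
        books.append({"title": title, "links": links})
    return books


if __name__ == "__main__":
    for book in acquisition_links(SAMPLE_FEED):
        print(book["title"], book["links"])
```

A real scraper would fetch the feed over HTTP and follow `rel="next"` pagination links if the feed provides them; both are left out here to keep the sketch offline and short.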