openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Simplify Gutenberg scraping (no more rsync, no more fallback URLs / filenames) #97

Open kelson42 opened 4 years ago

kelson42 commented 4 years ago

Gutenberg scraping is pretty touchy and takes time to run. We should see if it can be simplified.

A start of this discussion has happened here: https://github.com/openzim/gutenberg/issues/93#issuecomment-552984940

kelson42 commented 4 years ago

@dattaz @rgaudin Would you be able to summarise why this is so slow/complicated? So we can assess alternatives in a second step.

kelson42 commented 4 years ago

Today, we download http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2, which has metadata for all books (is that the same kind of data as https://www.gutenberg.org/feeds/catalog.rdf.bz2?). This process (download + parsing) takes a few hours. It looks like the RDF is not 100% reliable, because we run an rsync to list all the files on the PG server so we can check that the URLs given by the RDF really exist there. Only then do we have the correct EPUB URLs and can rely on them to download the data.
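For reference, the rsync step is essentially a recursive listing of the mirror; a minimal sketch of what it amounts to (the rsync endpoint below is an assumption for illustration, not necessarily the one the scraper points at):

```python
import subprocess

def list_mirror_files(endpoint="rsync://dante.pglaf.org/gutenberg"):
    """Return the relative paths advertised by a Project Gutenberg mirror.

    The endpoint above is an assumed example; the scraper may use a
    different mirror or rsync module.
    """
    # --list-only prints: permissions, size, date, time, path (one entry per line)
    listing = subprocess.run(
        ["rsync", "-r", "--list-only", endpoint],
        capture_output=True, text=True, check=True,
    ).stdout
    paths = []
    for line in listing.splitlines():
        fields = line.split(None, 4)  # the 5th field is the path, may contain spaces
        if len(fields) == 5:
            paths.append(fields[4])
    return paths
```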

@eshellman So the question is: OPDS is a standard and looks easier for us to deal with... but if it suffers from the same data-quality problems as the RDF we use today, then we will end up with exactly the same kind of "bad solution".

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

eshellman commented 3 years ago

It seems I never commented on this. The PG OPDS is archaic. It does things like embedding covers as blobs in the XML.

Probably I meant to take a look at the rdf parsing code to see what the inefficiency is. I'm pretty sure my own parsing code doesn't take nearly as long: https://github.com/gitenberg-dev/gitberg/blob/fc3da308f3ccdfe034d2e873efff9adf6a66730f/gitenberg/metadata/pg_rdf.py#L267

kelson42 commented 3 years ago

@eshellman From our perspective having a proper OPDS would be preferable, because this is a standard.

eshellman commented 3 years ago

we'll probably go straight to opds 2.0 (json-based) sometime this year

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

rgaudin commented 1 year ago

Has this OPDS2 effort gone through? https://m.gutenberg.org/ebooks.opds/ is still 1.1 and XML, and returns:

<!--
DON'T USE THIS PAGE FOR SCRAPING.
Seriously. You'll only get your IP blocked.
Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.
-->
kelson42 commented 1 year ago

The best person to answer is probably @eshellman.

eshellman commented 1 year ago

I would not rely on that feed, as it was coded before the OPDS standard had stabilized, and does not get much usage. Implementation of OPDS 2 will happen eventually; most of 2022 was spent on the EPUB3 and HTML5 output files. In addition to the RDF, there is now a CSV file that gets a fair amount of use.

benoit74 commented 1 year ago

I just had a look into some stuff around this issue, and this is what I understood (do not hesitate to correct me if I'm wrong).

When the scraper starts, we now (2022) retrieve two sources: the RDF tar and an rsync listing of the mirror.

The RDF tar is opened up and every individual RDF file is parsed to get the book metadata and file URLs.

The rsync listing populates a table of relative URLs found on the server.

Then, for every book being downloaded, the needed files are fetched.

These files are then optimized / cached / stored in the final ZIM.

Some remarks:

The links found in the various RDFs seem to always point to the www.gutenberg.org server; is it OK to use it for scraping? (I have no idea why we use dante.pglaf.org instead of www.gutenberg.org.)

If the remarks above are true, and since it is now quite easy / fast to get a list of book IDs (via the new CSV file), we can imagine a new processing structure where we first get this CSV and build a list of book IDs, and then directly jump to the processing of individual books, downloading each book's RDF and then its files.

This would mean no more need to rsync tons of stuff, and no more need to untar all the RDFs if we want to extract only a few books (particularly important for testing and for specific languages), so overall very limited upfront time. There would probably be no big change in overall processing time when running a full scrape, since we would download all RDFs one by one instead of one huge tar. But it is definitely a simpler process from my PoV, and something faster for small tests.
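For illustration, a minimal sketch of that first step; the CSV location and the "Text#" / "Language" column names are assumptions based on the publicly documented pg_catalog.csv, not necessarily what the scraper would use:

```python
import csv
import io
import urllib.request

# Assumed catalog location and column names; adjust if the real file differs.
CATALOG_CSV = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv"

def book_ids_for_languages(languages):
    """Return PG book IDs whose catalog language matches one of the requested codes."""
    wanted = {lang.strip().lower() for lang in languages}
    with urllib.request.urlopen(CATALOG_CSV) as resp:
        text = resp.read().decode("utf-8")
    ids = []
    for row in csv.DictReader(io.StringIO(text)):
        book_langs = {l.strip().lower() for l in row["Language"].split(";")}
        if book_langs & wanted:
            ids.append(int(row["Text#"]))
    return ids

# e.g. book_ids_for_languages(["fr"]) -> list of French book IDs
```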

rgaudin commented 1 year ago

since it is now quite easy / fast to get a list of book IDs (via the new CSV file), we can imagine a new processing structure where we first get this CSV and build a list of book IDs, and then directly jump to the processing of individual books, downloading each book's RDF and then its files.

The CSV file does include the list of languages, which is great, but it doesn't include the list of formats, so it would not be sufficient to run all the filters (languages, formats, book_ids). It would still work (you'd filter once you've parsed the RDF) but you wouldn't know in advance which books you'd need to process.

One thing that could be done is to replace the indiscriminate extraction and parsing of all RDF files, which takes a lot of time because it has to go through the filesystem. Instead, we could (after filtering book_ids from the language request via the CSV) extract from rdf-files.tar.bz2 (in memory) and parse (from memory) only the individual book IDs we need.

This would be a lot faster on small selections and probably still faster on large ones, but it bundles two consecutive tasks together.
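A rough sketch of that in-memory, selective extraction (assuming the members inside rdf-files.tar.bz2 follow the cache/epub/<id>/pg<id>.rdf layout; names here are illustrative, not the scraper's actual code):

```python
import tarfile

def iter_selected_rdfs(tar_path, book_ids):
    """Yield (book_id, rdf_bytes) for the requested books only,
    reading members straight from the archive instead of extracting
    everything to the filesystem."""
    wanted = {f"cache/epub/{i}/pg{i}.rdf": i for i in book_ids}
    with tarfile.open(tar_path, mode="r:bz2") as tar:
        for member in tar:
            book_id = wanted.get(member.name)
            if book_id is None:
                continue
            with tar.extractfile(member) as fh:
                yield book_id, fh.read()
```

This still streams through the whole archive once, but it skips the filesystem round-trip entirely.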

I don't see how the rsync step can be replaced though, and this is the longest one for me (40m today) because we use it to match filenames. One thing we could do is save its result in the optimization cache. As the recipe runs periodically on a Zimfarm worker, there would be a somewhat recent file available for developers to use. Kind of hackish, but it could be useful.

I have no idea why we use dante.pglaf.org instead of www.gutenberg.org

It's the mirror, and that's the recommended way to do it.

eshellman commented 1 year ago

www.gutenberg.org has load balancers and security hardware in front of it that aren't very friendly to scrapers; it's architected for large numbers of users. dante just has bandwidth.

benoit74 commented 1 year ago

Instead, we could (after filtering book_ids from the language request via the CSV) extract from rdf-files.tar.bz2 (in memory) and parse (from memory) only the individual book IDs we need.

This makes a lot of sense to me.

I don't see how the rsync step can be replaced though, and this is the longest one for me (40m today) because we use it to match filenames.

I'm not sure I get this. Why do you have to match filenames? Do you mean that the URL on dante is not the same as the one on www.gutenberg.org, which is the URL we get from the RDF?

rgaudin commented 1 year ago

I'm not sure I get this. Why do you have to match filenames? Do you mean that the URL on dante is not the same as the one on www.gutenberg.org, which is the URL we get from the RDF?

I'm not sure exactly, but I believe there are many files referenced in the RDF that are not actually present on the server, and testing online URLs each time was too slow, inefficient and wasteful.

I'm not sure how much this is linked to the fact that we were using different mirrors in the past.

Anyway, now that the scraper is in better shape, I'd advise you to test whether we still need that rsync step or not.

It should be fairly easy: loop through the file entries in all of the RDF files and check whether those URLs are present in the rsync listing. If none is missing, it would mean we can trust the RDF files.
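A minimal sketch of that check, assuming we already have the file URLs per book from the RDFs and the set of relative paths from the rsync listing (helper names are illustrative):

```python
from urllib.parse import urlparse

def urls_missing_from_mirror(rdf_urls, rsync_paths):
    """Return the RDF file URLs whose relative path does not appear in the rsync listing.

    rdf_urls: iterable of absolute URLs taken from the RDF files
    rsync_paths: set of relative paths produced by the rsync --list-only step
    """
    missing = []
    for url in rdf_urls:
        relative = urlparse(url).path.lstrip("/")
        if relative not in rsync_paths:
            missing.append(url)
    return missing

# An empty result over the whole catalog would suggest the RDF URLs
# can be trusted and the rsync step dropped.
```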

eshellman commented 1 year ago

If you have examples of missing files, I can check for you what the issue is.

kelson42 commented 1 year ago

Actually, if such information is welcome on the PG side, it would be better IMO to share errors with PG so they can fix them, rather than inventing solutions to circumvent them.

eshellman commented 1 year ago

Without seeing examples, it's hard to know where the problem lies.

benoit74 commented 1 year ago

I will perform a full comparison of the data sources we use and let you know if I find anything unexpected, so we can decide collectively what to do. Probably this week or the next.

benoit74 commented 1 year ago

I may finally have gained a bit more understanding of why we need to rsync URLs from dante.pglaf.org.

Here is some sample data for book ID 1088:

Mime type: text/html
URL in RDF: https://www.gutenberg.org/files/1088/1088-h.zip
File downloaded from mirror: http://dante.pglaf.org/1/0/8/1088/1088-h.zip

Mime type: application/epub+zip
URL in RDF: https://www.gutenberg.org/ebooks/1088.epub.images
File downloaded from mirror: http://dante.pglaf.org/cache/epub/1088/pg1088.epub

We can see that the path used on dante.pglaf.org is not at all the same relative path as the one mentioned in the RDF. While on www.gutenberg.org both path structures work (/files/1088/1088-h.zip or /1/0/8/1088/1088-h.zip), only the second one works on dante.pglaf.org.

We also see that even the filename is very different for the EPUB.

I also had a look at the RDF file present at http://dante.pglaf.org/cache/epub/1088/pg1088.rdf, but it also mentions only www.gutenberg.org URLs.

And there is almost the same issue with the cover image, where multiple resolutions are mentioned in the RDF but only one is available on dante.pglaf.org, with a different filename.
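For illustration, a rough sketch of the translation these two examples imply, derived only from the book 1088 URLs above; it is not the logic Project Gutenberg uses and certainly misses edge cases:

```python
import re

def guess_archive_path(canonical_url):
    """Guess the mirror-relative path for a canonical www.gutenberg.org URL.

    Derived from the two book 1088 examples above; the real mapping rules
    live on the PG side and cover more cases than this.
    """
    m = re.match(r"https?://www\.gutenberg\.org/files/(\d+)/(.+)$", canonical_url)
    if m:
        num, filename = m.groups()
        prefix = "/".join(num[:-1]) or num  # "1088" -> "1/0/8"
        return f"{prefix}/{num}/{filename}"
    m = re.match(r"https?://www\.gutenberg\.org/ebooks/(\d+)\.epub(?:\.images|\.noimages)?$", canonical_url)
    if m:
        num = m.group(1)
        return f"cache/epub/{num}/pg{num}.epub"
    return None

# guess_archive_path("https://www.gutenberg.org/files/1088/1088-h.zip")
#   -> "1/0/8/1088/1088-h.zip"
# guess_archive_path("https://www.gutenberg.org/ebooks/1088.epub.images")
#   -> "cache/epub/1088/pg1088.epub"
```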

I'm a bit lost regarding what we should do next if we want to simplify the scraping further / get rid of the rsync.

@eshellman, are you aware of all this? Any suggestion?

eshellman commented 1 year ago

We could update the RDF (it hasn't been touched in 5 years), or I could contribute some code that does the translation, or both. The first case is a lack of symlinks; the second is an Apache rewrite directive.


eshellman commented 1 year ago

I looked at the code generating the RDF, and I could just delete all the code that does the file-URL transformation! Will ask around to see if there's any reason not to.


benoit74 commented 1 year ago

Either or both changes would be more than welcome! Thank you, keep us informed.

eshellman commented 1 year ago

I have created a small module bringing together the relevant code, with a method that turns the canonical URLs into archive URLs: https://github.com/gutenbergtools/libgutenberg/blob/master/pg_archive_urls.py

I've identified 3 files (out of 1.2 million) in a legacy format that need fixing on our side.

eshellman commented 1 year ago

It didn't make sense to mess with the RDF.

rgaudin commented 1 year ago

Great news! We'll test and integrate it.

benoit74 commented 1 year ago

Looks very promising, thank you very much!

I've integrated it manually (i.e. copy/pasted the code until you release it) and modified the DB schema to store more info for debugging. I will perform a full run of the parser to compare the URLs we currently download (guessed from rsync results + patterns) with the URLs available in the RDF as translated by your lib. I'll let you know.

A first try on one single book is OK (i.e. we could have used the RDF URLs directly).

benoit74 commented 1 year ago

I did a full run, looking for all formats (epub, html, pdf) and all languages.

Only 59251 books have at least one of those formats. I'm downloading a full archive to confirm this number is OK, but it looks right. By the way, is there any plan to support additional formats (text/plain + audio books)? Or any reason not to add them?

For most books, the URLs we download for epub, html, pdf and the cover image (which we currently guess from rsync results + patterns) are already present in the RDF files and are equal once converted through the small module mentioned above.

The only exceptions are below. @eshellman could you have a look and confirm this should / could be fixed on your side?

book_id|format|download_url
5031|html|http://dante.pglaf.org/5/0/3/5031/5031-h.zip
11220|html|http://dante.pglaf.org/cache/epub/11220/pg11220.html.utf8
15831|pdf|http://dante.pglaf.org/1/5/8/3/15831/15831-pdf.pdf
10802|html|http://dante.pglaf.org/cache/epub/10802/pg10802.html.utf8
28701|html|http://dante.pglaf.org/2/8/7/0/28701/28701-h.zip
28803|html|http://dante.pglaf.org/2/8/8/0/28803/28803-h.zip
28821|html|http://dante.pglaf.org/2/8/8/2/28821/28821-h.zip
28959|html|http://dante.pglaf.org/2/8/9/5/28959/28959-h.zip
28969|html|http://dante.pglaf.org/2/8/9/6/28969/28969-h.zip
31100|html|http://dante.pglaf.org/3/1/1/0/31100/31100-h.zip
29156|html|http://dante.pglaf.org/2/9/1/5/29156/29156-h.zip
29434|html|http://dante.pglaf.org/2/9/4/3/29434/29434-h.zip
29441|html|http://dante.pglaf.org/2/9/4/4/29441/29441-h.zip
29467|html|http://dante.pglaf.org/2/9/4/6/29467/29467-h.zip
30580|html|http://dante.pglaf.org/3/0/5/8/30580/30580-h.zip
41450|html|http://dante.pglaf.org/4/1/4/5/41450/41450-h.zip
51830|html|http://dante.pglaf.org/5/1/8/3/51830/51830-h.zip
66127|html|http://dante.pglaf.org/6/6/1/2/66127/66127-h.zip
69909|cover|http://dante.pglaf.org/cache/epub/69909/pg69909.cover.medium.jpg
69910|cover|http://dante.pglaf.org/cache/epub/69910/pg69910.cover.medium.jpg
69911|cover|http://dante.pglaf.org/cache/epub/69911/pg69911.cover.medium.jpg
69912|cover|http://dante.pglaf.org/cache/epub/69912/pg69912.cover.medium.jpg
69913|cover|http://dante.pglaf.org/cache/epub/69913/pg69913.cover.medium.jpg
69915|cover|http://dante.pglaf.org/cache/epub/69915/pg69915.cover.medium.jpg
69916|cover|http://dante.pglaf.org/cache/epub/69916/pg69916.cover.medium.jpg
69917|cover|http://dante.pglaf.org/cache/epub/69917/pg69917.cover.medium.jpg
69918|cover|http://dante.pglaf.org/cache/epub/69918/pg69918.cover.medium.jpg
69919|cover|http://dante.pglaf.org/cache/epub/69919/pg69919.cover.medium.jpg
69920|cover|http://dante.pglaf.org/cache/epub/69920/pg69920.cover.medium.jpg
69921|cover|http://dante.pglaf.org/cache/epub/69921/pg69921.cover.medium.jpg
69922|cover|http://dante.pglaf.org/cache/epub/69922/pg69922.cover.medium.jpg
69923|cover|http://dante.pglaf.org/cache/epub/69923/pg69923.cover.medium.jpg
69924|cover|http://dante.pglaf.org/cache/epub/69924/pg69924.cover.medium.jpg
69925|cover|http://dante.pglaf.org/cache/epub/69925/pg69925.cover.medium.jpg
69926|cover|http://dante.pglaf.org/cache/epub/69926/pg69926.cover.medium.jpg
69927|cover|http://dante.pglaf.org/cache/epub/69927/pg69927.cover.medium.jpg
69928|cover|http://dante.pglaf.org/cache/epub/69928/pg69928.cover.medium.jpg
69930|cover|http://dante.pglaf.org/cache/epub/69930/pg69930.cover.medium.jpg
69931|cover|http://dante.pglaf.org/cache/epub/69931/pg69931.cover.medium.jpg
69932|cover|http://dante.pglaf.org/cache/epub/69932/pg69932.cover.medium.jpg
69934|cover|http://dante.pglaf.org/cache/epub/69934/pg69934.cover.medium.jpg
69935|cover|http://dante.pglaf.org/cache/epub/69935/pg69935.cover.medium.jpg
69936|cover|http://dante.pglaf.org/cache/epub/69936/pg69936.cover.medium.jpg
69937|cover|http://dante.pglaf.org/cache/epub/69937/pg69937.cover.medium.jpg
69938|cover|http://dante.pglaf.org/cache/epub/69938/pg69938.cover.medium.jpg
69939|cover|http://dante.pglaf.org/cache/epub/69939/pg69939.cover.medium.jpg

Anyway, this is probably a significant first confirmation that your module works fine and that we can probably get rid of the rsync step, and maybe of other complexities used to guess the appropriate file name for the various formats. I will continue to explore this by looking at how to select the appropriate file from the RDF for the three formats we currently support.
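As a starting point for that exploration, a small sketch of reading the file entries and MIME types out of a single book's RDF; the namespaces and element layout are assumptions based on the published pgNNNN.rdf files, so verify against a real one:

```python
import xml.etree.ElementTree as ET

# Assumed namespaces of the Project Gutenberg RDF files.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

def files_with_mime_types(rdf_bytes):
    """Map each file URL declared in a book's RDF to its list of MIME types."""
    root = ET.fromstring(rdf_bytes)
    files = {}
    for file_el in root.iter(f"{{{NS['pgterms']}}}file"):
        url = file_el.get(f"{{{NS['rdf']}}}about")
        mimes = [v.text for v in file_el.iter(f"{{{NS['rdf']}}}value") if v.text]
        files[url] = mimes
    return files

# Selecting the HTML variant could then look like:
# html_urls = [u for u, mimes in files_with_mime_types(data).items()
#              if any(m.startswith("text/html") for m in mimes)]
```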

benoit74 commented 1 year ago

The number 59251 is not OK; it looks like I'm missing some URLs from rsync... I might have messed up with messy data... I will run it once more with the whole rsync step to confirm, as I might have missed some records.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

benoit74 commented 3 months ago

I've updated this issue title to better reflect the current state of the discussion here.

eshellman commented 3 months ago
  1. I have updated https://github.com/gutenbergtools/libgutenberg/blob/master/pg_archive_urls.py
  2. PG is now generating a zip file for the html5 version of every book including all of the images. I think these will be much easier to use for openzim, as well as more efficient wrt bandwidth.
  3. I think I've mentioned this before, but I maintain a list of PG numbers that are not books (and won't have html5/zips available) or are not being used: https://github.com/gitenberg-dev/gitberg/blob/master/gitenberg/data/missing.tsv
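For illustration, a small sketch of excluding those numbers up front; the raw URL is derived from the repository link above, and the column layout is an assumption (only the first tab-separated field is read and treated as the PG number):

```python
import urllib.request

# Raw counterpart of the missing.tsv link above; layout assumption: the first
# field of each data row is the PG number.
MISSING_TSV = (
    "https://raw.githubusercontent.com/gitenberg-dev/gitberg/"
    "master/gitenberg/data/missing.tsv"
)

def ids_to_skip():
    """Return the set of PG numbers that should not be scraped."""
    with urllib.request.urlopen(MISSING_TSV) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    skip = set()
    for line in lines:
        first_field = line.split("\t", 1)[0].strip()
        if first_field.isdigit():  # ignore header or comment rows
            skip.add(int(first_field))
    return skip

# candidate_ids = [i for i in candidate_ids if i not in ids_to_skip()]
```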