openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Survivor library #1130

Open tprice0424 opened 1 month ago

tprice0424 commented 1 month ago

Please use the following format for a ZIM creation request (and delete unnecessary information)

tprice0424 commented 1 month ago

https://www.survivorlibrary.com/index.php/main-library-index/ is the part I'm actually interested in: it's a list of different subjects, each packed full of PDF downloads.

RavanJAltaie commented 1 month ago

Recipe created: https://farm.openzim.org/recipes/survivorlibrary.com_en_all

I'll update the library link once it's ready.

tprice0424 commented 1 month ago

Thank ya so much

tprice0424 commented 4 weeks ago

> Recipe created: https://farm.openzim.org/recipes/survivorlibrary.com_en_all
>
> I'll update the library link once it's ready.

Will this be available to download and add to my library?

RavanJAltaie commented 3 weeks ago

https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-08 The PDFs are not scraped properly. @benoit74 could you please check that?

gordon-matt commented 3 weeks ago

I'm looking forward to this one as well. In fact, this is a duplicate of request #590.

schlegelt1 commented 3 weeks ago

> https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-08 The PDFs are not scraped properly. @benoit74 could you please check that?

It seems only the recently added PDFs were scraped and the rest of the library was untouched. There was an older scrape that was over 3x larger, so presumably that one gathered the whole collection.

tprice0424 commented 3 weeks ago

> It seems only the recently added PDFs were scraped and the rest of the library was untouched. There was an older scrape that was over 3x larger, so presumably that one gathered the whole collection.

Where can I find that, by chance?

schlegelt1 commented 2 weeks ago

Here. When you download it, though, it's only 7G despite stating 33G. The point I was trying to make was that the 7G version is missing quite a bit of data.

tprice0424 commented 2 weeks ago

> Here. When you download it, though, it's only 7G despite stating 33G. The point I was trying to make was that the 7G version is missing quite a bit of data.

Thanks. I'm just running some tests on it; this will be good information for when they get the whole thing, whenever that gets figured out.

benoit74 commented 1 week ago

Crawl has been interrupted due to a browser crash:

{"timestamp":"2024-08-14T21:28:02.062Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.survivorlibrary.com/library/Railroads.zip"}}
{"timestamp":"2024-08-14T21:28:02.062Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":488,"total":14749,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-08-14T21:28:01.922Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.survivorlibrary.com\\/library\\/Railroads.zip\",\"added\":\"2024-08-14T21:03:16.088Z\",\"depth\":2}"]}}
{"timestamp":"2024-08-14T21:28:32.074Z","logLevel":"warn","context":"fetch","message":"Direct fetch capture attempt timed out","details":{"seconds":30,"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:28:32.074Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:20.736Z","logLevel":"warn","context":"recorder","message":"Large streamed written to WARC, but not returned to browser, requires reading into memory","details":{"url":"https://www.survivorlibrary.com/library/Railroads.zip","actualSize":2181241056,"maxSize":5000000}}
{"timestamp":"2024-08-14T21:29:33.279Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-08-14T21:29:33.279Z","logLevel":"warn","context":"recorder","message":"Failed to load response body","details":{"url":"https://www.survivorlibrary.com/library/Railroads.zip","networkId":"C2F5ACEE365BF893AECC1FBDA2AB8275","type":"exception","message":"Protocol error (Fetch.getResponseBody): Target closed","stack":"TargetCloseError: Protocol error (Fetch.getResponseBody): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)\n    at #onClose (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:163:21)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:43:30)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:220:9)\n    at WebSocket.emit (node:events:519:28)\n    at WebSocket.emitClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:272:10)\n    at Socket.socketOnClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1341:15)\n    at Socket.emit (node:events:519:28)","page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:33.280Z","logLevel":"error","context":"general","message":"Page Load Failed, skipping page","details":{"msg":"Protocol error (Page.navigate): Target closed","loadState":0,"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:33.345Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-08-14T21:30:35.191Z","logLevel":"info","context":"writer","message":"Rollover size exceeded, creating new WARC","details":{"size":2359264272,"oldFilename":"rec-fb1bfb4f4017-20240814212746176-0.warc.gz","newFilename":"rec-fb1bfb4f4017-20240814213035191-0.warc.gz","rolloverSize":1000000000,"id":"0"}}
{"timestamp":"2024-08-14T21:30:35.340Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmpxh8drw0d/collections/crawl-20240814210212921/crawls/crawl-20240814213035-fb1bfb4f4017.yaml","details":{}}
{"timestamp":"2024-08-14T21:30:35.349Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":489,"total":14749,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-08-14T21:30:35.350Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-08-14T21:30:35.351Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

It was downloading https://www.survivorlibrary.com/library/Railroads.zip, which looks like a big ZIP of all the Railroads PDFs. I don't know where the crawler finds this link, but it does find it. Does anyone know on which page this link is displayed?
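
One way to answer that would be a brute-force scan of the category pages. A sketch, assuming the categories are linked from the main index (the index URL and the "/index.php/" filter are assumptions about the site's structure):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Fetch every category page linked from the main index and report
# which one(s) contain the target link.
INDEX = "https://www.survivorlibrary.com/index.php/main-library-index/"
TARGET = "library/Railroads.zip"

soup = BeautifulSoup(requests.get(INDEX, timeout=30).text, "html.parser")
categories = {urljoin(INDEX, a["href"])
              for a in soup.find_all("a", href=True)
              if "/index.php/" in a["href"]}
for url in sorted(categories):
    if TARGET in requests.get(url, timeout=30).text:
        print("link found on:", url)
```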

Since I don't think these ZIPs are very helpful in a ZIM (the individual PDFs provide much more value), I've reconfigured the recipe to only include PDFs. Let's wait and see how it goes: https://farm.openzim.org/pipeline/5c994a30-e608-4bf5-9837-e67f7b7c9819
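
The intent behind that change, expressed as a plain regex (a sketch of the scope rule only; the real recipe is configured through zimit/browsertrix options rather than code like this):

```python
import re

# Keep /library/ resources only when they are PDFs; ZIPs and other
# formats fall outside the scope.
PDF_ONLY = re.compile(r"^https://www\.survivorlibrary\.com/library/.+\.pdf$", re.I)

assert PDF_ONLY.match("https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf")
assert not PDF_ONLY.match("https://www.survivorlibrary.com/library/Railroads.zip")
```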

Popolechien commented 1 week ago

The .zip is at https://www.survivorlibrary.com/index.php/Railroads (the Railroads category). All the other files there are PDFs, but I suspect there may be a couple of other ZIP files here and there, seeing as the library is so big.

benoit74 commented 1 week ago

Oh, I got misled by the fact that the link says "PDF" ...

Popolechien commented 1 week ago

Looking at the last iteration, https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09, files seem to be scraped up to Farming; within that page, things stop working from Corn Culture 1910 onwards (the 9th item in the list, and not an unusually large document).

benoit74 commented 1 week ago

Looking at the log, browsertrix crawler failed while downloading https://dev.library.kiwix.org/content/survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is an 837 MB file... not exactly negligible ^^

I will open an issue in zimit / browsertrix crawler to seek advice, but I have the beginnings of a solution in mind.

I have disabled the recipe until then.

Popolechien commented 1 week ago

Yep, I saw it, but the weird part is that the 837 MB file is part of the ZIM (I can open and read it).

benoit74 commented 1 week ago

Good point. Indeed, looking at the logs more closely, it looks like it crashed right after saving the file to disk. I will mention it in the browsertrix issue.

benoit74 commented 1 day ago

The upstream issue is solved; I've requested the recipe again.

The scraper should now be able to handle even the huge ZIPs, so we need to decide whether to keep the current configuration, which includes only PDF documents (at the expense of "broken" external links for ZIP files and other formats), or to include every document (at the expense of roughly doubling the ZIM size).

benoit74 commented 1 day ago

PS: the description is ugly from my PoV (no uppercase on the first letter, barely an English sentence), and the illustration is not a link to an image but to a webpage. Also, I don't get why we provide an icon at all: are we not pleased with the website favicon, which the scraper uses by default? https://www.survivorlibrary.com/wp-content/uploads/2024/06/cropped-Librarian-90-1-192x192.jpg

Popolechien commented 1 day ago

I do not think that their favicon will render well, but I've updated it along with the description.

Out of curiosity: are these added to the file at the end of the scraping process or at the start?

benoit74 commented 6 hours ago

Favicons are added after the browsertrix crawl, at the beginning of the warc2zim conversion, between the initial search for the main page and the processing of all WARC records into ZIM records.

The dev ZIM seems to be pretty good now; would you mind reviewing it? https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com

Three things I've noticed so far:

And we need to decide about the big ZIPs: since the ZIM is already 114.9G, I really recommend just not including them (at the price of a few broken links, but that's life). Any chance we might ask the website owner to customize the page a bit so that we can hide these links?

Popolechien commented 6 hours ago

Ah, weird: your link does not work. It needs to be the full https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com/index.php/main-library-index or I just get "Content not found".

@tprice0424 Did you contact the website owner at any point? He might actually be interested in sharing the ZIM file.

tprice0424 commented 4 hours ago

@Popolechien I have not contacted the website owner; I was just working on putting this on an off-grid Raspberry Pi hotspot for use without internet.

benoit74 commented 2 hours ago

> Blog & Store: that's trivial, there's no incentive for us to remove it (and maybe someone will want to use their store)

OK (even if, for the store, it does not feel very aligned with other discussions we have had in the past about content that is barely usable or useful offline).

> Couldn't get ahold of the wp-admin pages, but as long as they do not create a security concern and do not impact UX, we should live with them

I stumbled upon one of them with the "random" functionality of kiwix-serve.

> I don't see why we should bother the owner to rearrange his collection only for our comfort

Because a very small HTML change might cost them next to nothing (and could potentially be useful for them as well). It would be a shame not to ask if we are in contact with them.

> By the look of it, each ZIP file contains all the PDF files in a given category. Including a ZIP in a ZIM is probably not too useful at this stage.

OK, I will continue to exclude them then.

I will rework the recipe to properly include the whole website except the ZIP files.
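
For clarity, the inverse of the earlier PDF-only rule, again only as a sketch of the pattern (the actual exclusion lives in the recipe configuration, not in code):

```python
import re

# Include the whole site but drop the ZIP archives.
EXCLUDE_ZIP = re.compile(r"\.zip$", re.I)

assert EXCLUDE_ZIP.search("https://www.survivorlibrary.com/library/Railroads.zip")
assert not EXCLUDE_ZIP.search("https://www.survivorlibrary.com/index.php/Railroads")
```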