tprice0424 opened 1 month ago
https://www.survivorlibrary.com/index.php/main-library-index/ — this is the actual part I am interested in: it's a list of different subjects, each packed full of PDF downloads.
Recipe created https://farm.openzim.org/recipes/survivorlibrary.com_en_all I'll update the library link once ready
Thank ya so much
Will this be available to download and add to my library?
https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-08 The PDFs are not scraped properly. @benoit74 could you please check that?
I'm looking forward to this one as well, please. In fact, this is a duplicate request of #590 .
It seems only the recently added PDFs were scraped and the rest of the library was untouched. There was an older scrape that was over 3x larger, so presumably that one gathered the whole collection.
Where can I find that by chance?
Here. When you download it, though, it's only 7 GB despite stating 33 GB. The point I was trying to make was that the 7 GB version is missing quite a bit of data.
Thanks. I am just running some tests on it; this will be good information for when they get the whole thing, whenever it gets figured out.
Crawl has been interrupted due to a browser crash:
{"timestamp":"2024-08-14T21:28:02.062Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.survivorlibrary.com/library/Railroads.zip"}}
{"timestamp":"2024-08-14T21:28:02.062Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":488,"total":14749,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-08-14T21:28:01.922Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.survivorlibrary.com\\/library\\/Railroads.zip\",\"added\":\"2024-08-14T21:03:16.088Z\",\"depth\":2}"]}}
{"timestamp":"2024-08-14T21:28:32.074Z","logLevel":"warn","context":"fetch","message":"Direct fetch capture attempt timed out","details":{"seconds":30,"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:28:32.074Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:20.736Z","logLevel":"warn","context":"recorder","message":"Large streamed written to WARC, but not returned to browser, requires reading into memory","details":{"url":"https://www.survivorlibrary.com/library/Railroads.zip","actualSize":2181241056,"maxSize":5000000}}
{"timestamp":"2024-08-14T21:29:33.279Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-08-14T21:29:33.279Z","logLevel":"warn","context":"recorder","message":"Failed to load response body","details":{"url":"https://www.survivorlibrary.com/library/Railroads.zip","networkId":"C2F5ACEE365BF893AECC1FBDA2AB8275","type":"exception","message":"Protocol error (Fetch.getResponseBody): Target closed","stack":"TargetCloseError: Protocol error (Fetch.getResponseBody): Target closed\n at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)\n at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)\n at #onClose (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:163:21)\n at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:43:30)\n at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n at WebSocket.onClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:220:9)\n at WebSocket.emit (node:events:519:28)\n at WebSocket.emitClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:272:10)\n at Socket.socketOnClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1341:15)\n at Socket.emit (node:events:519:28)","page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:33.280Z","logLevel":"error","context":"general","message":"Page Load Failed, skipping page","details":{"msg":"Protocol error (Page.navigate): Target closed","loadState":0,"page":"https://www.survivorlibrary.com/library/Railroads.zip","workerid":0}}
{"timestamp":"2024-08-14T21:29:33.345Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-08-14T21:30:35.191Z","logLevel":"info","context":"writer","message":"Rollover size exceeded, creating new WARC","details":{"size":2359264272,"oldFilename":"rec-fb1bfb4f4017-20240814212746176-0.warc.gz","newFilename":"rec-fb1bfb4f4017-20240814213035191-0.warc.gz","rolloverSize":1000000000,"id":"0"}}
{"timestamp":"2024-08-14T21:30:35.340Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmpxh8drw0d/collections/crawl-20240814210212921/crawls/crawl-20240814213035-fb1bfb4f4017.yaml","details":{}}
{"timestamp":"2024-08-14T21:30:35.349Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":489,"total":14749,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-08-14T21:30:35.350Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-08-14T21:30:35.351Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
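Since browsertrix-crawler emits its log as JSON Lines (one JSON object per line), failures like the ones above can be triaged with a short script. This is just a sketch for anyone digging through similar logs; the sample records below are abbreviated versions of the ones quoted above:

```python
import json

def failing_entries(lines):
    """Return (timestamp, context, message) for every warn/error record."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise mixed into the log
        if record.get("logLevel") in ("warn", "error"):
            out.append((record["timestamp"], record["context"], record["message"]))
    return out

if __name__ == "__main__":
    sample = [
        '{"timestamp":"2024-08-14T21:29:33.279Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}',
        '{"timestamp":"2024-08-14T21:30:35.350Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}',
    ]
    for ts, ctx, msg in failing_entries(sample):
        print(ts, ctx, msg)
```

Filtering on `logLevel` surfaces the "Large streamed written to WARC" warning and the browser crash immediately, without scrolling through thousands of info records.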
It was downloading https://www.survivorlibrary.com/library/Railroads.zip, which looks like a big ZIP of all the Railroads PDFs. I don't know where the crawler finds this link, but it does find it. Does anyone know on which page this link is displayed?
Since I don't think these ZIPs are very helpful in a ZIM (the individual PDFs provide much more value), I've reconfigured the recipe to only include PDFs. Let's wait and see how it goes: https://farm.openzim.org/pipeline/5c994a30-e608-4bf5-9837-e67f7b7c9819
The .zip is at https://www.survivorlibrary.com/index.php/Railroads (the railroad category). All the other files there are PDFs, but I suspect there may be a couple of other ZIP files here and there, seeing as the library is so big.
Oh, I got misled by the fact that the link says "PDF" ...
Looking at the last iteration, https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09 — files seem to have been scraped up to Farming, and within that page things stopped working from Corn Culture 1910 (the 9th item in the list) onwards, which is not an unusually large document.
Looking at the log, browsertrix crawler failed while downloading https://dev.library.kiwix.org/content/survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is an 837 MB file ... not exactly negligible ^^
I will open an issue in zimit / browsertrix crawler to seek advice, but I have the beginnings of a solution in mind.
I have disabled the recipe until then.
Yep, I saw it, but the weird part is that the 837 MB file is part of the ZIM (I can open and read it).
Good point indeed; looking more closely at the logs, it looks like it crashed right after saving the file to disk. I will mention this in the browsertrix issue.
The upstream issue is solved; I've requested the recipe again.
The scraper is now supposed to be able to scrape even the huge ZIPs, so we will need to decide whether we keep the current configuration, which includes only PDF documents (at the expense of "broken" external links for ZIP files and other formats), or include every document (at the expense of more or less doubling the ZIM size).
PS: the description is ugly from my PoV (no uppercase on the first letter, barely an English sentence) and the illustration is not a link to an image but to a webpage. Also, I don't get why we provide an icon at all: are we not pleased with the website favicon, which the scraper uses by default? https://www.survivorlibrary.com/wp-content/uploads/2024/06/cropped-Librarian-90-1-192x192.jpg
I do not think that their favicon will render well, but I've updated it along with the description.
Out of curiosity: are these added to the file at the end of the scraping process or at the start?
Favicons are added after the crawl by browsertrix, but at the beginning of the warc2zim conversion, between the initial search for the main page and the processing of all WARC records into ZIM records.
The dev ZIM seems to be pretty good now; would you mind reviewing it? https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com
Three things I've noticed so far: wp-admin/... pages have been grabbed, and we are obviously not interested in these pages at all. And we need to decide about the big ZIPs: since the ZIM is already 114.9 GB, I really recommend just not including them (at the price of a few broken links, but that's life). Any chance we might ask the website owner to customize the page a bit so that we can hide these links?
Ah weird, your link does not work and needs to be the full https://dev.library.kiwix.org/viewer#survivorlibrary.com_en_all_2024-09/www.survivorlibrary.com/index.php/main-library-index — otherwise I just get "Content not found".
@tprice0424 Did you contact the website owner at any point? He might actually be interested in sharing the ZIM file.
@Popolechien I have not contacted the website owner; I was just working on putting this on an off-grid Raspberry Pi hotspot for use without internet.
Blog & Store: that's trivial; there's no incentive for us to remove them (and maybe someone will want to use their store).
OK (even if, for the store, it does not feel very aligned with other discussions we have had in the past about content that is barely usable or useful when offline).
I couldn't get hold of the wp-admin pages, but as long as they do not create a security concern or impact UX, we should live with them.
I stumbled upon one of them with the "random" functionality of kiwix-serve.
I don't see why we should bother the owner to rearrange his collection only for our comfort
Because a very small HTML change might cost them almost nothing (and could potentially be useful for them as well). It would be sad not to ask if we are in contact with them.
By the look of it, each ZIP file contains all the PDF files in a given category. Including a ZIP in a ZIM is probably not too useful at this stage.
OK, will continue to exclude them then.
I will rework the recipe to properly include the whole website except ZIP files.
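For illustration, the exclusions discussed in this thread could be expressed as URL regexes of roughly this shape. The exact patterns, and the assumption that they would be fed to the crawler's URL-exclusion option (e.g. browsertrix-crawler's --exclude), are my guesses, not the actual recipe configuration:

```python
import re

# Hypothetical exclusion patterns; the real recipe may use different ones.
EXCLUDE_PATTERNS = [
    r".*\.zip$",        # category-wide ZIP bundles (they duplicate the individual PDFs)
    r".*/wp-admin/.*",  # WordPress admin pages, useless offline
]

def is_excluded(url):
    """True if the URL matches any exclusion pattern (re.match anchors at the start)."""
    return any(re.match(pattern, url) for pattern in EXCLUDE_PATTERNS)
```

For example, `is_excluded("https://www.survivorlibrary.com/library/Railroads.zip")` would be true, while the individual PDF URLs would still be crawled.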