openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Last task of cheatography.com takes too long #1056

Open benoit74 opened 1 week ago

benoit74 commented 1 week ago

Recipe URL

https://farm.openzim.org/recipes/cheatography.com_en_all

Task URL

https://farm.openzim.org/pipeline/0dcc313a-243c-4431-8839-372500323f28

Details

The crawler spends ages crawling all member profiles, including a great many members who have no contributions and nothing interesting in their profile.

I've reconfigured the recipe to filter out member pages https://cheatography.com/members/ (and all subpages) and followers https://cheatography.com/<username>/followers/. With this I expect we will only scrape members who have submitted something to the site (at least one comment on one cheat sheet). I hope it will make the crawl finish much faster.
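To illustrate the kind of URL filtering described above, here is a minimal sketch in Python. The regular expressions and the `is_excluded` helper are assumptions made for illustration only; the actual recipe's exclusion settings may be written differently.

```python
import re

# Hypothetical exclusion patterns, for illustration only.
# The real recipe configuration may use different expressions.
EXCLUDE_PATTERNS = [
    # Member listing and all of its subpages
    re.compile(r"^https?://cheatography\.com/members(/.*)?$"),
    # Per-user followers page: /<username>/followers/
    re.compile(r"^https?://cheatography\.com/[^/]+/followers/?$"),
]


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any exclusion pattern."""
    return any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    for url in (
        "https://cheatography.com/members/",
        "https://cheatography.com/someuser/followers/",
        "https://cheatography.com/sparkledaisy/cheat-sheets/ethics-dsst/",
    ):
        print(url, "->", "excluded" if is_excluded(url) else "kept")
```

Profiles of members who contributed something would still be reached through links from the cheat sheet pages themselves, which is why only the listing pages need to be excluded.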

benoit74 commented 1 week ago

Edit: followers were in fact already excluded

benoit74 commented 2 days ago

Last task succeeded in 5 days, 20 hours, 60 minutes. There are 215265 records. ZIM is viewable at https://dev.library.kiwix.org/viewer#cheatography.com_en_all_2024-06/

Of these 215265 records, 69575 (about 32%) could be saved with a fuzzy rule that ignores the timestamp in cheatography.com/scripts/useful.min.js?v=2&q=1719224193 (every request has a different q value but returns the same JS, so we stored the same JS file 69575 times).
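The idea behind such a rule can be sketched as follows: strip the changing `q` parameter so that all variants of the script URL collapse to a single key. This is an illustrative sketch only, not the actual fuzzy rule syntax used by the scraper; the `normalize` helper and the second and third `q` values are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url: str, ignored=("q",)) -> str:
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# The first URL comes from the issue; the other q values are hypothetical.
urls = [
    "https://cheatography.com/scripts/useful.min.js?v=2&q=1719224193",
    "https://cheatography.com/scripts/useful.min.js?v=2&q=1719224200",
    "https://cheatography.com/scripts/useful.min.js?v=2&q=1719224317",
]

unique = {normalize(u) for u in urls}

if __name__ == "__main__":
    print(f"{len(urls)} requests -> {len(unique)} stored record(s)")
```

Keeping the `v` parameter is deliberate in this sketch: it looks like a version marker, so only the per-request `q` value is treated as noise.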

We have something like 68 YouTube videos included, see e.g. https://cheatography.com/sparkledaisy/cheat-sheets/ethics-dsst/. Not sure we really want to grab these videos, good question.

A custom CSS to hide all the things highlighted in grey below would help a lot:

[Four annotated screenshots: cheatography com-1-amended to cheatography com-4-amended]

Do you see anything else to fix/hide? Note: I've disabled the recipe for now. I will create the custom CSS once we're on the same page, and will probably run the next task from artifacts to avoid scraping the website again.

Popolechien commented 2 days ago

> Do you see anything else to fix/hide?

Nope. LGTM

RavanJAltaie commented 20 hours ago

I went through the file in the dev library, LGTM as well.