Open benoit74 opened 1 week ago
Edit: followers were in fact already excluded
Last task succeeded in 5 days, 20 hours, 60 minutes. There are 215265 records. ZIM is viewable at https://dev.library.kiwix.org/viewer#cheatography.com_en_all_2024-06/
Of these 215265 records, 69575 could be saved with a fuzzy rule that ignores the timestamp in cheatography.com/scripts/useful.min.js?v=2&q=1719224193 (all requests have a different q value but return the same JS, so we duplicated the same JS file 69575 times).
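To illustrate the idea, here is a minimal Python sketch of what such a fuzzy rule does: it maps every variant of the script URL to one canonical key by dropping the q value. This is not the scraper's actual rule syntax. The regex, the `canonical_url` helper, and the second q value are assumptions made up for the example.

```python
import re

# Hypothetical rule: capture the script URL up to the version parameter
# and discard the varying "q" timestamp that follows it.
FUZZY_RULE = re.compile(
    r"^((?:https?://)?cheatography\.com/scripts/useful\.min\.js\?v=\d+)&q=\d+$"
)


def canonical_url(url: str) -> str:
    """Return the URL without its q value, or the URL unchanged if the rule does not match."""
    match = FUZZY_RULE.match(url)
    return match.group(1) if match else url


urls = [
    "cheatography.com/scripts/useful.min.js?v=2&q=1719224193",
    "cheatography.com/scripts/useful.min.js?v=2&q=1719224194",  # made-up q value
    "cheatography.com/sparkledaisy/cheat-sheets/ethics-dsst/",
]

# Both script requests collapse to a single key, so the JS would be stored once.
unique_keys = {canonical_url(u) for u in urls}
```

With a rule like this, all requests for the script resolve to the same record regardless of q, while unrelated URLs are left untouched.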
We have something like 68 YouTube videos included, see e.g. https://cheatography.com/sparkledaisy/cheat-sheets/ethics-dsst/. Not sure we really want to grab these videos; that is an open question.
A custom CSS to hide all the things highlighted in grew below would help a lot:
Do you see anything else to fix/hide? Note: I've disabled the recipe for now. I will create the custom CSS once we're on the same page, and will probably run the next task from artifacts to avoid scraping the website again.
Do you see anything else to fix/hide?
Nope. LGTM
I went through the file in Dev library, LGTM as well.
Recipe URL
https://farm.openzim.org/recipes/cheatography.com_en_all
Task URL
https://farm.openzim.org/pipeline/0dcc313a-243c-4431-8839-372500323f28
Details
The crawler spends ages crawling all member profiles, including a great many members who have no contributions and nothing interesting in their profile.
I've reconfigured the recipe to filter out member pages https://cheatography.com/members/ (and all subpages) and followers https://cheatography.com/<username>/followers/. With this I expect we will only scrape members who have submitted something to the site (at least one comment on one cheat sheet). I hope it will make the crawl finish much faster.
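As a rough sketch of the filtering described above, the two URL patterns can be expressed as one exclusion regex. The recipe's actual exclusion setting is not shown in this thread, so the pattern, the `is_excluded` helper, and the `page/2/` subpage below are assumptions for illustration only.

```python
import re

# Hypothetical exclusion pattern covering:
#   https://cheatography.com/members/ (and all subpages)
#   https://cheatography.com/<username>/followers/ (and all subpages)
EXCLUDE = re.compile(
    r"^https://cheatography\.com/(?:members(?:/.*)?|[^/]+/followers(?:/.*)?)$"
)


def is_excluded(url: str) -> bool:
    """Return True if the crawler should skip this URL under the sketch pattern."""
    return EXCLUDE.match(url) is not None


candidates = [
    "https://cheatography.com/members/",
    "https://cheatography.com/members/page/2/",  # made-up subpage
    "https://cheatography.com/sparkledaisy/followers/",
    "https://cheatography.com/sparkledaisy/cheat-sheets/ethics-dsst/",
]

# Only the cheat sheet URL survives the filter.
kept = [u for u in candidates if not is_excluded(u)]
```

Profile pages themselves are not excluded here. They would only be reached through links from cheat sheets or comments, which matches the expectation that only contributing members get scraped.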