Open Samandamand opened 1 year ago
Recipe created https://farm.openzim.org/recipes/pfaf.org_en_all Will update the library link once ready
The recipe has a problem, we are troubleshooting here https://github.com/openzim/zim-requests/issues/1009
I don't know if this is any help but this person seems to have scraped the data from PFAF.org https://github.com/saulshanabrook/pfaf-data
@benoit74 Does this help our case? Our problem was that the website uses Cloudflare.
Yes, this is great news because it means we could definitely create a custom scraper. But as you know, this is a significant effort.
I suggest waiting for zimit2 and checking what happens with this new scraper version; maybe it will work much better with Cloudflare (it has proved to in some cases).
If it doesn't work better, then we will tag this issue "scraper needed" and wait for funding / an external contributor / ...
I tried to run this again with Zimit2 and Crawler 1.x
Task is at https://farm.openzim.org/pipeline/7a856c5a-6dd4-459e-98b1-7f3fae593ab1
ZIM is at https://dev.library.kiwix.org/viewer#pfaf.org_en_all_2024-05 or https://dev.library.kiwix.org/#lang=eng&q=pfaf or https://mirror.download.kiwix.org/zim/.hidden/dev/pfaf.org_en_all_2024-05.zim
This time the crawl worked, but it made me realize that the website is highly dynamic and mainly centered around a search engine. Hence we cannot scrape all plants successfully, and the UI is mostly broken, since the search relies on an online server which won't work inside the ZIM (EDIT: with the zimit scraper).
Since it looks like we might benefit from another source of information to retrieve the DB, I propose to tag this "scraper_needed" and wait (this could take a long time; current lead time is ... years) for someone to create a scraper for this website, either by using the existing GitHub repo or by crawling the website with specific code.
Edit to my previous comment: the limitation on the search-centered UI concerns only the zimit scraper, not a custom-made scraper.
Is it possible to zim the data from here instead since they already did the scraping? https://lite.datasette.io/?url=https://saulshanabrook.github.io/pfaf-data/data.sqlite#/data/plant_data
Indeed, a custom scraper could benefit from using the https://saulshanabrook.github.io/pfaf-data/ dataset.
But unfortunately we cannot ZIM the URL you've provided with Zimit; it is not going to work. Maybe setting up a simple website that works only client-side with the SQLite database from https://saulshanabrook.github.io/pfaf-data/ and then running zimit on it would be a quick win.
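To make the "simple website" idea more concrete, here is a minimal sketch of a variation on it: instead of querying the SQLite file in the browser, it pre-renders one static HTML page per row, which leaves nothing dynamic for a crawler to miss. This is only an illustration under assumptions, not an existing scraper. The function name `render_static_pages` is hypothetical, the table name `plant_data` is taken from the Datasette Lite URL above, no column names are assumed (they are read from the database), and using the first column as the page title is a guess about the schema.

```python
import html
import sqlite3
from pathlib import Path


def render_static_pages(db_path, out_dir, table="plant_data"):
    """Write one HTML page per row of `table`, plus an index.html linking to them.

    Hypothetical helper: the table name default comes from the Datasette URL,
    and the first column is assumed to be a usable label for each plant.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us read column names from each row
    try:
        # A table name cannot be bound as a parameter, so quote it as an identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        rows = conn.execute(f"SELECT * FROM {quoted}").fetchall()
    finally:
        conn.close()

    def text(value):
        # Render NULL as an empty cell rather than the string "None".
        return "" if value is None else str(value)

    links = []
    for i, row in enumerate(rows, start=1):
        keys = row.keys()
        title = html.escape(text(row[keys[0]]))
        cells = "".join(
            f"<tr><th>{html.escape(k)}</th><td>{html.escape(text(row[k]))}</td></tr>"
            for k in keys
        )
        name = f"plant-{i}.html"
        (out / name).write_text(
            f"<!doctype html><title>{title}</title>"
            f"<h1>{title}</h1><table>{cells}</table>",
            encoding="utf-8",
        )
        links.append(f'<li><a href="{name}">{title}</a></li>')

    (out / "index.html").write_text(
        "<!doctype html><title>Plants</title><ul>" + "".join(links) + "</ul>",
        encoding="utf-8",
    )
    return len(rows)
```

The generated directory would still need to be served over HTTP somewhere before a crawler could be pointed at it, and a plain link index is of course much poorer than the search UI of the original site.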