openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: Plants for a Future #745

Open Samandamand opened 1 year ago

Samandamand commented 1 year ago
RavanJAltaie commented 1 year ago

Recipe created: https://farm.openzim.org/recipes/pfaf.org_en_all. I will update the library link once it is ready.

RavanJAltaie commented 12 months ago

The recipe has a problem; we are troubleshooting it here: https://github.com/openzim/zim-requests/issues/1009

Samandamand commented 7 months ago

I don't know if this is any help, but this person seems to have scraped the data from PFAF.org: https://github.com/saulshanabrook/pfaf-data

RavanJAltaie commented 6 months ago

@benoit74 Does this help us make progress on our case? Our problem was that the website is using Cloudflare.

benoit74 commented 6 months ago

Yes, this is great news because it means we could definitely create a custom scraper. But as you know, this is a significant effort.

I suggest waiting for zimit2 and checking what happens with this new scraper version; maybe it will work much better with Cloudflare (it has proved to in some cases).

If it doesn't work better, then we will tag this issue "scraper needed" and wait for funding / an external contributor / ...

benoit74 commented 5 months ago

I tried to run this again with Zimit2 and Crawler 1.x

Task is at https://farm.openzim.org/pipeline/7a856c5a-6dd4-459e-98b1-7f3fae593ab1

ZIM is at https://dev.library.kiwix.org/viewer#pfaf.org_en_all_2024-05 or https://dev.library.kiwix.org/#lang=eng&q=pfaf or https://mirror.download.kiwix.org/zim/.hidden/dev/pfaf.org_en_all_2024-05.zim

This time the crawling worked, but it made me realize that the website is highly dynamic and mainly centered around a search engine. Hence we cannot scrape all plants successfully, and the UI is mostly broken since the search relies on an online server which won't work inside the ZIM (EDIT: with the zimit scraper).

Since it looks like we might benefit from another source of information to retrieve the DB, I propose tagging this "scraper_needed" and waiting (this could take a long time; the current lead time is ... years) for someone to create a scraper for this website, either by using the existing GitHub repo or by crawling the website with specific code.

benoit74 commented 5 months ago

Edit to my previous comment: the limitation on the search-centered UI concerns only the zimit scraper, not a custom-made scraper.

Samandamand commented 6 days ago

Is it possible to zim the data from here instead since they already did the scraping? https://lite.datasette.io/?url=https://saulshanabrook.github.io/pfaf-data/data.sqlite#/data/plant_data

benoit74 commented 5 days ago

Indeed, a custom scraper could benefit from using the https://saulshanabrook.github.io/pfaf-data/ dataset.

benoit74 commented 5 days ago

But we cannot ZIM the URL you've provided with Zimit; unfortunately, it is not going to work. Maybe setting up a simple website that works only client-side with the SQLite database from https://saulshanabrook.github.io/pfaf-data/ and then running zimit on it would be a quick win.
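
As a rough illustration of the simple-website idea, here is a minimal Python sketch. Note that it is a variant of the suggestion, not the suggestion itself: instead of querying the database client-side, it pre-renders one static HTML page per row, so a crawler would only have to follow plain links. It assumes the SQLite file has already been downloaded locally and that the table is named `plant_data` (taken from the datasette URL earlier in this thread). The function name, output directory and page file names are hypothetical, and column names are read at runtime rather than assumed.

```python
import html
import sqlite3
from pathlib import Path


def export_table_to_html(db_path: str, table: str, out_dir: str) -> int:
    """Write one static HTML page per row of `table`, plus an index page.

    Hypothetical helper for illustration only. Returns the number of rows exported.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us read column names from each row
    try:
        # A table name cannot be bound as a parameter, so quote it as an identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        rows = conn.execute(f"SELECT * FROM {quoted}").fetchall()
    finally:
        conn.close()

    links = []
    for i, row in enumerate(rows, start=1):
        # Column names are discovered at runtime; none are assumed here.
        cells = "\n".join(
            "<tr><th>{}</th><td>{}</td></tr>".format(
                html.escape(str(key)), html.escape(str(row[key]))
            )
            for key in row.keys()
        )
        name = f"row_{i}.html"
        (out / name).write_text(
            f"<!doctype html>\n<html><body><table>\n{cells}\n</table></body></html>\n",
            encoding="utf-8",
        )
        links.append(f'<li><a href="{name}">Row {i}</a></li>')

    # The index page links to every row page so a crawler can reach them all.
    index_body = "\n".join(links)
    (out / "index.html").write_text(
        f"<!doctype html>\n<html><body><ul>\n{index_body}\n</ul></body></html>\n",
        encoding="utf-8",
    )
    return len(rows)
```

Calling `export_table_to_html("data.sqlite", "plant_data", "site")` would write `site/index.html` and one `row_N.html` per row. The resulting folder would still need to be hosted somewhere before zimit could be pointed at it, and a real scraper would want readable titles and a search page on top of this.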