Open benoit74 opened 6 months ago
The recipe failed: it produced only a 3.6 MB ZIM.
According to the log, only the first page (the homepage) loaded properly and all subsequent requests were blocked; at least, they all returned an HTTP 400 error (Bad Request) although the same URLs work online.
As mentioned upstream, the website is protected by Deflect.ca, which seems to be prone to blocking us.
@Popolechien @RavanJAltaie We should contact the website owner (via our contacts) to check whether our ondemand worker could be whitelisted (public IPs are 92.243.27.71 and 2001:4b98:dc0:43:f816:3eff:fe32:84fc/64).
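For whoever ends up configuring the allowlist, here is a minimal sketch of what matching against these two entries could look like. It uses only Python's standard `ipaddress` module; the names `WORKER_NETWORKS` and `is_worker_address` are hypothetical, and treating the IPv4 address as a single host (/32) is an assumption, since only the IPv6 entry was given with a prefix length.

```python
import ipaddress

# Addresses quoted in this issue for the ondemand worker.
# Assumption: the IPv4 entry is a single host (/32).
# strict=False is required because the IPv6 entry has host bits set
# alongside its /64 prefix; it normalizes to 2001:4b98:dc0:43::/64.
WORKER_NETWORKS = [
    ipaddress.ip_network("92.243.27.71/32"),
    ipaddress.ip_network("2001:4b98:dc0:43:f816:3eff:fe32:84fc/64", strict=False),
]


def is_worker_address(addr: str) -> bool:
    """Return True if addr falls inside one of the worker networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip.version == net.version and ip in net for net in WORKER_NETWORKS)
```

Note that allowing the whole /64 covers every address in that IPv6 subnet, not only the worker's current one.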
FYI, the newest browsertrix crawler (1.0.0-beta.3) seems to be less impacted by the situation; I wonder if we should update the zimit2 image to use this new crawler even though it is still in beta.
@benoit74 I'll discuss this with Stephane today.
@RavanJAltaie @Popolechien Any feedback about @benoit74's request of whitelisting?
@kelson42 we decided not to whitelist for now; it looks like it might not be needed with the new browsertrix crawler 1.0. The task has been running for 10 days and is almost complete.
Just uploaded a new WARC which is supposed to be complete at https://tmp.kiwix.org/ci/test-warc/radiozamaneh.com_2024-05-14/radiozamaneh_20240514.tar
Custom CSS is ready at https://drive.farm.openzim.org/zimit_custom_css/www.radiozamaneh.com.css
The WARC seems to be pretty good: conversion to ZIM found "only" 1866 unique broken links on the www.radiozamaneh.com domain (I checked a few of them, and most follow the same pattern), and they are all broken on the source website as well.
Could be either a rewriting error (HTML source code not properly interpreted, not likely, too few items from my PoV) or real issues in source website (more likely).
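The manual check described above (is a link reported as broken also broken on the source website?) could be automated roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, it is not part of zimit, and it treats only 404/410 as "really broken on source", because other 4xx codes (such as the HTTP 400 seen earlier) may just be the site's protection blocking the request.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Raised for 4xx/5xx responses; the status code is still available.
        return exc.code
    except OSError:
        # Covers URLError, timeouts and other connection problems.
        return 0


def triage_broken_links(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, List[str]]:
    """Sort links reported as broken in the ZIM by their status on the source."""
    result: Dict[str, List[str]] = {
        "broken_on_source": [],   # 404/410 online: real issue in the website
        "suspect_rewriting": [],  # works online: possibly a rewriting error
        "unknown": [],            # blocked, server error or unreachable
    }
    for url in urls:
        status = fetch_status(url)
        if status in (404, 410):
            result["broken_on_source"].append(url)
        elif 200 <= status < 400:
            result["suspect_rewriting"].append(url)
        else:
            result["unknown"].append(url)
    return result
```

Passing `fetch_status` as a parameter keeps the triage logic testable without hitting the website, which matters here given how easily the site blocks automated requests.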
I'm currently re-running the ZIM creation with the custom CSS.
New zimit2 ZIM is available at https://dev.library.kiwix.org/viewer#radiozamaneh-com_far_all_2024-05/www.radiozamaneh.com/ or searchable with https://dev.library.kiwix.org/#lang=&q=%D9%85%D8%B3%D8%AA%D9%82%D9%84
I just found two new issues:
Ah, I was going to say that it looks pretty good to me.
It is still pretty good from my PoV ^^
A new ZIM is currently being built at https://farm.openzim.org/pipeline/f3908653-bff1-407f-95b0-4c2f698d3bd6 with the latest scraper version and the custom CSS; I expect it to produce an adequate ZIM end-to-end this time.
It looks like it succeeded in producing a good ZIM. @Popolechien please review and transfer it to the client if you are happy as well, or speak up about any remaining issues that need a fix:
https://dev.library.kiwix.org/#lang=&q=%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4%DA%AF%D8%B1%DB%8C%D9%90
As mentioned in https://github.com/openzim/zimit/issues/339, (some) videos do not seem to work in the Chrome browser.
@Popolechien I am starting to see errors in the Zimfarm logs linked to Cloudflare blocking some requests. Can we contact the website owner to be whitelisted, just like we did for iranwire?
Let me ask.
The ZIM is ready in the dev library and has been moved to prod.
This is a subtask of #826, to track recipe progress one by one and avoid confusion.
Recipe already created here: https://farm.openzim.org/recipes/radiozamaneh.com_persian