openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: radiozamaneh.com #830

Open benoit74 opened 6 months ago

benoit74 commented 6 months ago

This is a subtask of #826, to track recipe progress one by one and avoid confusion.

Recipe already created here: https://farm.openzim.org/recipes/radiozamaneh.com_persian

benoit74 commented 6 months ago

The recipe failed: it produced only a 3.6 MB ZIM.

Looking into the log, it looks like only the first page (the homepage) loaded properly and all subsequent requests were blocked; at least, they all returned HTTP 400 (Bad Request) even though the same pages work online.
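A quick way to confirm this kind of pattern is to tally response codes from the crawl log. The sketch below assumes the log has already been parsed into (URL, status) pairs; the function names, the threshold, and the sample entries are hypothetical, and the real log format is not reproduced here.

```python
from collections import Counter

def tally_statuses(entries):
    """Count HTTP status codes over (url, status) pairs."""
    return Counter(status for _url, status in entries)

def looks_blocked(entries, threshold=0.9):
    """Heuristic: True if at least `threshold` of the responses
    after the first one are 4xx client errors."""
    rest = entries[1:]
    if not rest:
        return False
    errors = sum(1 for _url, status in rest if 400 <= status < 500)
    return errors / len(rest) >= threshold

# Hypothetical sample mirroring the symptom described above:
# the homepage loads, everything after it returns 400.
sample = [
    ("https://www.radiozamaneh.com/", 200),
    ("https://www.radiozamaneh.com/page-a", 400),
    ("https://www.radiozamaneh.com/page-b", 400),
]
```

The 0.9 threshold is arbitrary; it only separates "almost everything is rejected" from the occasional broken page.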

As mentioned upstream, the website is protected by Deflect.ca, which seems prone to blocking us.

@Popolechien @RavanJAltaie We should contact the website owner (via our contacts) to check whether our ondemand worker could be whitelisted (public IPs are 92.243.27.71 and 2001:4b98:dc0:43:f816:3eff:fe32:84fc/64).
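To make clear what those two entries cover, here is a minimal sketch using Python's standard `ipaddress` module. It only illustrates which client addresses fall inside the ranges quoted above; it is not how Deflect configures allowlists, and `is_allowed` is a hypothetical helper name.

```python
import ipaddress

# Worker addresses as given in this comment. strict=False is needed for the
# IPv6 value because it has host bits set; it resolves to the enclosing /64.
ALLOWED = [
    ipaddress.ip_network("92.243.27.71/32"),
    ipaddress.ip_network("2001:4b98:dc0:43:f816:3eff:fe32:84fc/64", strict=False),
]

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr.version == net.version and addr in net for net in ALLOWED)
```

Note that the IPv4 entry is a single host while the IPv6 entry covers the whole `2001:4b98:dc0:43::/64` prefix.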

benoit74 commented 6 months ago

FYI, the newest browsertrix crawler (1.0.0-beta.3) seems to be less affected by the blocking. I wonder if we should update the zimit2 image to use this new crawler even though it is still in beta.

RavanJAltaie commented 6 months ago

@benoit74 I'll discuss this with Stephane today.

kelson42 commented 5 months ago

@RavanJAltaie @Popolechien Any feedback on @benoit74's whitelisting request?

benoit74 commented 5 months ago

@kelson42 we decided not to whitelist for now; it looks like it might not be needed with the new browsertrix crawler 1.0. The task has been running for 10 days and is almost complete.

benoit74 commented 3 months ago

Just uploaded a new WARC, which is supposed to be complete, at https://tmp.kiwix.org/ci/test-warc/radiozamaneh.com_2024-05-14/radiozamaneh_20240514.tar

benoit74 commented 3 months ago

Custom CSS is ready at https://drive.farm.openzim.org/zimit_custom_css/www.radiozamaneh.com.css

benoit74 commented 3 months ago

The WARC seems to be pretty good: conversion to ZIM found "only" 1866 unique broken links on the www.radiozamaneh.com domain. I checked a few of them (most follow the same pattern) and they are all broken on the source website as well.

This could be either a rewriting error (HTML source code not properly interpreted; unlikely, since there are too few items from my PoV) or real issues in the source website (more likely).
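The manual spot check can be sketched as a small triage step: for each link reported as broken in the ZIM, look at its status on the live site. The function below is a hypothetical helper, not part of zimit; `source_status` is an assumed callable that returns the HTTP status for a URL on the source website (in practice it would wrap an HTTP client).

```python
from typing import Callable, Iterable

def triage_broken_links(
    broken_in_zim: Iterable[str],
    source_status: Callable[[str], int],
) -> tuple[list[str], list[str]]:
    """Split links into (also broken at source, working at source).

    Links that work at the source but are broken in the ZIM are the
    ones that point to a possible rewriting error.
    """
    broken_upstream: list[str] = []
    rewrite_suspects: list[str] = []
    for url in broken_in_zim:
        if source_status(url) >= 400:
            broken_upstream.append(url)
        else:
            rewrite_suspects.append(url)
    return broken_upstream, rewrite_suspects
```

If the second list stays empty, as in the sample checked here, the problem is on the source website rather than in the rewriting.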

I'm currently re-running the ZIM creation with the custom CSS.

benoit74 commented 3 months ago

New zimit2 ZIM is available at https://dev.library.kiwix.org/viewer#radiozamaneh-com_far_all_2024-05/www.radiozamaneh.com/ or searchable with https://dev.library.kiwix.org/#lang=&q=%D9%85%D8%B3%D8%AA%D9%82%D9%84
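The search link carries its query as a percent-encoded UTF-8 string in the URL fragment. A minimal sketch for reading it back with the standard library (`search_term` is a hypothetical helper name):

```python
from urllib.parse import urlsplit, parse_qs

def search_term(library_url: str) -> str:
    """Extract and decode the `q` parameter stored in the URL fragment."""
    fragment = urlsplit(library_url).fragment  # e.g. "lang=&q=%D9%85..."
    params = parse_qs(fragment)                # percent-decoding happens here
    return params["q"][0]

url = "https://dev.library.kiwix.org/#lang=&q=%D9%85%D8%B3%D8%AA%D9%82%D9%84"
print(search_term(url))  # prints: مستقل
```

`parse_qs` drops the empty `lang=` parameter by default, which is fine here since only `q` is needed.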

benoit74 commented 3 months ago

I just found two new issues:

[image]

[image]

Popolechien commented 3 months ago

Ah, I was going to say that it looks pretty good to me.

benoit74 commented 3 months ago

It is still pretty good from my PoV ^^

benoit74 commented 3 months ago

A new ZIM is currently being built at https://farm.openzim.org/pipeline/f3908653-bff1-407f-95b0-4c2f698d3bd6 with the latest scraper version and custom CSS; it is expected to produce an adequate ZIM end-to-end this time.

benoit74 commented 3 months ago

Looks like it succeeded in producing a good ZIM. @Popolechien please review and transfer to the client if you are happy as well, or speak up about remaining issues needing a fix:

https://dev.library.kiwix.org/#lang=&q=%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4%DA%AF%D8%B1%DB%8C%D9%90

benoit74 commented 2 months ago

As mentioned in https://github.com/openzim/zimit/issues/339, (some) videos seem not to work in the Chrome browser.

benoit74 commented 1 month ago

@Popolechien I am starting to see errors in Zimfarm logs linked to Cloudflare blocking some requests. Can we contact the website owner to be whitelisted, just like we did for iranwire?

Popolechien commented 1 month ago

Let me ask.

benoit74 commented 1 day ago

ZIM is ready in dev library, moved to prod