openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: womenshistory.si.edu #1121

Open benoit74 opened 2 months ago

benoit74 commented 2 months ago

This is a "fake" new request to track the fact that we (@kelson42 and I) are building a demo zimit ZIM of womenshistory.si.edu; this issue will also help track the choices we've made in the recipe configuration.

benoit74 commented 2 months ago

Recipe URL : https://farm.openzim.org/recipes/womenshistory.si.edu

Exclude so far: https?:\/\/womenshistory\.si\.edu(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq) ; goal is to exclude:
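
A quick way to sanity-check the exclude pattern above is to run it against a few URLs. This is only a sketch: it assumes the crawler applies the regex as an unanchored search on the full URL, and the excluded sample URLs are made up for illustration (the two kept ones come from this thread).

```python
import re

# Exclude pattern copied verbatim from the recipe comment above.
EXCLUDE = re.compile(
    r"https?:\/\/womenshistory\.si\.edu"
    r"(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq)"
)


def is_excluded(url: str) -> bool:
    """Return True if the exclude pattern matches anywhere in the URL.

    Assumption: the crawler uses an unanchored match, like re.search.
    """
    return EXCLUDE.search(url) is not None


# Hypothetical sample URLs, plus two real pages mentioned in this thread.
SAMPLES = [
    "https://womenshistory.si.edu/es/",
    "https://womenshistory.si.edu/contact-us",
    "https://womenshistory.si.edu/search?q=olympics",
    "https://womenshistory.si.edu/object/example",
    "https://womenshistory.si.edu/collection?edan_fq=example",
    "https://womenshistory.si.edu/events/mysterious-mrs-nixon",
    "https://womenshistory.si.edu/blog/gold-standard-how-these-iconic-olympic-athletes-inspired-and-united-us",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print("EXCLUDE" if is_excluded(url) else "keep   ", url)
```

Note that `\/es\/` only matches with the trailing slash, while `\/object` and `\/search` match as path prefixes, so any path that merely starts with those strings is excluded too.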

benoit74 commented 2 months ago

ZIM seems to be pretty OK on dev library: https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07

The only significant concern I've found so far is that YouTube videos are not present, see e.g. https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07/womenshistory.si.edu/blog/gold-standard-how-these-iconic-olympic-athletes-inspired-and-united-us. It looks like they have not even been fetched by the crawler, but even if they were, I think it would not work because they are embedded with a special URL like https://womenshistory.si.edu/media/oembed?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc&max_width=1280&max_height=720&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg which itself embeds the YouTube player iframe. This looks like a specially crafted backend URL; it looks like we could rewrite it directly to the YouTube fuzzy replay URL, tbc.
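
To illustrate why such a rewrite looks feasible, here is a minimal sketch that recovers the YouTube video ID from the oembed URL quoted above. The function name is hypothetical, and this is not the actual zimit or warc2zim rewriting logic; it only shows that the target video is recoverable from the `url` query parameter.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def youtube_id_from_oembed(oembed_url: str) -> Optional[str]:
    """Extract the YouTube video ID wrapped in a /media/oembed URL.

    Hypothetical helper: returns None when the URL does not wrap a
    YouTube watch URL.
    """
    # The wrapped URL is percent-encoded in the "url" query parameter;
    # parse_qs decodes it for us.
    wrapped = parse_qs(urlsplit(oembed_url).query).get("url")
    if not wrapped:
        return None

    target = urlsplit(wrapped[0])
    if target.hostname not in ("www.youtube.com", "youtube.com"):
        return None

    # A watch URL carries the video ID in its own "v" parameter.
    video_ids = parse_qs(target.query).get("v")
    return video_ids[0] if video_ids else None


if __name__ == "__main__":
    example = (
        "https://womenshistory.si.edu/media/oembed"
        "?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc"
        "&max_width=1280&max_height=720"
        "&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg"
    )
    print(youtube_id_from_oembed(example))
```

What the replay side expects as a target URL is left open here, as in the comment above.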

benoit74 commented 2 months ago

The YouTube problem is now tracked upstream: https://github.com/openzim/zimit/issues/360

Popolechien commented 2 months ago

Found another problem here: https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07/womenshistory.si.edu/events/mysterious-mrs-nixon (taken from https://womenshistory.si.edu/events/mysterious-mrs-nixon). Text is entirely missing.

Popolechien commented 2 months ago

Well, technically I see a tiny bit showing up.

benoit74 commented 2 months ago

Well spotted @Popolechien ! Looking at the HTML of this page, upstream issue is most probably https://github.com/openzim/warc2zim/issues/293

benoit74 commented 2 months ago

The bug on the "mrs-nixon" page is in fact a bit different; I've opened a dedicated issue: https://github.com/openzim/warc2zim/issues/364

benoit74 commented 2 months ago

The YouTube issue is in fact in warc2zim: https://github.com/openzim/warc2zim/issues/316

benoit74 commented 2 months ago

The ZIM is more or less fixed; at least YouTube videos now work (I don't really know why) in https://mirror.download.kiwix.org/zim/.hidden/dev/womenshistory.si.edu_en_all_2024-08.zim

Multiple issues are still visible when you wander around the ZIM. For example, blog pages are not displaying, but this looks like mostly a JS issue: the original website tries to make a POST request when we click the button, which is not easily fixable.