Open benoit74 opened 2 months ago
Recipe URL : https://farm.openzim.org/recipes/womenshistory.si.edu
Exclude so far: https?:\/\/womenshistory\.si\.edu(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq) ; goal is to exclude:
edan_fq: this is a trick to remove facet search pages, e.g. on https://womenshistory.si.edu/exhibitions ; not sure this is going to be future-proof, but I did not find a better solution (yet)
ZIM seems to be pretty OK on dev library: https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07
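For anyone who wants to sanity-check the exclude pattern before touching the recipe, here is a minimal sketch in Python. It assumes the crawler applies the pattern as an unanchored regex search on the full URL (not verified here), and the `edan_fq` query value in the samples is made up for illustration; `is_excluded` is a hypothetical helper, not part of zimit.

```python
import re

# Exclude pattern copied verbatim from the recipe notes in this issue.
EXCLUDE = re.compile(
    r"https?:\/\/womenshistory\.si\.edu"
    r"(?:\/es\/|\/contact-us|\/search|\/object|.*\?.*edan_fq)"
)


def is_excluded(url: str) -> bool:
    """Return True when the exclude pattern matches somewhere in the URL.

    Assumption: the crawler does an unanchored search, like re.search.
    """
    return EXCLUDE.search(url) is not None


# Sample URLs; the facet value after edan_fq= is invented for this check.
samples = [
    "https://womenshistory.si.edu/es/",
    "https://womenshistory.si.edu/contact-us",
    "https://womenshistory.si.edu/exhibitions",
    "https://womenshistory.si.edu/exhibitions?edan_fq=example",
    "https://womenshistory.si.edu/events/mysterious-mrs-nixon",
]

for url in samples:
    print("EXCLUDED" if is_excluded(url) else "kept    ", url)
```

Note that the pattern is prefix-based, so `/object` and `/search` also exclude any path that merely starts with those strings.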
Only significant concern I've found so far is that YouTube videos are not present, see e.g. https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07/womenshistory.si.edu/blog/gold-standard-how-these-iconic-olympic-athletes-inspired-and-united-us. It looks like they have not even been fetched by the crawler, but even if they were I think it would not work because they are embedded with a special URL like https://womenshistory.si.edu/media/oembed?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc&max_width=1280&max_height=720&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg which itself embeds the YouTube player iframe. This looks like a specially crafted backend URL; it looks like we could rewrite it directly to the YouTube fuzzy replay URL, tbc.
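To illustrate the rewrite idea, here is a rough sketch of how the YouTube video id could be pulled out of such an oembed URL. This is only an illustration under assumptions: `youtube_id_from_oembed` is a hypothetical helper, the accepted host names are my guess, and it does not show how warc2zim or the replay rules would actually consume the result.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def youtube_id_from_oembed(oembed_url: str) -> Optional[str]:
    """Extract the YouTube video id from a /media/oembed?url=... link.

    Returns None when the link does not wrap a YouTube watch URL.
    """
    # The wrapped video URL is percent-encoded in the "url" parameter;
    # parse_qs decodes it for us.
    wrapped = parse_qs(urlsplit(oembed_url).query).get("url")
    if not wrapped:
        return None
    target = urlsplit(wrapped[0])
    # Assumption: only these host names are treated as YouTube here.
    if target.hostname not in ("www.youtube.com", "youtube.com"):
        return None
    video_ids = parse_qs(target.query).get("v")
    return video_ids[0] if video_ids else None


# The oembed URL quoted in the comment above.
example = (
    "https://womenshistory.si.edu/media/oembed"
    "?url=https%3A//www.youtube.com/watch%3Fv%3D6l7OxP67XSc"
    "&max_width=1280&max_height=720"
    "&hash=ZPHuxNt5R3L87vqLbN-Ub0XypraFbUX0cASUJv_mTjg"
)
print(youtube_id_from_oembed(example))
```

The `hash` parameter is ignored in this sketch; whether the backend needs it to be preserved is not something this thread establishes.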
The YouTube problem is now tracked upstream: https://github.com/openzim/zimit/issues/360
Found another problem here: https://dev.library.kiwix.org/viewer#womenshistory.si.edu_en_all_2024-07/womenshistory.si.edu/events/mysterious-mrs-nixon (taken from https://womenshistory.si.edu/events/mysterious-mrs-nixon). Text is entirely missing.
Well technically I see a tiny bit showing up
Well spotted @Popolechien ! Looking at the HTML of this page, upstream issue is most probably https://github.com/openzim/warc2zim/issues/293
Bug on "mrs-nixon" page is in fact a bit different, I've opened a dedicated issue: https://github.com/openzim/warc2zim/issues/364
The YouTube issue is in fact in warc2zim: https://github.com/openzim/warc2zim/issues/316
ZIM is more or less fixed, at least YouTube videos work (don't really know why) at https://mirror.download.kiwix.org/zim/.hidden/dev/womenshistory.si.edu_en_all_2024-08.zim
Multiple issues are still visible when you wander around the ZIM (e.g. blog pages are not displaying; this looks like it is mostly a JS issue of the original website trying to make a POST request when we click the button, which is not easily fixable)
This is a "fake" new request to track the fact that we (@kelson42 and I) are building a demo zimit ZIM of womenshistory.si.edu ; this issue will also help to track choices we've made in the recipe configuration