Closed RavanJAltaie closed 10 months ago
Could you please share the link to the library?
It definitely has no impact here, but I don't get why you've kept the CSS I developed for army pub in this recipe as well. Is this just a bad clone issue? You really need to pay attention to all these details. I get that some parameters might seem a bit mysterious, but this customCSS parameter should be well known to you. Even if you do not know how to write the proper CSS code, you must know that this parameter is used to specify the CSS file I develop to remove search bars and stuff like this. You must not reuse these files across recipes (at best they won't work at all, and they could even have unwanted side effects).
That being said, is it correct that you first ran the recipe in /.hidden/dev and that it produced this file? https://dev.library.kiwix.org/viewer#harmony-project.org_en_all_2023-09/A/www.harmony-project.org/music-library
I did not find any difference in terms of Zimfarm recipe settings between the two ZIMs. Did you change something I missed?
The links you gave are not the correct home pages of this ZIM, and that seems to be part of the issue.
The correct ones (you get them when you search for the ZIM in the search engine on the library home page) are:
On my machine, when I open these home pages first and then browse to any page it works well. If I use your "wrong" URL then everything is fucked-up.
@RavanJAltaie could you please test both links provided above?
@mgautierfr @kelson42 @rgaudin do you have any kind of explanation? do you think this is worth investigating further? or should we wait for zimit2 and include this recipe in our tests?
@benoit74 in my browser, the first link has the formatting problem; the second link works fine and the formatting is OK.
I strongly suspect https://github.com/openzim/warc2zim/issues/109
Indeed, it looks exactly the same! I propose to wait for zimit 2 which is not that far in the future anyway.
@Popolechien do you confirm we keep this issue pending until the zimit 2 update?
@RavanJAltaie Yup, no choice I believe.
Umm, it's working fine in Kiwix JS using standard Replay software:
I was about to report another issue about this ZIM, which is that images are only scraped in mobile resolution, and many of them are not in the ZIM at desktop resolution. That's clearly a Zimit issue IMHO, but not sure we can do much about it.
And the mobile vs desktop resolution could easily be the cause of some of the formatting issues, given that the underlying HTML/JS in this ZIM seems to be coded to switch CSS and images according to screen resolution. Since Zimit doesn't change resolution on each page (and how could it, really?), that's going to cause issues if the ZIM is viewed at a different resolution than it was scraped at, because assets required won't be fetched.
@Jaifroid there's a device param on zimit that tweaks the crawling resolution
@rgaudin OK, that's interesting. What I notice in this ZIM, as I've been using it a lot for testing, is that some of the big "banner" images that rotate as a slide show only show on mobile / narrow-screen resolution. Conversely, the small sheet-music thumbnails on the home page only show in desktop / wide-screen resolution. This implies that to get an exact replay, the user would need to set their resolution to the same as that of the scraper.
If the same applies to other assets, it could easily account for the weirdness (though I'm not discounting the issue identified by @mgautierfr). It's compounded by the browser's cache, which will cache assets as empty or not-found, so they might not be fetched again with resolution changes.
This seems like an underlying issue with Replay recording that would be hard to code for unless each page were scraped with dynamic resolution changes during the scrape. @ikreymer, can you advise us? It must be an issue faced by archivists...
But to sum up, from extensive testing I've done on this ZIM, I would say it's not broken in itself, and it's perfectly useable unless the user is unlucky with screen resolution.
Regarding the image resolution, I don't get why browsertrix crawler is not using the right one, since we do not set the device param mentioned by @rgaudin.
You can test it very easily with this command (it will scrape only the home page, but that's more than enough to see that none of the sheet-music thumbnails are at the correct resolution):
docker run -v $PWD:/output --name zimit_harmony-project.org_en_all --rm ghcr.io/openzim/zimit:1.6.2 zimit --url="https://www.harmony-project.org/music-library" --lang="eng" --name="harmony-project.org_en_all" --title="Sheet Music - Harmony Project" --output="/output" --verbose --adminEmail="contact+zimfarm@kiwix.org" --description="Harmony Project’s music library" --statsFilename="/output/task_progress.json" --depth 0
It's using the default profile, which has its own resolution. I meant it can be changed if you wanted a different one, but given the behavior of this website it probably won't help much: a different resolution would make some assets work, but not all.
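For anyone who wants to try a different crawl resolution, here is a minimal dry-run sketch based on the test command above. It assumes the device param is exposed on the zimit command line as `--mobileDevice` (the flag name is not confirmed in this thread, so check `zimit --help` for the image you use), and the device name passed in is only a hypothetical example. The script builds the command and prints it rather than running it. As noted above, on this site a different resolution would probably fix some assets and break others.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds the zimit command and prints it instead of running it.
# ASSUMPTION: the "device param" is exposed as --mobileDevice; verify with `zimit --help`.

build_zimit_cmd() {
  local device="$1"   # hypothetical device name; valid names depend on the crawler
  # Same options as the test command in this thread, plus the assumed device flag.
  ZIMIT_CMD=(
    docker run -v "$PWD:/output" --rm ghcr.io/openzim/zimit:1.6.2 zimit
    --url="https://www.harmony-project.org/music-library"
    --lang="eng"
    --name="harmony-project.org_en_all"
    --title="Sheet Music - Harmony Project"
    --description="Harmony Project’s music library"
    --adminEmail="contact+zimfarm@kiwix.org"
    --output="/output"
    --verbose
    --depth 0
    --mobileDevice="$device"
  )
}

build_zimit_cmd "iPad Pro landscape"

# Print the command, shell-quoted, so it can be reviewed before use.
printf '%q ' "${ZIMIT_CMD[@]}"
echo

# To actually start the crawl, run: "${ZIMIT_CMD[@]}"
```

Keeping the arguments in an array avoids quoting problems with values that contain spaces, such as the title or the device name.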
I just launched the recipe with zimit2 and it failed with a 401 because the URL is not accessible anymore.
https://farm.openzim.org/pipeline/21c1b228-8115-4312-878d-d51079d8af79
Closing this issue, since the requester told us this ZIM is not needed anyway: https://github.com/openzim/zim-requests/issues/600#issuecomment-1889695639
They seem to have password-protected the music library. I wonder if we ever had permission to scrape? Seems odd, especially as the music is supposed to be public domain.
The Harmony Project file in the library has a weird format. The musical sheets are all scraped successfully, but the formatting shown in the photo below needs to be fixed.
The recipe is here