openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Fixing the format of harmony project zim file #258

Closed RavanJAltaie closed 10 months ago

RavanJAltaie commented 11 months ago

The Harmony Project file in the library has a weird format. The music sheets are all scraped successfully, but the formatting shown in the screenshot below needs to be fixed.

image

The recipe is here

benoit74 commented 11 months ago

Could you please share the link to the library?

RavanJAltaie commented 11 months ago

https://library.kiwix.org/viewer#harmony-project.org_en_all_2023-09/A/www.harmony-project.org/music-library

benoit74 commented 11 months ago

It definitely has no impact here, but I don't get why you've kept the CSS I developed for army pub in this recipe as well. Is this just a bad clone issue? You really need to pay attention to all these details. I get that some parameters might seem a bit mysterious, but this customCSS parameter should be well known to you. Even if you do not know how to write the proper CSS code, you must know that this parameter is used to specify the CSS file I develop to remove search bars and things like that. You must not reuse these files across recipes (at best they won't work at all, and at worst they could have unwanted side effects).

That being said, is it correct that you first ran the recipe in /.hidden/dev and it produced this file: https://dev.library.kiwix.org/viewer#harmony-project.org_en_all_2023-09/A/www.harmony-project.org/music-library

I did not find any difference in terms of Zimfarm recipe settings between the two ZIMs. Did you change something I missed?

benoit74 commented 11 months ago

The links you gave are not the correct home pages for this ZIM, and that seems to be part of the issue.

The correct ones (you get them when you search for the ZIM in the search engine on the library home page) are:

On my machine, when I open these home pages first and then browse to any page it works well. If I use your "wrong" URL then everything is fucked-up.

@RavanJAltaie could you please test both links provided above?

@mgautierfr @kelson42 @rgaudin do you have any kind of explanation? do you think this is worth investigating further? or should we wait for zimit2 and include this recipe in our tests?

RavanJAltaie commented 11 months ago

@benoit74 for my browser, the first link has the format problem, the second link works fine and format is ok with no problems.

mgautierfr commented 11 months ago

> @mgautierfr @kelson42 @rgaudin do you have any kind of explanation? do you think this is worth investigating further? or should we wait for zimit2 and include this recipe in our tests?

I strongly suspect https://github.com/openzim/warc2zim/issues/109

benoit74 commented 11 months ago

Indeed, it looks exactly the same! I propose to wait for zimit 2 which is not that far in the future anyway.

RavanJAltaie commented 11 months ago

@Popolechien do you confirm we keep this issue pending until the zimit 2 update?

Popolechien commented 11 months ago

@RavanJAltaie Yup, no choice I believe.

Jaifroid commented 11 months ago

Umm, it's working fine in Kiwix JS using standard Replay software:

image

I was about to report another issue about this ZIM, which is that images are only scraped in mobile resolution, and many of them are not in the ZIM at desktop resolution. That's clearly a Zimit issue IMHO, but not sure we can do much about it.

Jaifroid commented 11 months ago

And the mobile vs desktop resolution could easily be the cause of some of the formatting issues, given that the underlying HTML/JS in this ZIM seems to be coded to switch CSS and images according to screen resolution. Since Zimit doesn't change resolution on each page (and how could it, really?), that's going to cause issues if the ZIM is viewed at a different resolution than it was scraped at, because assets required won't be fetched.
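One rough way to check whether a page declares resolution-dependent assets is to list its `srcset` and `media` attributes. The helper below is only a sketch: the function name `list_responsive_hints` and the sample HTML are hypothetical, it relies on `grep -o` (available in GNU, BSD and BusyBox grep), and it cannot see assets that are switched from JavaScript.

```shell
# Hypothetical helper: read HTML on stdin and print every srcset="..." or
# media="..." attribute found, one per line. These attributes are the usual
# declarative hints that a page picks images or CSS by screen width.
list_responsive_hints() {
  grep -E -i -o 'srcset="[^"]*"|media="[^"]*"'
}

# Example with made-up markup (not taken from the Harmony Project site).
list_responsive_hints <<'EOF'
<link rel="stylesheet" href="wide.css" media="(min-width: 768px)">
<img src="thumb-small.png" srcset="thumb-small.png 480w, thumb-large.png 1200w">
EOF
```

If the list is long, the assets fetched during the crawl probably depend on the resolution the crawler was emulating at the time.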

rgaudin commented 11 months ago

@Jaifroid there's a device param on zimit that tweaks the crawling resolution

Jaifroid commented 11 months ago

@rgaudin OK, that's interesting. What I notice in this ZIM, as I've been using it a lot for testing, is that some of the big "banner" images that rotate as a slide show only show on mobile / narrow-screen resolution. Conversely, the small sheet-music thumbnails on the home page only show in desktop / wide-screen resolution. This implies that to get an exact replay, the user would need to set their resolution to the same as that of the scraper.

If the same applies to other assets, it could easily account for the weirdness (though I'm not discounting the issue identified by @mgautierfr). It's compounded by the browser's cache, which will cache assets as empty or not-found, so they might not be fetched again with resolution changes.

This seems like an underlying issue with Replay recording that would be hard to code for unless each page were scraped with dynamic resolution changes during the scrape. @ikreymer, can you advise us? It must be an issue faced by archivists...

But to sum up, from extensive testing I've done on this ZIM, I would say it's not broken in itself, and it's perfectly useable unless the user is unlucky with screen resolution.

benoit74 commented 11 months ago

Regarding the image resolution, I don't get why browsertrix crawler is not using the right one since we do not set the device param mentioned by @rgaudin.

You can test it very easily with this command (it will scrape only the home page, but that's more than enough to see that none of the sheet-music thumbnails are at the correct resolution):

docker run -v $PWD:/output --name zimit_harmony-project.org_en_all --rm ghcr.io/openzim/zimit:1.6.2 zimit --url="https://www.harmony-project.org/music-library" --lang="eng" --name="harmony-project.org_en_all" --title="Sheet Music - Harmony Project" --output="/output" --verbose --adminEmail="contact+zimfarm@kiwix.org" --description="Harmony Project’s music library" --statsFilename="/output/task_progress.json" --depth 0

rgaudin commented 11 months ago

> Regarding the image resolution, I don't get why browsertrix crawler is not using the right one since we do not set the device param mentioned by @rgaudin.

It's using the default profile, which has its own resolution. I meant that it can be changed if you want a different one, but given the behavior of this website it probably won't help much: you would get a different resolution working, but still not all of them.
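For anyone who wants to try another profile anyway, here is a minimal sketch that prints (without running) a trimmed version of the one-page command above with a device added. It assumes the device param is spelled `--mobileDevice`, which should be confirmed with `zimit --help`; the function name `build_zimit_cmd` and the device name "iPad" are placeholders.

```shell
# Hypothetical helper: print a single-page zimit crawl command, optionally
# emulating a device. It only prints the command, so nothing is crawled
# until you paste it into a shell yourself.
build_zimit_cmd() {
  device="$1"   # placeholder device name; empty means "use the default profile"
  # Single quotes keep $PWD literal so it expands when the command is pasted.
  cmd='docker run -v "$PWD:/output" --rm ghcr.io/openzim/zimit:1.6.2 zimit'
  cmd="$cmd --url=\"https://www.harmony-project.org/music-library\""
  cmd="$cmd --name=\"harmony-project.org_en_all\""
  cmd="$cmd --output=\"/output\" --depth 0"
  if [ -n "$device" ]; then
    # ASSUMPTION: flag name not confirmed in this thread; check `zimit --help`.
    cmd="$cmd --mobileDevice=\"$device\""
  fi
  printf '%s\n' "$cmd"
}

build_zimit_cmd "iPad"
```

Comparing the ZIMs produced with and without a device should show which thumbnails and banners each profile actually captures.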

benoit74 commented 10 months ago

I just launched the recipe with zimit2 and it failed with 401 because the URL is not accessible anymore.

https://farm.openzim.org/pipeline/21c1b228-8115-4312-878d-d51079d8af79

Closing this issue since the requester told us this ZIM is not needed anyway: https://github.com/openzim/zim-requests/issues/600#issuecomment-1889695639

Jaifroid commented 10 months ago

They seem to have password-protected the music library. I wonder if we ever had permission to scrape? Seems odd, especially as the music is supposed to be public domain.