openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
40 stars 5 forks source link

Some resources rewrite mode are not correctly identified #326

Closed benoit74 closed 4 days ago

benoit74 commented 1 week ago

See https://dev.library.kiwix.org/viewer#www.mankier.com_en_all_2024-06/www.mankier.com/ where all URLs are broken, they are still absolute URLs.

Same problem on https://dev.library.kiwix.org/viewer#footballdatabase_2024-06/footballdatabase.com/ranking/europe/1

Once fixed, please update https://github.com/openzim/zim-requests/issues/171 and https://github.com/openzim/zim-requests/issues/68

kelson42 commented 1 week ago

@benoit74 Sounds a very bad regression. Don't save time on hardening the testing around URYl rewritting.

benoit74 commented 1 week ago

I don't think it is a regression. I didn't investigated a lot, but it more looks like a subtle corner case we did not encountered before. Most pages of mankier ZIM are OK for instance, only this homepage is broken. And we do have lots of automated and manual testing around all this URL rewriting, both in JS and in Python indeed.

benoit74 commented 5 days ago

Problem on mankier is that the page has not been rewritten as HTML at all:

[warc2zim::2024-06-20 02:51:30,510] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/ record: was html, now is None (mimetype: text/html, resourcetype: xhr)

Same problem on football database (and there are many other pages in the same situation):

[warc2zim::2024-06-16 17:45:28,656] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/ranking/europe/1 record: was html, now is None (mimetype: text/html, resourcetype: xhr)

I don't know why the resourcetype is considered as xhr in both cases...

There are also other problems:

[warc2zim::2024-06-20 02:51:28,422] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/manifest.json record: was json, now is None (mimetype: application/json, resourcetype: other)
[warc2zim::2024-06-20 02:51:30,507] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/sw-1.js record: was javascript, now is None (mimetype: text/javascript, resourcetype: other)
[warc2zim::2024-06-16 17:45:28,632] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/sw.js record: was javascript, now is None (mimetype: application/javascript, resourcetype: other)
[warc2zim::2024-06-16 17:45:28,632] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/manifest.json record: was json, now is None (mimetype: application/json, resourcetype: manifest)

(last one is already fixed in 2.0.2, but others are not)

benoit74 commented 5 days ago

So to conclude: this has nothing to do with URL rewriting, but only with rewriting. Rewriting has been hardened in 2.0.1 with https://github.com/openzim/warc2zim/issues/296 to avoid corner cases with lying mime content types. Obviously we are still in a stabilization phase, where before 2.0.1 we were rewriting too much / documents which were not supposed to be rewritten. We are now rewriting a little not enough.

Still hard to understand what drives the browser to consider one resource type more than another one ... I'm not sure we should implement https://github.com/openzim/warc2zim/issues/309 in 2.2 ; this is producing very little logs in both cases ... but always on purpose. Maybe we should keep it forever.