Closed benoit74 closed 4 days ago
@benoit74 Sounds a very bad regression. Don't save time on hardening the testing around URYl rewritting.
I don't think it is a regression. I didn't investigated a lot, but it more looks like a subtle corner case we did not encountered before. Most pages of mankier ZIM are OK for instance, only this homepage is broken. And we do have lots of automated and manual testing around all this URL rewriting, both in JS and in Python indeed.
Problem on mankier is that the page has not been rewritten as HTML at all:
[warc2zim::2024-06-20 02:51:30,510] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/ record: was html, now is None (mimetype: text/html, resourcetype: xhr)
Same problem on football database (and there are many other pages in the same situation):
[warc2zim::2024-06-16 17:45:28,656] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/ranking/europe/1 record: was html, now is None (mimetype: text/html, resourcetype: xhr)
I don't know why the resourcetype is considered as xhr
in both cases...
There are also other problems:
[warc2zim::2024-06-20 02:51:28,422] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/manifest.json record: was json, now is None (mimetype: application/json, resourcetype: other)
[warc2zim::2024-06-20 02:51:30,507] WARNING:Rewrite mode has changed in 2.0.1 for www.mankier.com/sw-1.js record: was javascript, now is None (mimetype: text/javascript, resourcetype: other)
[warc2zim::2024-06-16 17:45:28,632] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/sw.js record: was javascript, now is None (mimetype: application/javascript, resourcetype: other)
[warc2zim::2024-06-16 17:45:28,632] WARNING:Rewrite mode has changed in 2.0.1 for footballdatabase.com/manifest.json record: was json, now is None (mimetype: application/json, resourcetype: manifest)
(last one is already fixed in 2.0.2, but others are not)
So to conclude: this has nothing to do with URL rewriting, but only with rewriting. Rewriting has been hardened in 2.0.1 with https://github.com/openzim/warc2zim/issues/296 to avoid corner cases with lying mime content types. Obviously we are still in a stabilization phase, where before 2.0.1 we were rewriting too much / documents which were not supposed to be rewritten. We are now rewriting a little not enough.
Still hard to understand what drives the browser to consider one resource type more than another one ... I'm not sure we should implement https://github.com/openzim/warc2zim/issues/309 in 2.2 ; this is producing very little logs in both cases ... but always on purpose. Maybe we should keep it forever.
See https://dev.library.kiwix.org/viewer#www.mankier.com_en_all_2024-06/www.mankier.com/ where all URLs are broken, they are still absolute URLs.
Same problem on https://dev.library.kiwix.org/viewer#footballdatabase_2024-06/footballdatabase.com/ranking/europe/1
Once fixed, please update https://github.com/openzim/zim-requests/issues/171 and https://github.com/openzim/zim-requests/issues/68