openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Double slashes in query string (and path) are not handled properly #316

Open benoit74 opened 2 weeks ago

benoit74 commented 2 weeks ago

URL rewriting does not work properly when the query string contains a double slash (//).

E.g. if the HTML document URL is https://www.example.com/index.html?url=https://www.example.com/foo, then everything is broken on the page in kiwix-serve due to improper URL rewriting / conflicting browser behavior / ...

The problem is that for such an original HTML document URL, the ZIM Path is www.example.com/index.html?url=https://www.example.com/foo and the URL to load this in kiwix-serve is http://yourserver/content/yourzim/www.example.com/index.html%3Furl%3Dhttps%3A//www.example.com/foo
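For illustration, here is a minimal sketch of the mapping above. It assumes the ZIM Path is percent-encoded with Python's `urllib.parse.quote` and its default safe characters; this is an assumption made for the example, and the actual encoding code used by the scraper or readers may differ. It reproduces the URL shown above and checks that, once parsed, the URL has an empty query string and a `//` in its path.

```python
from urllib.parse import quote, urlsplit

zim_path = "www.example.com/index.html?url=https://www.example.com/foo"

# Assumption: the default safe="/" keeps slashes as-is and
# percent-encodes "?", "=" and ":".
encoded_path = quote(zim_path)

# "yourserver" and "yourzim" are the placeholders used in this issue.
url = "http://yourserver/content/yourzim/" + encoded_path
parts = urlsplit(url)

print(url)
print(repr(parts.query))   # no query string is left after encoding "?"
print("//" in parts.path)  # the double slash is now part of the path
```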

From a browser/server perspective, this URL has no query string anymore, since the ? is percent-encoded (this is expected and needed for proper operation in all readers). Everything is hence part of the path, and we end up with a double slash // in it.

Such a URL is fragile, because the way relative URLs are resolved against a path containing a double slash is apparently not handled consistently across implementations. At least Python (used to compute relative paths in the scraper) and Firefox (used to display the ZIM) disagree on this. We can expect our various readers to interpret this URL, and the relative URLs inside the document, in yet more ways.
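A rough sketch of the mismatch, under two assumptions that are not confirmed by this issue: the scraper side is modelled with `posixpath.relpath` (which collapses `//`), and the browser side is modelled by a small hypothetical helper, `resolve_keeping_empty_segments`, which treats the empty segment between the two slashes as a real path segment. For simplicity it works on unencoded ZIM paths, and `www.example.com/style.css` is a made-up target.

```python
import posixpath


def python_relative(from_zim_path, to_zim_path):
    """Relative link from one ZIM path to another, via posixpath.relpath."""
    # A leading "/" makes both paths absolute so the result does not
    # depend on the current working directory.
    start = posixpath.dirname("/" + from_zim_path)
    return posixpath.relpath("/" + to_zim_path, start)


def resolve_keeping_empty_segments(base_path, relative):
    """Simplified browser-like resolution: empty segments are kept."""
    segments = base_path.split("/")[:-1]  # drop the document name
    for part in relative.split("/"):
        if part == "..":
            if segments:
                segments.pop()
        elif part != ".":
            segments.append(part)
    return "/".join(segments)


doc = "www.example.com/index.html?url=https://www.example.com/foo"
target = "www.example.com/style.css"

link = python_relative(doc, target)
resolved = resolve_keeping_empty_segments(doc, link)
print(link)
print(resolved)
print(resolved == target)
```

In this model the relative link climbs two levels, but one of those levels is consumed by the empty segment, so the resolved path does not match the intended target.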

I hence recommend that, when normalizing URLs into a ZIM Path, we get rid of any // (or /// and so on) and simplify them to /. This comes with a small risk of collision, which we can most probably assume is negligible. It would also break websites which use the query string value in JS code and really need the original value. We'll see how often this happens.
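A minimal sketch of this simplification, using a hypothetical helper named `collapse_slashes` (not an existing warc2zim function) applied to an already computed ZIM Path:

```python
import re


def collapse_slashes(zim_path):
    """Replace every run of two or more slashes with a single slash."""
    return re.sub(r"/{2,}", "/", zim_path)


print(collapse_slashes("www.example.com/index.html?url=https://www.example.com/foo"))

# Collision risk: two distinct paths end up with the same normalized value.
print(collapse_slashes("a//b") == collapse_slashes("a/b"))
```

The second call illustrates the collision mentioned above: after normalization, `a//b` and `a/b` can no longer be told apart.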

Note: an alternative would be to keep the same ZIM Path and change the way we transform a ZIM Path into a URL, by deciding that a ZIM Path like foo//bar should lead to the URL foo/%2Fbar. This would keep the same look for URLs displayed in the browser address bar (%2F will continue to be displayed as /) and would minimize the risk of collisions (we keep the same number of characters). The big problem with this approach is that the conversion can only be done when we transform the Path into a URL (we cannot change the ZIM Path, since encoded characters should not be present inside it). This means that all computations of relative links based on the ZIM Path would no longer work and would also have to be adapted ... which seems overly complex.
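For the record, this alternative could look like the sketch below. The helper name `zim_path_to_url_path` is hypothetical, and the handling of runs longer than two slashes (each extra slash becomes its own %2F) is an assumption, since only the foo//bar case is described above.

```python
import re
from urllib.parse import quote, unquote


def zim_path_to_url_path(zim_path):
    """Percent-encode a ZIM path, encoding the extra slashes of any '//' run."""
    encoded = quote(zim_path)  # default safe="/" keeps slashes untouched
    # Keep the first slash of each run, turn every following one into %2F.
    return re.sub(r"/(/+)", lambda m: "/" + "%2F" * len(m.group(1)), encoded)


url_path = zim_path_to_url_path("foo//bar")
print(url_path)
print(unquote(url_path))  # decoding gives back the original ZIM path
```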

Note: this bug has been observed in the wild on https://edu.gcfglobal.org/pt/criar-um-correio-eletronico/como-funciona-um-correio-eletronico/1/# where we have an iframe pointing to https://support.gcfglobal.org/form/?url=https://edu.gcfglobal.org/pt/criar-um-correio-eletronico/como-funciona-um-correio-eletronico/1/

benoit74 commented 2 weeks ago

Note: the real-world example of edu.gcfglobal.org tends to show that my suggestion might not be the best fit: the iframe is useless anyway (it should be hidden / removed), and the result is not going to work in any case since the URL has been broken. Maybe we should simply use a fuzzy rule to simplify the path and avoid storing too many items for nothing.