openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0
163 stars 47 forks source link

Fix ambiguities around the usage of URLs in ZIM file format #865

Closed benoit74 closed 3 months ago

benoit74 commented 4 months ago

In the ZIM file format wiki page at https://wiki.openzim.org/wiki/ZIM_file_format#URLs, we speak about URL.

This section is confusing / ambiguous because:

There is also no documentation about the fact that the path must not contain certain characters. The one I'm aware of following today live discussion is the ?. If I understood it correctly, most readers (especially kiwix-serve) will consider that everything after the ? is a querystring, and this querystring will be "absorbed" by the webserver and not passed down to libzim which will not use the querystring to fetch the proper ZIM record inside the ZIM. If correct, this should clearly be documented (is it documented somewhere else?). Is there any other character which is not allowed in the path or at least produces weird behaviors?

benoit74 commented 4 months ago

Nota: all these questions have appeared while solving https://github.com/openzim/warc2zim/issues/206

Jaifroid commented 4 months ago

I think we need a clearer focus on the querystring problem. Clearly readers handle it, and pass it to their backends / libzim, otherwise many more things would be broken. An unencoded ? in the middle of a ZIM-URL string that is nothing to do with a querystring doesn't seem to be a problem for our readers.

An example, from Wikipedia, the ZIM URL A/What_About_Me?_(Kenny_Rogers_album) (no encoding of ?). This is not a valid URL, but it is a valid ZIM URL. It corresponds directly to an article with title What About Me? (Kenny Rogers song). Kiwix Desktop has no problem opening this article (nor Kiwix PWA).

However, in https://github.com/openzim/warc2zim/issues/163#issuecomment-1914284279 @mgautierfr said 'It is normal and wanted that we encode ?. This "query string separator" for wordpress is not one for kiwix.' I believe he was referring to what is normal in warc2zim, rather than what is required for the readers.

Nevertheless, we have a slightly confusing situation. Should we have decoded ? and other separators in ZIM URLs or not?

benoit74 commented 4 months ago

Regarding the querysting problem, let's focus on your Kenny Rogers example.

If we open the Kenny Rogers page at https://library.kiwix.org/content/wikipedia_en_all_nopic_2024-02/A/Kenny_Rogers and search for "What About Me?" link, we see that the ? is encoded in the HTML document.

image

And opening this encoded URL (which must be encoded to match W3C specifications) works properly because there is a ZIM entry at A/What_About_Me?_(Kenny_Rogers_album).

In other words: this example works because there is no querystring in the URL of the HTML document, only an encoded ?.

@rgaudin just confirmed me as well that when you open a URL with a real querystring, this querystring is not passed to the part of libzim which search for the ZIM entry, it is "absorbed/dropped" by the "web server" part.

So for now it is quite clear to me that:

rgaudin commented 4 months ago

The goal of the spec/documentation is to avoid this kind of discussion. So it clearly needs to be improved.

An example, from Wikipedia, the ZIM URL A/What_AboutMe?(Kenny_Rogers_album) (no encoding of ?). This is not a valid URL, but it is a valid ZIM URL. It corresponds directly to an article with title What About Me? (Kenny Rogers song). Kiwix Desktop has no problem opening this article (nor Kiwix PWA).

This is misleading because there is no querystring here: the URL path of this article is A/What_About_Me%3F_(Kenny_Rogers_song) as you can see here and here.

Encoding is a convention so whatever we do, it should be clear, and most importantly why.

Regarding querystrings, libkiwix/server doesn't use them when looking for records. That can be justified by the fact that a resource is uniquely identified by its path and that a query string applies to a single query (hence the name).

For zimit we might need a way to store and access data for say /?page=home and /?page=about which is a frequent pattern. If we want to leave the readers as is, we can rewrite those links to a different path (and store qs-different requests on different paths – if not already the case).

Jaifroid commented 4 months ago

This is misleading because there is no querystring here: the URL path of this article is A/What_AboutMe%3F(Kenny_Rogers_song) as you can see here and here.

@rgaudin, I think you're being misled by the fact that the browser displays the URL encoded in the address bar, but the value is stored in the ZIM unencoded. Please see zimdump readout below, and note how I referenced the URL in the commandline. I never said it was a querystring, btw. I used it as an example to show that we have unencoded URL separators in normal OpenZIM ZIMs.

image

rgaudin commented 4 months ago

@Jaifroid I know how it's stored, you made that very clear.

I thought you replied to @benoit74 who mentioned the querystring issue which exists.

Jaifroid commented 4 months ago

@Jaifroid I know how it's stored, you made that very clear.

I thought you replied to @benoit74 who mentioned the querystring issue which exists.

OK. We're clearly talking at cross-purposes, and apologies for that. I was referring to ZIM URLs, not their representation in HTML (which is why I gave an example with a namespace, i.e. a path in the ZIM URL index). My point was merely that OpenZIM ZIMs follow the rule that the ZIM-URL string should be fully decoded, but the problem we're grappling with is that this doesn't appear (always?) to be the case for ZIM-URLs converted with warc2zim.

@benoit74 In your helpful explanation above, I think you are talking about what happens when a user clicks on a link (which indeed will always be encoded), and what happens to the querystring in that case. If it is verified that any (true) querystring is absorbed/dropped by the server, then that's quite serious and would probably be an issue to solve in libkiwix (?).

But the reader can clearly pass the querystring to the server in the case of dynamically fetched URLs and assets (and it is not discarded). Otherwise, pretty much nothing (let alone video) would work in most ZImit ZIMs.

I hope we're on the same page now, but let me know if not!

rgaudin commented 4 months ago

If it is verified that any (true) querystring is absorbed/dropped by the server, then that's quite serious and would probably be an issue to solve in libkiwix (?).

It is,. Look at https://github.com/kiwix/libkiwix/blob/922c138809fe936ea77797de8eb188199255e6f7/src/server/internalServer.cpp#L1116. You'll see that the query string parameters are stored in the RequestContext but not used to fetch data from the ZIM. It's done solely off the path and the path is not transformed: the prefix is removed. that's it.

But the reader can clearly pass the querystring to the server in the case of dynamically fetched URLs and assets (and it is not discarded). Otherwise, pretty much nothing (let alone video) would work in most ZImit ZIMs.

When I look at zimit1 ZIMs, I see them make requests in encoded form (using %3F). I don't know what part of the stack does this. AFAIK we were not rewriting so it would be either Wombat or the SW/replayer.

# firefox
Filename: /content/mes-quartiers-chinois_fr_all_2024-03/0.gravatar.com/avatar/%3Fs%3D80%26d%3Didenticon%26r%3DG

# kiwix-serve
full_url  : /content/mes-quartiers-chinois_fr_all_2024-03/0.gravatar.com/avatar/?s=80&d=identicon&r=G

I have no idea if or how it works for zimit2 but either this is already taken care of by Wombat (or what?) or we need to rewrite. In any case, this should be clearly documented.

Jaifroid commented 4 months ago

So, I have a zimit2 example with a true querystring in a link to be clicked by a user.

TLDR: In the mentioned ZIM, URLs with true querystrings are "incorrectly" presented as fully encoded in the HTML (i.e. the ? is represented as %3F despite being a querystring). Details below.

In the latest dev Solidarité numérique, look at https://dev.library.kiwix.org/viewer#www.solidarite-numerique.fr_fr_all_2024-03/www.solidarite-numerique.fr/ , and inspect "lire la suite" under "Comprendre les cookies". Here a true querystring is "wrongly" encoded (see screenshot). The link is given in the HTML representation as tutoriels/comprendre-les-cookies/%3Fthematique%3Dinternet. It should be tutoriels/comprendre-les-cookies/?thematique=internet (neither the ? nor the = should be encoded, as both of these are separators, not part of the encoded value).

However, it works in Kiwix Serve, probably precisely because the querystring is encoded, corroborating what you say @rgaudin. It doesn't work in Kiwix JS (neither the add-on nor the PWA), because my backend doesn't see the querystring here. The underlying ZIM-URL (as stored in the ZIM) is C/www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/?thematique=internet.

Now, I can clearly work around this issue in my KJS code, but should I? It looks like a workaround introduced to bypass the issue with Kiwix Serve's handling of querystrings.

NB I don't have such issues with zimit1. This is because Replay URLs are presented in a completely different format (with prefixes designed to be picked up by the Service Worker). So I suppose the workaround introduced here must be in the zimit2 Python transformations.

image

EDIT: On the original web page, https://www.solidarite-numerique.fr/, the querystring is presented correctly, unencoded:

image

benoit74 commented 4 months ago

I would like that we focus in this issue on the general case, not Zimit only. Discussions about what should be done in Zimit is in the other issue. And discussions in this issue should be only about what has to be done so that the specification clearly represent the behavior of main reader implementation, and if possible the implementation of other important readers like KJS.

So back to the general case, the ZIM spec says that url should be stored unencoded, in utf-8. I'm fine with it.

This means that, like we have in Wikipedia example above, A/What_About_Me?_(Kenny_Rogers_album) is a perfectly valid ZIM path.

The only way for now with libkiwix (and probably KJS, I let you confirm) to load this ZIM entry is to fetch a URL like A/What_About_Me%3F_(Kenny_Rogers_album). And I think that we have to live with it for now, i.e. a ? in a ZIM entry Path means a %3F in the URLs (either loaded from HTML source or directly from the browser address bar).

And the consequence / corollary, is that it is not allowed / possible for now to store something like a querystring in the ZIM, since current main implementation (libzim) drops these querystring.

Should an HTTP URL (in HTML document, ...) contain a querystring, this has to be dropped by the reader for now, since even if they were used to search the entry inside the ZIM (which is not the case for now, at least in libkiwix), there would be confusion between the meaning of %3F and ? since the ZIM entry path is unencoded so always with ?.

Last point, we must remove the reference to the RFC in the specification.

How we move forward the situation for Zimit will be discussed in Zimit issue. And if we all agree on my statements above, I will open another libzim issue discussing the potential need to evolve the ZIM specification / readers behaviors to be able to process querystring passed in HTTP URLs (spoiler: we can probably live without it for Zimit, at least for the next months).

Jaifroid commented 4 months ago

The only way for now with libkiwix (and probably KJS, I let you confirm) to load this ZIM entry is to fetch a URL like A/What_AboutMe%3F(Kenny_Rogers_album)

That's not how it works in KJS. In KJS we fetch exactly A/What_About_Me?_(Kenny_Rogers_album). So we do decodeURIComponent() on any URL in the HTML (e.g. in a link clicked by a user) to extract the true underlying ZIM-URL. However, this is currently failing in the Zimit example in KJS, due to querystring handling code (I have opened https://github.com/kiwix/kiwix-js/issues/1229 for that).

And the consequence / corollary, is that it is not allowed / possible for now to store something like a querystring in the ZIM, since current main implementation (libzim) drops these querystrings.

I think this inference is not quite correct 😉. We store lots of URLs with (unencoded) true querystrings in the ZIM currently and we can't get by without this. In the zimit2 example I give above, the URL is (correctly) stored in the ZIM with querystring separators that are not encoded. It is "incorrectly" presented as encoded in the HTML (not in the ZIM), and this has been done intentionally for compatibility with Kiwix Serve, it seems. But as you say, that's another issue (albeit impacting this issue).

EDIT: Look at all these examples, and also note (incidentally) that there are encoded data in these querystrings (look for the % characters), but that URL separators (/, ?, =, etc.) are not encoded.

image

benoit74 commented 4 months ago

We store lots of URLs with (unencoded) true querystrings in the ZIM currently and we can't get by without this.

All these entries you are mentioning have to be fetched with a %3F in the HTTP URL, not with a ?/querystring.

And these are scraper issues. And potentially a need to evolve the ZIM specification.

If the ZIM Path is something?somewhat, this has to be fetched from the URL something%3Fsomewhat currently. This is exactly what mwoffliner is doing and it works perfectly fine with kiwix-serve, KJS, ...

This entry cannot be fetched from the HTTP URL something?somewhat. The scraper (zimit2 here, but zimit1 has same flaw, and other scraper probably as well) is simply wrongly expected that. Both because the reference implementation is anyway dropping querystring so it won't work and because there would be two W3C valid URLs leading to the same ZIM entry, which is probably not a future-proof situation.

Jaifroid commented 4 months ago

All these entries you are mentioning have to be fetched with a %3F in the HTTP URL, not with a ?/querystring.

That's only true for readers / libkiwix that don't handle the querystring. They are stored in the ZIM with a raw ?, not a %3F, so it's definitely a workaround that they are presented "over-encoded" in the HTML. However, I have a PR to work around the workaround, if that's what's decided! I think this is only zimit2, because in zimit1 URLs are formatted very differently, in a precise way necessary for Replay worker to deal with them.

Not sure where this leaves us wrt the spec, though.

benoit74 commented 4 months ago

I have:

I've checked inside libzim, and we are unfortunately using the "URL" word and not the "Path" word, while I consider the later would definitely have been more appropriate (these URL are not valid at all from most RFCs perspectives, since they are not url encoded and hence using many reserved characters).

Please :

I know that many scrapers do not obey the convention (yet), e.g. zimit1 and zimit2, but I also "abused" the querystring "flaw" in ifixit.

I consider it is anyway way better to fix scrapers rather than modify the libzim / ZIM specification / all readers.

Especially since choices that have been made in the libzim / convention seems pretty logic and sound from my PoV and I've found no issue in zimit2 to follow the convention, PR is under way.

Jaifroid commented 4 months ago

@benoit74 Thank you very much. To be clear, you want us to check readers we're responsible for to see if the behaviour is as described for each ZIM URL on the landing page of this ZIM?

Can I just check that there are no changes to the way the ZIM is written, i.e. you haven't made any changes to warc2zim for the purposes of this demo? So I know what I'm testing against. Thanks for your hard work on this!

benoit74 commented 4 months ago

To be clear, you want us to check readers we're responsible for to see if the behaviour is as described for each ZIM URL on the landing page of this ZIM?

Yes, or readers you have access to and are keen to test ^^

Can I just check that there are no changes to the way the ZIM is written, i.e. you haven't made any changes to warc2zim for the purposes of this demo? So I know what I'm testing against. Thanks for your hard work on this!

The ZIM has been created directly from custom Python code (available in the tar.gz) + python-scraperlib 3.3.1. No special libzim / scraperlib / whatever. But please have a look as well to understand the HTML code + python code used, it is pretty straightforward to read I think. And you can even reuse the code.

Again, in this issue I do not want to focus on zimit/warc2zim case. I want to ensure we all have a real alignment and understanding of how to properly store ZIM entries and how to properly link to them from an HTML document. Without this alignment, we can't create good ZIMs which will be able to work in all kind of readers without needing many adaptations for each scraper custom behavior.

zimit/warc2zim case is only an illustration of the miss-alignment / miss-understanding we have for now. And this scraper is doomed to fail until we've resolved this situation because many special characters are possible once you start to scrape any kind of web property.

Jaifroid commented 4 months ago

Below are my tests for Kiwix JS and PWA.

CAVEAT: KJS-family apps act differently if reading an OpenZIM, zimit1 or zimit2 archive. Historically, with OpenZIM, we follow the spec in ignoring the querystring. When we added zimit1 support, we had to change that for zimit archives only (we did not want to alter the specified behaviour for OpenZIM). The changes to querystring processing apply to zimit1 and zimit2. Furthermore, we encode URLs before extracting them from the ZIM in the case of zimit1 (only). So, to be clear, if this archive were identified as zimit, then the result of the tests would be different from what is stated below.

Tests use latest released versions of Kiwix JS (v4.0.1) and Kiwix PWA (v3.1.1), numbering the 15 test cases from top to bottom. Summary: they conform to the stated actions, except that the PWA opens plain text files in a new window (there's a legacy reason for it, but it doesn't matter here). As a result the emojis in the last article are displayed with the incorrect character encoding. However, in KJS, the emojis are shown correctly. As you say, this is an edge case.

  1. KJS ✓ PWA ✓
  2. KJS ✓ PWA ✓
  3. KJS ✓ PWA ✓
  4. KJS ✓ PWA ✓
  5. KJS ✓ PWA ✓
  6. KJS X (displays empty page) PWA X (warns not found)
  7. KJS X (displays empty page) PWA X (warns not found)
  8. KJS X (displays empty page) PWA X (warns not found)
  9. KJS ✓ PWA ✓
  10. KJS ✓ PWA ✓
  11. KJS X (displays empty page) PWA X (warns not found)
  12. KJS ✓ PWA ✓
  13. KJS X (displays empty page) PWA X (warns not found)
  14. KJS ✓ PWA ✓
  15. KJS ✓ (shows emojis correctly) PWA (✓) (finds article and displays it but with wrong encoding)
benoit74 commented 4 months ago

@Jaifroid did I get it correctly if I say that "if zimit2 conforms with the specification and more specifically the same logic than used in these tests, KJS and PWA will probably not need to have any special logic to handle zimit2 ZIMs"?

Jaifroid commented 4 months ago

@Jaifroid did I get it correctly if I say that "if zimit2 conforms with the specification and more specifically the same logic than used in these tests, KJS and PWA will probably not need to have any special logic to handle zimit2 ZIMs"?

I don't think I'll need to change anything that I have not already put in the apps (or in a draft PR proposal) in order to handle zimit2 ZIMs, depending on the direction you choose to take with regard to encoding URLs in articles and/or encoding/decoding of ZIM URLs when they are stored in the ZIM. However, it's not a big issue if I do have to make adjustments.

The main sticky issue for me is user activation of links (e.g. clicking a link). In this case. we are forced to do additional processing to avoid triggering the sandbox in case a URL leads to an external site, or in case we accidentally load a PDF in the iframe (this is a sandbox/CORS violation in Chromium browsers). Hence, I need to know how links in articles will be encoded, and whether ZIM URLs will be encoded, and may need to adjust accordingly. I am hoping we can work around the difficulties outlined in my Catch 22 post, but that's not a unique issue for KJS!

mgautierfr commented 4 months ago

I think it is time to give my pov of this :)

I will take a professorial tone but I think it is better to be sure we have all the same pov, sorry about that.

Why url encoding.

We (the world) encode url for two different reasons:

For example:

Note that even if the first uri doesn't need to be encoded, it still can be (over) encoded as https://foo.org/path%40/entry?key=value, or even more as https://foo.org%2Fpath%40%2Fentry?key=value. In fact, it is valid to encode ANY char (except when they have a meaning and we want to keep that meaning). So a same url may be encoded different ways. I will call the canonical (or real) url the variant when less chars are encoded.

In zim file.

Entries in a zim file are referenced using a "key". This "key" is a array of u8, nothing more. The "key" is called path in libzim code and it was url in previous version of libzim (and still in spec) What is inside this array is in fact not relevant for the code (we treat it as raw array of u8) but is relevant if we want to have something consistent. The specification told us: utf8 encoded, not url encoded. I would say : The canonical url, where non ascii char are utf8 encoded and not url encoded.

Meaning that https://foo.org/path€/entry?key=value is stored as \x68\x74\x74\x70\x73\x3a\x2f\x2f\x66\x6f\x6f\x2e\x6f\x72\x67\x2f\x70\x61\x74\x68\xe2\x82\xac\x2f\x65\x6e\x74\x72\x79\x3f\x6b\x65\x79\x3d\x76\x61\x6c\x75\x65 (or https://foo.org/path\xe2\x82\xac/entry?key=value) That all

Serving content in a zim file.

Everything should be simple:

We should have minimal parsing of the request-uri only to extract the zim to serve (and potentially no parsing at all for single zim server).

However, we are not in theory but in real world. And server (and library) parse the request-uri for us. So things like ? are parsed for us (at least libmicrohttpd does). And globally, we want to keep this behavior. The simple solution is to make the entry's key equal to the request-uri's path. So encode request url the way it can be decoded and the decoded path (only) is equal to the entry key.

So, at request (in html) level:

In all case, server will not interpret the "hidden" ?, find the path, decode it (and so reveal the ? and ) and pass it to "us" as a utf8 char.

If somehow we store https://foo.org/path@/entry%3Fkey=value in the zim file, we have to encoded in the request as https://foo.org/path@/entry%253Fkey=value else we will not find the entry https://foo.org/path@/entry?key=value. That is the problem of https://github.com/openzim/warc2zim/issues/206 where we store a url which is not canonical.

All those thing was (and is still) valid from "the beginning". But as we were using specific scrapper and use only keys without ? (or url without query-string) this was pretty straight forward and a forgotten/extra url encoding/decoding was harmless. And the Canonical url was equivalent to not url encoded.

But with warc2zim2, we store entries with a query string and we indeed need to better understand the mechanism.

Jaifroid commented 4 months ago

Meaning that https://foo.org/path€/entry?key=value is stored as \x68\x74\x74\x70\x73\x3a\x2f\x2f\x66\x6f\x6f\x2e\x6f\x72\x67\x2f\x70\x61\x74\x68\xe2\x82\xac\x2f\x65\x6e\x74\x72\x79\x3f\x6b\x65\x79\x3d\x76\x61\x6c\x75\x65

Thank you @mgautierfr! Let me be sure I've understood that this is a hexadecimal string that can be decoded directly from hexadecimal to the UTF-8 representation https://foo.org/path€/entry?key=value? In other words, you are suggesting no percent-encoding of UTF-8 characters for zimit2 (like openzim, and as per spec)?

mgautierfr commented 4 months ago

Here some python code:

>>> bytes_url = "https://foo.org/path€/entry?key=value".encode("utf8")
>>> bytes_url
b'https://foo.org/path\xe2\x82\xac/entry?key=value'
>>> list(bytes_url)
[104, 116, 116, 112, 115, 58, 47, 47, 102, 111, 111, 46, 111, 114, 103, 47, 112, 97, 116, 104, 226, 130, 172, 47, 101, 110, 116, 114, 121, 63, 107, 101, 121, 61, 118, 97, 108, 117, 101]
>>> ''.join(['\\x%02x' % b for b in bytes_url])
'\x68\x74\x74\x70\x73\x3a\x2f\x2f\x66\x6f\x6f\x2e\x6f\x72\x67\x2f\x70\x61\x74\x68\xe2\x82\xac\x2f\x65\x6e\x74\x72\x79\x3f\x6b\x65\x79\x3d\x76\x61\x6c\x75\x65'
kelson42 commented 4 months ago

I think we agree to reword "URL" in "path" in the spec because:

benoit74 commented 4 months ago

I'm not sure to understand how libzim made this change. The spec speaks about a urlPtrPos and I still find this in libzim code. Could someone propose an update of the ZIM spec which prefer whenever possible speaking about paths instead of URLs but still use proper words aligned with the code base? Or should we also update this in the codebase?

mgautierfr commented 3 months ago

I have updated the spec to rename "url" to "path". https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=31354&oldid=31353

With #869 (merged) I think we can close this issue.

benoit74 commented 3 months ago

Thank you!

I think we should consider this as minor change in the specification (new version would be 6.3) so that we can document it clearly in the "Changelog" and it becomes more obvious for all contributors which are not in the core-team / core libzim maintainers.

Once we settled on this suggestion, I agree we can close the issue from my PoV.

rgaudin commented 3 months ago

@mgautierfr and @kelson42 are to answer obviously but I don't think it should trigger a spec version. We've only changed wording in the documentation. The spec has no API per say and all those renamed headers are just offsets in the binary. We can mention previous label for each of the renamed in the Wiki, that would make more sense to me.

mgautierfr commented 3 months ago

I have added a paragraph in spec about the change itself (https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=31356&oldid=31354)

I'm closing the issue. Open it if you disagree.

Jaifroid commented 3 months ago

Apologies, @mgautierfr I was away so didn't get to this earlier. I just noticed a couple of minor typos:

Note that is is just a wording change

Change to "... this is just a wording change"

Implementation doesn't have to change anything

Would be better in plural: "Implementations do not have to change anything". However, if you prefer singular, then it would need to be "The implementation...". But plural makes more sense to me.

I hope this is helpful!

kelson42 commented 3 months ago

@Jaifroid Be bolf, feel free to make the change yourself

Jaifroid commented 3 months ago

OK, will do!

Jaifroid commented 3 months ago

Done: https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=31357&oldid=31356