openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
44 stars 4 forks source link

Zimit2: Fix URL encoding of ZIM items #206

Closed kelson42 closed 7 months ago

kelson42 commented 8 months ago

Original page: https://thalesdocs.com/gphsm/luna/7/docs/network/Content/Home_Luna.htm

ZIM page: https://dev.library.kiwix.org/viewer#thalesdoc_en_all_2024-03/thalesdocs.com/gphsm/luna/7/docs/network/Content/Home_Luna.htm

In the ZIM, all the PDF are missing, although nothing special about them is to remark from my POV: links are straight links to PDF files in the same mirrored domain.

Remark: PDF filenames have spaces!

Jaifroid commented 8 months ago

Just curious as to whether this is a situation we need to handle. Clearly there's a web design issue on that site in that they forgot to URI-encode the URLs. Browsers are lax enough to work around that. The failure point could be anywhere: in the crawler, in the Archiving software (Replay), or in warc2zim, or even in the reader (Kiwix Serve or other reader)... We should probably double-check the problem isn't in our readers.

EDIT: I see you say the PDFs are missing -- I presume you mean they never made their way into the ZIM at all?

rgaudin commented 8 months ago

ZIM has plenty of PDF:

❯ zimls thalesdoc_en_all_2024-03.zim |grep -E "\.pdf$" |wc -l
     719

Counter says application/pdf=616;

rgaudin commented 8 months ago

Entry thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product%20Overview.pdf is in the ZIM but requesting it seems impossible due to the decoding issue

❯ curl -I 'https://dev.library.kiwix.org/content/thalesdoc_en_all_2024-03/thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product%20Overview.pdf'
HTTP/2 404
Jaifroid commented 8 months ago

Ah, so the issue may be reader-side. Oh no, yet another 2GB file to download in order to test properly.....

Jaifroid commented 8 months ago

I confirm PDFs are in ZIM, but they can't be accessed in the PWA or Browser Extension either. Failure seems to be because the backend URI-decodes the URL before requesting it from the ZIM, but unfortunately, as you see from screenshot the PDF URLs are actually stored URI-encoded in the ZIM.

I think if these URLs were stored with spaces -- the ZIM spec does say that all URLs should be stored unencoded, but not sure where we stand on that wrt Zimit2 --- then I believe the readers would work, though it needs testing.

@mgautierfr?

image

kelson42 commented 8 months ago
Jaifroid commented 8 months ago

Not glad to read this is still an URL encoding problem (why such kind of problems so late in the project!)

This kind of issue was rife with zimit1 -- see the first bullet point in #86 --, but I think the issue was masked by the Replay software. I noticed the inconsistencies because I was trying to handle reading ZIMs without the full Replay system.

Now we don't have the sticking plaster of the exact same software being used to encode and to decode. It would be useful to establish whether zimit2 archives should, at least in principle, respect the OpenZIM spec on the question of storing URLs unencoded or whether we continue to consider them a "special case" like with zimit1. Query strings are a particular issue IMHO, especially as there is an issue with Kiwix Serve not being able to handle a raw ? in a URL-with-querystring if I've understood previous remarks correctly (I may have misunderstood).

mgautierfr commented 8 months ago

I confirm this is a uri encoding issue. This works : https://dev.library.kiwix.org/content/thalesdoc_en_all_2024-03/thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product%2520Overview.pdf Because we have an pdf named thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product%20Overview.pdf in the zim file, which is the url encoded version of thalesdocs.com/gphsm/luna/7/docs/network/Content/PDF_Network/Product Overview.pdf

Now we don't have the sticking plaster of the exact same software being used to encode and to decode. It would be useful to establish whether zimit2 archives should, at least in principle, respect the OpenZIM spec on the question of storing URLs unencoded or whether we continue to consider them a "special case" like with zimit1. Query strings are a particular issue IMHO, especially as there is an issue with Kiwix Serve not being able to handle a raw ? in a URL-with-querystring if I've understood previous remarks correctly (I may have misunderstood).

This is already specified (https://github.com/openzim/warc2zim/blob/warc2zim2/src/warc2zim/url_rewriting.py#L4-L46) Item must be stored using unencoded path (ie Product Overview.pdf)

At this time, I still don't know from where the encoding comes from (warc2zim, scrapper, or source content)

mgautierfr commented 8 months ago

The path is url encoded in the warc (https://thalesdocs.com/gphsm/luna/5.4/docs/usb/Content/PDF_G5/Product%20Overview.pdf) and we normalize it to thalesdocs.com/gphsm/luna/5.4/docs/usb/Content/PDF_G5/Product%20Overview.pdf (we don't url decode it).

Decoding it is pretty simple. The real question is do we have to do it ? Here we know that we have to decode it as we know it doesn't work. But an url as : exemple.com/get100%becauseyouworthit must probably not. And exemple.com/get100%78moremoney may be decoded to exemple.com/get100xmoremoney.

In fact, either:

kelson42 commented 8 months ago

@mgautierfr URL paths should better not be encoded IMHO. Think about the fact that you have actually many different but valid ways to encode the very same URL.

mgautierfr commented 8 months ago

I agree. My point is that we don't know if a path is currently url encoded (and we have to decode it). A path may contain a % and still be already decoded.

Jaifroid commented 8 months ago

In fact, either:

  • we know that browsertrix always url encode the path and we know that we must always decode it
  • it is probably not possible to know if the url is correctly encoded or not (it mainly depends on how the link in the html is encoded or not)

I agree that it is not possible to know whether a specific URL should be decoded or not. I struggled with this a lot in my first Zimit (1) implementation, because (as I noted in #86) sometimes they were stored encoded and sometimes not, and I suppose it depended on some aspect of the encoding in the original website.

One possibility that occurs to me is that the WARC software is in fact encoding all URLs with encodeURI(). Mostly this makes no difference, because if a Web site follows best practice and uses mostly ASCII characters (and no space) for its links, the encoding won't change anything. So we'll only see a problem in those rare cases where a Web site is using exotic characters in links without encoding them or, like in this case, unencoded spaces.

It might be difficult to find a test case to check this. Some Chinese sites?

benoit74 commented 8 months ago

I will check browsertrix crawler behavior and see what has to be done. Might be that we will have to discuss the point with them as well (not a problem).

Jaifroid commented 8 months ago

My idea is that there may be a missing decodeURI step in warc2zim when moving URLs from the WARC into the ZIM. As I mentioned, most URLs can be encoded and decoded as many times as you like without altering them (though the URI component might be a snag in the works here), because most already use only the legal characters for a URL. So we're only likely to see this issue in rare cases like this.

NB I have not looked at the code. This is merely a (wild) guess for now.

benoit74 commented 8 months ago

There is no decodeURI in warc2zim on purpose for now. We do not know (yet) if WARC files (especially the ones from Browsertrix Crawler) contains URLs which are always url-encoded.

If it is not the case, we cannot decode in warc2zim since it will lead to other problems.

URLs can be encoded and decoded as many time as wanted, but you cannot decode something which is not already encoded, or you will transform its value (and create other problems)

Jaifroid commented 8 months ago

Yes, of course, I agree. My experience from zimit1 was mixed: sometimes URLs could only be extracted after decoding, and sometimes only when asking for the encoded version. I never got to the bottom of why... (I actually opted to try encoded, and then try unencoded in my original zimit1 support -- very clunky!).

benoit74 commented 8 months ago

Having a look at browsertrix crawler codebase and behavior, as well as browsers behaviors, I confirm that the crawler does no does not seems to encode any URL on its own, but browsers definitely do. I asked a question at https://github.com/webrecorder/browsertrix-crawler/issues/492 to have confirmation from Webrecorder folks.

The browser is processing the URL found in the HTML code before issuing the request. For example at https://tmp.kiwix.org/ci/test-website/url-encoding.html, the first image src is ./images/urlencoding1_icône-débuter-Solidarité-Numérique_1@300x.png but the browser does request https://tmp.kiwix.org/ci/test-website/images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png and this is exactly the value of the WARC-Target-URI.

So it looks like the path is mostly url-encoded, except special character @ which has been kept "as-is". This seems to be in line with RFC3986 requirements on when to url-encode: https://datatracker.ietf.org/doc/html/rfc3986.html#section-2.4, and its specification of which characters are reserved and should not be encoded.

On https://tmp.kiwix.org/ci/test-website/url-encoding.html, this meant we have two warc records for the 3 images.

Image # HTML value WARC record # WARC record header value
1 ./images/urlencoding1_icône-débuter-Solidarité-Numérique_1@300x.png 1 https://tmp.kiwix.org/ci/test-website/images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png
2 ./images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1%40300x.png 2 https://tmp.kiwix.org/ci/test-website/images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1%40300x.png
3 ./images/urlencoding1_icône-débuter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png 1 https://tmp.kiwix.org/ci/test-website/images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png

Side-node: on tmp.kiwix.org, server always serves the file named without URL encoding (i.e. it decodes the URL before searching for a file), this is why the urlencoding2_xxx and urlencoding3_ images are all broken. I'm not sure it will be the case with all web servers.

Back to warc2zim, I think that the change needed on the scraper is not exactly straightforward.

solution 1

For every src / href found in the scraper, we need to url-encode it to find the proper WARC record. The opposite (decoding the WARC record header) is not possible because otherwise we will have a collision/confusion between WARC records 1 and 2 in table above which have the same decoded value but represent two records with potentially different content. While probably an extreme edge case, it might happen.

How to properly url-encode is still a bit of a mystery, it looks like we must first split the URL on every reserved character of RFC3986 and only url-encode each part without the reserved characters which are kept as-is.

This also means that while we have to create two ZIM entries (one per WARC record above), we are supposed to store their path without URL encoding to match the ZIM specification. So the path of both ZIM entries would be identical ... and we would hence have a conflict.

solution 2

Another solution would be to consider this is too complex, that the risk of collision illustrated in https://tmp.kiwix.org/ci/test-website/url-encoding.html is both extremely rare AND usually leading to the same content, and we still do not know how to store these collisions in the ZIM.

This would mean we can directly decode the WARC-Target-URI because we know it is always encoded, and then only process decoded values.

For the rare collision situations, we can search for these collisions before anything else in the scraper, and then check if content is really distinct, in which case it raises an error and stops the scraper, so we will know about it. Since we do not expect this to happen, it would probably not be a concern.

conclusion

My preference for now is to go for solution 2. While not totally exact (collisions could theoretically happen), it is both the simplest move and the one the more in-line with the ZIM specification which expects paths without URL encoding.

Feedback on this is welcomed!

rgaudin commented 8 months ago

Solution2 would still need to decode properly according to the RFC (ie. part by part). We've seen that already. On WikiHow maybe?

benoit74 commented 8 months ago

Why do you need to decode part by part in solution 2? Since the URL is already encoded, every % sign is encoded, so you can decode everything at once safely. I don't get it, sorry. And I don't remember to have already seen it (even if you are probably right)

Jaifroid commented 8 months ago

Just to add my tuppence: in JavaScript at least (and I know that may not be the context in which decoding would be done in warc2zim2), there are two decoding functions: decodeURI and decodeURIComponent. They both decode slightly different things.

The former does not decode encoded characters that are part of the URI syntax (e.g. ;, /, ?, :, @, &, =, +, $, #). The latter only decodes things that are between those separators (the components of the URL).

So:

console.log(decodeURI("%26")); // Outputs: "%26"
console.log(decodeURIComponent("%26")); // Outputs: "&"

Which to use would depend on how a URL has been encoded in the first place. Component separators normally wouldn't be in encoded form unless they've been "wrongly" encoded. A raw ampersand should always be part of the querystring, and should never appear elsewhere in an encoded URL (and it should never be encoded if it is part of the querystring and is a separator).

For example, if we have a URL that looks like: Storybook.html?title=Hansel%26Gretel&author=Grimm, we must be careful not to decode the first %26 to an ampersand, as it will then create an invalid URL. And I presume it must be stored in the ZIM without decoding the first ampersand as well.

Jaifroid commented 8 months ago

Sorry for further post, but I presume we need to be 100% clear on whether we want to store valid URLs in the ZIM, or whether we want to store fully-decoded strings that are not in fact valid URLs (which doesn't matter, so long as the correct string is presented to the backend to be able to locate the entry in the ZIM). And do we want to treat the querystring differently from the rest of the URL?

In the end, unless we want to do a lot of work adapting the readers, we are forced to store in the ZIM a URL in a form that current readers (or libzim?) will transparently derive from a given URL.

benoit74 commented 8 months ago

You are right, we need to decode only the path, not the query and probably not the fragments. Thank you

benoit74 commented 8 months ago

Nevermind, fragments are ignored to compute the entry path anyway, so we just need to decode the path.

benoit74 commented 8 months ago

I wonder what we should do regarding hostname which are stored in punycode in the WARC (because the browser requests it with punycode). This is definitely an edge case, but needs to be tackled as well. The complexity comes from the fact that:

Another concern I have is that we say that ZIM item path should not be urlencoded but match RFC1738: https://wiki.openzim.org/wiki/ZIM_file_format#URL_Encoding

The URLs in the UrlPointerlist are utf-8 and are not url encoded (https://www.ietf.org/rfc/rfc1738.txt)

This is a contradiction, because as far as I've understood, utf-8 chars are not allowed in HTTP url according to RFC1738. There is something I don't get.

Jaifroid commented 8 months ago

This is a contradiction, because as far as I've understood, utf-8 chars are not allowed in HTTP url according to RFC1738. There is something I don't get.

This is what I meant when I said we need to be 100% clear whether we are storing valid URLs, or whether we are simply storing ZIM URLs, which are strings that are not valid URLs.

The difference is that traditionally, a ZIM URL would have fully decoded characters, for example, a title like Hansel & Gretel would have a ZIM URL like C/Hansel & Gretel instead of C/Hansel%20%26%20Gretel. With Zimit 1, we suddenly found that we were storing URLs, so we would get C/A/www.storybook.com/Hansel%20%26%20Gretel. This violates the OpenZIM spec (which of course can be changed if necessary).

As I understand it, the proposal is to revert to ZIM URLs, which means we store unencoded. It doesn't matter of course whether what is stored is a valid URL or not, it just has to be a string pointer to the asset that we want to extract from the ZIM. The point is that the readers, as they currently stand (assuming we don't want to modify them) must be able to derive C/Hansel & Gretel from a URL like ../Hansel%20%26%20Gretel.

EDIT: mwOffliner uses underscores rather than spaces, but I think this is merely following the way Wikipedia URLs are represented for specific article titles. In practice, as we've seen, we rarely encounter spaces in a URL either encoded or unencoded.

EDIT2: Querystrings pose specific problems, because existing readers may not know how to deal with them, and probably don't decode them from, say, a URL hyperlink (when user clicks on one). But this needs empirical testing (or perhaps all those based on libzim / libkiwix have a standard behaviour).

Jaifroid commented 8 months ago

Some other examples from the Wikipedia ZIM URL list, showing that these are definitely not valid URLs:

image

benoit74 commented 8 months ago

OK, but then why the ZIM specification is speaking about "URL"? And why are we referring to RFC1738?

Either the word "URL" is badly chosen (I can live with it) and we stop referring to RFC1738, or there is something I still don't get.

If we admit that, it means for zimit that:

This seems a bit weird (lots of different way to handle things), but is probably the simplest (only?) solution if we do not want to modify all readers!

kelson42 commented 8 months ago

@benoit74 I would recommend here to have a:

Jaifroid commented 8 months ago

@benoit74 My cautions about querystring handling are not based on testing in all the readers, just on what I had to do to make it work in KJS / PWA. An empirical, test-driven solution focusing on libzim's handling sounds good.

Jaifroid commented 8 months ago

Summary of where we're at and why it's a Catch22 situation IMHO

What we found out in https://github.com/openzim/libzim/issues/865:

Why this is a Catch22 situation to deal with:


This is not hypothetical. There is no way I can code a workaround for this situation that is guaranteed to work in all cases, and it's not possible either in Kiwix Serve to deal with percent-encoded URLs that contain encoded spaces or legitimately encoded question marks with 100% certainty. I see only two solutions:

  1. EITHER: Stop over-encoding the querystring and sort out the code that strips out real querystrings in libkiwix and/or the server
  2. OR: Add an extra decoding step in warc2zim2 so that why_do_cats_smile%3F.html?theme=dark&title=Why%20do%20cats%20smile%3F will be stored in the ZIM as why_do_cats_smile?.html?theme=dark&title=Why do cats smile?

Personally, I think there are probably too many unknowns and risks to do 2 at this late stage, not least the huge querystrings that control video. This drives me to the conclusion that 1 is what is needed, and that you should additionally alter the OpenZIM spec to say that Zimit ZIMs (zimit1 and zimit2) store conforming RFC1738 URLs with percent encoding. I think this is de facto the case.

benoit74 commented 8 months ago

Solution 1 is a nogo for me on the short term because it probably means updating all readers, not only libkiwix (you can maybe easily update KJS, but we cannot update all readers which are in the wild easily, we are an offline service providers and cannot assume readers are regularly updated).

And it does not solve the core of this issue. Querystring are not questioned in Thales ZIM, all but URLs with special characters are not working. Any item with a special character is simply not working currently.

I'm close to have implemented something like solution 2. We will discuss this in corresponding PR and you will have a ZIM to test soon (by the end of the day hopefully).

Jaifroid commented 8 months ago

I'm close to have implemented something like solution 2. We will discuss this in corresponding PR and you will have a ZIM to test soon (by the end of the day hopefully).

I understand and thanks for your hard work! No rush from my end. I'll be happy to test.

Any item with a special character is simply not working currently.

In my draft PR I got Thale PDFs working in KJS by re-encoding the (decoded) URL with encodeURI() (in my case, in the reader) just before extracting from the ZIM. Seems that might not be an option for you, though if you can't alter any code in libkiwix to fix this.

mgautierfr commented 8 months ago

My preference for now is to go for solution 2. While not totally exact (collisions could theoretically happen), it is both the simplest move and the one the more in-line with the ZIM specification which expects paths without URL encoding.

Feedback on this is welcomed!

I totally agree with solution 2.

Having two records in the warc seems to be a "bug" for me. (Not a real bug from warc pov as warc store a visit of a site web, not a site web) The string ./images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1%40300x.png (link nb 2) in the html is "overencoded" (in the sens that some chars are encoded even if we don't need it) but all links decoded to the same real/canonical url : ./images/urlencoding1_icône-débuter-Solidarité-Numérique_1@300x.png

We should store this one and only this one. Having two different content for the same url would be a bug.

If we know that the browser always urlencode than we should always urldecode.

Example, say user clicks on a link : why_do_cats_smile%3F.html%3Ftheme%3Ddark

I think you example is wrong. why_do_cats_smile%3F.html%3Ftheme%3Ddark[1] is a url (or can be interpreted as) and this url doesn't contain any querystring. That is the purpose of urlencoding : have a url with ? char which is not interpreted as a separator between path and querystring.

In this case, the "real" path is why_do_cats_smile?.html?theme=dark[2] and (according to current specification), we have to store it as it. When user click on the link, we receive [1], decode it to [2] (libmicrohttpd does) and we are good.

Jaifroid commented 8 months ago

Thank you -- I'm perfectly happy with Solution 2, and can work with it. It respects the spec. Was just getting nervous about how it would work in practice with zimit2 ZIMs (I presume we don't touch zimit1).

Jaifroid commented 7 months ago

For those testing the solidarité numérique ZIM in #218, I found an embedded video in the article "Utiliser le Décodex pour vérifier les fausses informations ou fake news" (it's the tile next to "Comprendre les cookies".) The video works fine in the PWA (v3.1.5), but it's not working in Kiwix Desktop 2.3.1 (on Windows):

image

I'm posting it in the issue rather than in the PR, because I don't think it's related to the specific PR, but could be related to encoding of querystrings. @benoit74 I can't remember where we're at with video in zimit2, and whether this would be a regression or not.

EDIT: Video also not working in latest nightly (27-03-24) appimage version on Linux (WSL).

benoit74 commented 7 months ago

Solved by https://github.com/openzim/warc2zim/pull/218