kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

& symbol not redirected properly and makes additional symbols #587

Closed nijazm closed 1 year ago

nijazm commented 1 year ago

When I open the English Wikipedia ZIM served by kiwix-serve in my web browser, type the symbol & into the search box and click on the autocomplete result, the search bar automatically shows &amp; and I am taken to the fulltext search, so the URL becomes something like this: http://localhost:8181/search?content=wikipedia_en_all_maxi_2021-12&pattern=%26amp%3B

However, when I enter the URL http://localhost:8181/wikipedia_en_all_maxi_2021-12/A/& directly, it redirects properly to http://localhost:8181/wikipedia_en_all_maxi_2021-12/A/Ampersand

I have not tested other symbols, but this reminds me of similar errors I have seen on some sites, where entering the symbol " also causes additional characters to appear. Whether it is an encoding problem or something else, I don't know. A similar error also happens in the kiwix-desktop app: there is no autocomplete result for &, only the fulltext search.
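The pattern=%26amp%3B value in the reported URL can be reproduced with standard JavaScript functions. The sketch below only illustrates how such a value can arise. It assumes that the term was HTML-escaped before being percent-encoded, which is a guess about the cause and not something confirmed at this point in the thread.

```javascript
// A raw search term, percent-encoded once for use in a query string.
const raw = '&';
const encodedOnce = encodeURIComponent(raw); // '%26'

// Assumption (not confirmed by the thread): the term was HTML-escaped
// first ('&' becomes '&amp;') and the escaped text was then percent-encoded.
const htmlEscaped = raw.replace(/&/g, '&amp;');
const encodedAfterEscape = encodeURIComponent(htmlEscaped); // '%26amp%3B'

// Decoding the pattern from the reported URL yields the HTML entity,
// not the bare ampersand that was typed.
const reportedPattern = decodeURIComponent('%26amp%3B'); // '&amp;'

console.log(encodedOnce, encodedAfterEscape, reportedPattern);
```

A fulltext search for the literal text &amp; would then look for the entity text and not for the ampersand symbol.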

kelson42 commented 1 year ago

I would wait until we support the latest libzim in node-libzim and mwoffliner before investigating this. I strongly suspect this has already been fixed in libzim. Clearly depends on https://github.com/openzim/mwoffliner/issues/1576

veloman-yunkan commented 1 year ago

This issue is present (though in a slightly different way) with the new iframe-based viewer too - the fulltext search URL is http://localhost:8181/search?content=wikipedia_en_all_maxi_2021-12&pattern=& where the ampersand symbol in pattern=& is not URL encoded.
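To see why an unencoded ampersand in pattern=& matters, here is a minimal sketch using the standard URL and URLSearchParams APIs. It shows how a standards-following query parser reads the two variants. How kiwix-serve itself parses the query string is not shown in this thread, so treat the parallel as an assumption.

```javascript
const base =
  'http://localhost:8181/search?content=wikipedia_en_all_maxi_2021-12';

// Variant 1: the ampersand is left as is, so it acts as a parameter separator.
const unencoded = new URL(base + '&pattern=&');
const patternUnencoded = unencoded.searchParams.get('pattern'); // ''

// Variant 2: the ampersand is percent-encoded and survives as data.
const encoded = new URL(base + '&pattern=' + encodeURIComponent('&'));
const patternEncoded = encoded.searchParams.get('pattern'); // '&'

// Building the URL through searchParams encodes the value automatically.
const built = new URL('http://localhost:8181/search');
built.searchParams.set('content', 'wikipedia_en_all_maxi_2021-12');
built.searchParams.set('pattern', '&');

console.log(JSON.stringify(patternUnencoded), patternEncoded, built.href);
```

In the first variant the search pattern arrives empty, which is consistent with the search not finding anything for the ampersand.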

kelson42 commented 1 year ago

@veloman-yunkan OK, so at least we can fix that one.

veloman-yunkan commented 1 year ago

BTW, the issue described in my previous comment occurs under Firefox 107.0. Debugging shows strange, counter-intuitive things happening, like the browser implicitly decoding URLs ~during assignment to innerHTML attribute of DOM elements~ (this actually turns out to be an inherent property of the href attribute; see the comments below). I am not sure that web browsers based on a different web engine behave the same way, which may explain the issue as observed by the OP.

@nijazm What is your browser?

veloman-yunkan commented 1 year ago

So it turned out to be automatic decoding of URL-encoded characters in the value of the href attribute of the <a> HTML element.

Proof on a minimal example:

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <a href="javascript:alert('ABCD%26EFGH')">Click me!</a>
  </body>
</html>

When the link is clicked, the message box displays "ABCD&EFGH" (instead of "ABCD%26EFGH").

veloman-yunkan commented 1 year ago

A more convincing version:

<!DOCTYPE html>
<html>
  <head>
    <script>
      function foo() { alert('ABCD%26EFGH'); }
    </script>
  </head>
  <body>
    <a href="javascript:alert('ABCD%26EFGH')">Click me!</a>
    <a href="javascript:foo()">Click me too!</a>
  </body>
</html>

The first hyperlink, which contains inline JavaScript in the href attribute, displays URL-decoded text. The second hyperlink displays the intended text as is.
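The behaviour of the first link matches how browsers handle javascript: URLs: the part after the scheme is percent-decoded before it is run as a script. The sketch below imitates that step outside the browser. The helper names scriptFromJavascriptUrl and hrefForScript are made up for this illustration, and decodeURIComponent is only an approximation of the browser's percent-decoding (it throws on malformed sequences, which a browser does not).

```javascript
// Hypothetical helper: approximates what the browser does with a
// javascript: URL, namely strip the scheme and percent-decode the rest.
function scriptFromJavascriptUrl(href) {
  const body = href.slice('javascript:'.length);
  return decodeURIComponent(body);
}

// Hypothetical helper: percent-encode the script so that one round of
// decoding gives back exactly the original text ('%' becomes '%25').
function hrefForScript(script) {
  return 'javascript:' + encodeURIComponent(script);
}

const naiveHref = "javascript:alert('ABCD%26EFGH')";
const naiveScript = scriptFromJavascriptUrl(naiveHref);
// naiveScript is "alert('ABCD&EFGH')": the %26 was decoded to '&'.

const safeHref = hrefForScript("alert('ABCD%26EFGH')");
const safeScript = scriptFromJavascriptUrl(safeHref);
// safeScript is "alert('ABCD%26EFGH')": the text survives unchanged.

console.log(naiveScript, safeHref, safeScript);
```

The second link in the example above avoids the problem in a different way: the string lives in a regular script element, so it never passes through URL decoding.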

veloman-yunkan commented 1 year ago

A somewhat related question on Stack Overflow: https://stackoverflow.com/questions/33721510/why-use-url-encoding-instead-of-html-encoding-for-the-href-attribute

nijazm commented 1 year ago

The same issues occur in the latest versions of Chrome and Firefox on Windows 11.

veloman-yunkan commented 1 year ago

@nijazm This should now be fixed on master but it looks like you are using the previous release of kiwix-serve. Is that correct? What is the output of kiwix-serve --version on your side?

kelson42 commented 1 year ago

To be checked with the latest nightly: https://download.kiwix.org/nightly/

kelson42 commented 1 year ago

Fixed by https://github.com/kiwix/libkiwix/pull/859

nijazm commented 1 year ago

I just tested today's version of Kiwix Desktop and kiwix-tools on Windows 11. Now only the fulltext search entry is shown as an autocomplete result for the & symbol, and when I click on it the app says No results were found for "&". This happens both in Kiwix Desktop and with kiwix-serve (in web browsers). The search box shows containing '&'. When I open the URL that I found in the Network tab of the browser's developer tools, http://localhost:8181/suggest?content=wikipedia_en_all_maxi_2021-12&term=%26, this JSON is shown:

[
  {
    "value" : "&amp; ",
    "label" : "containing &apos;&amp;&apos;...",
    "kind" : "pattern"

  }
]

When I change the URL to use a literal & instead of the percent code, so that it reads http://localhost:8181/suggest?content=wikipedia_en_all_maxi_2021-12&term=&, this is the JSON response:

[
  {
    "value" : " ",
    "label" : "containing &apos;&apos;...",
    "kind" : "pattern"

  }
]

veloman-yunkan commented 1 year ago

Now that's a different problem. Most likely, the ampersand symbol is treated as punctuation and is simply discarded during the creation of the title index as well as when running suggestion search on it.

Ideally, while building the title index we should handle article names consisting of a single symbol or word in a special way, letting those terms go into the title index as is despite any rules that drop punctuation and stopwords. Also we will have to enhance the suggestion search so that it accounts for such an addition to the title index.
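The idea in the previous paragraph can be sketched as follows. This is a toy tokenizer written only for illustration: the stopword list, the punctuation rule and the function name indexTerms are all assumptions and do not reflect the actual indexing code in libzim.

```javascript
// Hypothetical stopword list; the real list used for indexing is not
// given in this thread.
const STOPWORDS = new Set(['a', 'an', 'the', 'of']);

// Hypothetical tokenizer for article titles.
function indexTerms(title) {
  const words = title.toLowerCase().split(/\s+/).filter(Boolean);

  // Special case: a title made of a single symbol or word goes into the
  // index as is, even if the usual rules would drop it.
  if (words.length === 1) {
    return [words[0]];
  }

  // Usual rules: strip punctuation, then drop empty terms and stopwords.
  return words
    .map((w) => w.replace(/[^\p{L}\p{N}]/gu, ''))
    .filter((w) => w !== '' && !STOPWORDS.has(w));
}

console.log(indexTerms('&'));           // single symbol is kept
console.log(indexTerms('The'));         // single stopword is kept
console.log(indexTerms('Rock & Roll')); // '&' is dropped inside a longer title
```

For suggestions to find such titles, the search side would have to tokenize the query with the same special case, so that the query & is looked up as the term & and not reduced to an empty query.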

@kelson42 @mgautierfr What do you think? Is this issue worth the effort required to fix it?

kelson42 commented 1 year ago

@veloman-yunkan I'm slightly lost. I would really appreciate a new ticket with a clear reproduction case.