kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0
118 stars 56 forks source link

New scenario for broken suggestion JSON #1020

Closed kelson42 closed 10 months ago

kelson42 commented 10 months ago

With cutting-edge dev kiwix-serve:

$ curl -s "https://dev.library.kiwix.org/suggest?userlang=en&content=thalesdoc_en_all_2023-11&term=factory" | jq
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 35, column 1

Or

$ curl -s "https://dev.library.kiwix.org/suggest?userlang=en&content=thalesdoc_en_all_2023-11&term=factory" 
[
  {
    "value" : "Factory-Installed HSM Configurations",
    "label" : "<b>Factory</b>-Installed HSM Configurations",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.2.1/docs/usb/Content/overview/configurations/factory_configurations.htm"
  },
  {
    "value" : "Factory-Installed HSM Configurations",
    "label" : "<b>Factory</b>-Installed HSM Configurations",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.2/docs/usb/Content/overview/configurations/factory_configurations.htm"
  },
  {
    "value" : "Factory-Installed HSM Configurations",
    "label" : "<b>Factory</b>-Installed HSM Configurations",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.1/docs/network/Content/overview/configurations/factory_configurations.htm"
  },
  {
    "value" : "Factory-Installed HSM Configurations",
    "label" : "<b>Factory</b>-Installed HSM Configurations",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.2.2/docs/usb/Content/overview/configurations/factory_configurations.htm"
  },
  {
    "value" : "Factory-Installed HSM Configurations",
    "label" : "<b>Factory</b>-Installed HSM Configurations",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.0/docs/usb/Content/overview/configurations/factory_configurations.htm"
  },
  {
    "value" : "
            sysconf appliance factory
",
    "label" : "sysconf appliance <b>factory</b>...",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/ptk/protectserver3/docs/ps_ptk_docs/psesh/psesh_commands/sysconf/sysconf_appliance/sysconf_appliance_factory/index.html"
  },
  {
    "value" : "Resetting to Factory Condition",
    "label" : "Resetting to <b>Factory</b> Condition",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.1/docs/network/Content/administration/decommission/factory_reset_sa.htm"
  },
  {
    "value" : "Resetting to Factory Condition",
    "label" : "Resetting to <b>Factory</b> Condition",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/7.2/docs/network/Content/administration/decommission/factory_reset.htm"
  },
  {
    "value" : "Resetting to Factory Condition",
    "label" : "Resetting to <b>Factory</b> Condition",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.2/docs/usb/Content/administration/decommission/factory_reset_sa.htm"
  },
  {
    "value" : "Resetting to Factory Condition",
    "label" : "Resetting to <b>Factory</b> Condition",
    "kind" : "path"
      , "path" : "A/thalesdocs.com/gphsm/luna/6.2.1/docs/usb/Content/administration/decommission/factory_reset_sa.htm"
  },
  {
    "value" : "factory ",
    "label" : "containing 'factory'...",
    "kind" : "pattern"

  }
]

JSON is invalid and kiwix-serve suggest feature is broken :(

veloman-yunkan commented 10 months ago

I can fix that, but I would like to draw your attention to the fact that the title of the page at https://dev.library.kiwix.org/content/thalesdoc_en_all_2023-11/A/mp_/https://thalesdocs.com/gphsm/ptk/protectserver3/docs/ps_ptk_docs/psesh/psesh_commands/sysconf/sysconf_appliance/sysconf_appliance_factory/index.html contains newline symbols. Shouldn't it be addressed during scraping?

kelson42 commented 10 months ago

We probably have additional potential tickets indeed for warc2zim and zimcheck. @rgaudin @benoit74 ?

@veloman-yunkan ... but still, kiwix-serve should not break... and be more tolerant/robust.

rgaudin commented 10 months ago

spec should be updated as well if we want to forbid those characters (which?)

veloman-yunkan commented 10 months ago

This issue should be easy to fix by fully escaping the JSON strings (currently only backslashes are escaped).

Note however that this particular case can be fixed by changing (optimizing) the output of the suggestions endpoint as follows (which, IMHO, is justified regardless of this issue).

In the current implementation of suggestions in the viewer the value attribute in suggestion entries is used only for the last one (suggesting to perform a full search). So if we don't set the value attribute (containing the text of the title) for the rest (all-but-one) of the suggestion we should be good. Going even further I don't think that it makes a lot of sense for the backend to add the last suggestion proposing a full text search - that can be done in JS code in the frontend.

@mgautierfr What do you think?

veloman-yunkan commented 10 months ago

This issue should be easy to fix by fully escaping the JSON strings (currently only backslashes are escaped).

Implementing the fix will take less effort than adding a regression test for it. I don't think that this edge case is represented in our test data so I will have to update one of our test ZIM files, which is something that I would like to avoid.

kelson42 commented 10 months ago

spec should be updated as well if we want to forbid those characters (which?)

Probably not the perfect solution, but might be "good enough" for the moment: https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=31304&oldid=31263

kelson42 commented 10 months ago

Now we should probably:

rgaudin commented 10 months ago

I am against putting this kind of things into pylibzim which should remain a wrapper. I know @mgautierfr agrees 😉

We could enforce that in scraperlib though. If there's a clear consensus that all this range should be ignored, we can do it transparently and if not, it could be a check and handling would be in the scrapers (warc2zim mostly)

mgautierfr commented 10 months ago

In the current implementation of suggestions in the viewer the value attribute in suggestion entries is used only for the last one (suggesting to perform a full search). So if we don't set the value attribute (containing the text of the title) for the rest (all-but-one) of the suggestion we should be good.

We should be good. I'm a bit worry about consistency but I'm ok with this move.

Going even further I don't think that it makes a lot of sense for the backend to add the last suggestion proposing a full text search - that can be done in JS code in the frontend.

Ideally yes. But we put the full text search for suggestion only if we have fulltext search available. Doing on the frontend would make more sense, but we have to find a way for frontend to know if fulltext search is available in the zim file.