Closed kelson42 closed 10 months ago
I can fix that, but I would like to draw your attention to the fact that the title of the page at https://dev.library.kiwix.org/content/thalesdoc_en_all_2023-11/A/mp_/https://thalesdocs.com/gphsm/ptk/protectserver3/docs/ps_ptk_docs/psesh/psesh_commands/sysconf/sysconf_appliance/sysconf_appliance_factory/index.html contains newline symbols. Shouldn't it be addressed during scraping?
We probably have additional potential tickets indeed for warc2zim and zimcheck. @rgaudin @benoit74 ?
@veloman-yunkan ... but still, kiwix-serve should not break... and be more tolerant/robust.
spec should be updated as well if we want to forbid those characters (which?)
This issue should be easy to fix by fully escaping the JSON strings (currently only backslashes are escaped).
Note however that this particular case can be fixed by changing (optimizing) the output of the suggestions endpoint as follows (which, IMHO, is justified regardless of this issue).
In the current implementation of suggestions in the viewer the value
attribute in suggestion entries is used only for the last one (suggesting to perform a full search). So if we don't set the value
attribute (containing the text of the title) for the rest (all-but-one) of the suggestion we should be good. Going even further I don't think that it makes a lot of sense for the backend to add the last suggestion proposing a full text search - that can be done in JS code in the frontend.
@mgautierfr What do you think?
This issue should be easy to fix by fully escaping the JSON strings (currently only backslashes are escaped).
Implementing the fix will take less effort than adding a regression test for it. I don't think that this edge case is represented in our test data so I will have to update one of our test ZIM files, which is something that I would like to avoid.
spec should be updated as well if we want to forbid those characters (which?)
Probably not the perfect solution, but might be "good enough" for the moment: https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=31304&oldid=31263
Now we should probably:
zimcheck
I am against putting this kind of things into pylibzim which should remain a wrapper. I know @mgautierfr agrees 😉
We could enforce that in scraperlib though. If there's a clear consensus that all this range should be ignored, we can do it transparently and if not, it could be a check and handling would be in the scrapers (warc2zim mostly)
In the current implementation of suggestions in the viewer the value attribute in suggestion entries is used only for the last one (suggesting to perform a full search). So if we don't set the value attribute (containing the text of the title) for the rest (all-but-one) of the suggestion we should be good.
We should be good. I'm a bit worry about consistency but I'm ok with this move.
Going even further I don't think that it makes a lot of sense for the backend to add the last suggestion proposing a full text search - that can be done in JS code in the frontend.
Ideally yes. But we put the full text search for suggestion only if we have fulltext search available. Doing on the frontend would make more sense, but we have to find a way for frontend to know if fulltext search is available in the zim file.
With cutting-edge dev
kiwix-serve
:Or
JSON is invalid and
kiwix-serve
suggest feature is broken :(