kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

Long titles in suggestion dropdown choices are no longer readable (regression) #513

Open holta opened 2 years ago

holta commented 2 years ago

1) Try a Title Search using kiwix-tools_linux-x86_64-2021-12-17.tar.gz on http://iiab.me/kiwix/wikipedia_en_all_maxi_2021-03/ using the word "apple" — and then click on the topmost of the 10 choices in the search dropdown.

image

It will send all browsers to this 403 Forbidden page:

http://iiab.me/kiwix/wikipedia_en_all_maxi_2021-03/A/.apple

Does anybody know why a dot (period) gets added to the left of this word (apple), within the result URL above?

Does anybody know why this affects the word "apple" in particular, but does not affect many other Title Searches? How common is this problem among other Title Searches?

2) Is there any way to improve the ability to choose among the 10 dropdown choices in the screenshot above?

When the same single English word is shown in all 10 choices above (with almost no context except for ellipsis etc!) it's suddenly now a lot harder for users to make an intelligent choice.

Whereas in the past, the exact same Title Search (on the word "apple", when using kiwix-serve 3.1.2-5 from 2021-06-09 / 2021-06-10) offered 10 much more readable options — as seen in the search dropdown below:

image

holta commented 2 years ago

I've clarified the explanation above — thanks to anybody who might be able to explain / understand what's happening!

maneeshpm commented 2 years ago

Testing with the latest git masters of libzim/libkiwix/kiwix-serve on wikipedia_en_all_mini_2021-01, a similar search leads to the correct page http://localhost:8080/wikipedia_en_all_mini_2021-01/A/.apple without any error. @kelson42 Are you able to reproduce the issue with the above-mentioned version?

@holta We try to follow a suggestion system that is very close to the actual Wikipedia search. If you search Apple on Wikipedia, you will find results of at most 1 or 2 words in the top 10 suggestions, the first word being apple. If a user is searching for apple and our index contains a closest result apple, that should be the best result, rather than Apples to Apples as in the previous versions. I agree that our system is not "intelligent" because it does not take into account factors like page visits or a popularity index, as more sophisticated search engines do, but we intend to give the user sensible matches to what they actually search for.

holta commented 2 years ago

@maneeshpm do you know why the search dropdown repeatedly shows "Apple..." leaving 80% of the horizontal real estate completely unused?

(No matter what 10 suggestions are offered — there really ought to be a way to visually distinguish between the offered choices — before clicking on any one of them!)

maneeshpm commented 2 years ago

That's a valid concern, for some reason the entire name is not being shown. I'll dig into the issue.

kelson42 commented 2 years ago
  1. Try a Title Search using kiwix-tools_linux-x86_64-2021-12-17.tar.gz on http://iiab.me/kiwix/wikipedia_en_all_maxi_2021-03/ using the word "apple" — and then click on the topmost of the 10 choices in the search dropdown. It will send all browsers to this 403 Forbidden page: http://iiab.me/kiwix/wikipedia_en_all_maxi_2021-03/A/.apple

I cannot confirm this behaviour. Clicking on any of the suggestions leads to the right article. In the future, please test against a standalone kiwix-serve which is not in a special environment or behind a reverse proxy.

Does anybody know why a dot (period) gets added to the left of this word (apple), within the result URL above?

@maneeshpm I confirm this behaviour and it does not seem normal to me.

Here is the json:

  {
    "value" : "Apple //",
    "label" : "<b>Apple</b>...",
    "kind" : "path",
    "path" : "A/Apple_//"
  },
  {
    "value" : "Apple ///",
    "label" : "<b>Apple</b>...",
    "kind" : "path",
    "path" : "A/Apple_///"
  },
  {
    "value" : "Apple®",
    "label" : "<b>Apple</b>...",
    "kind" : "path",
    "path" : "A/Apple®"
  },

The "value" JSON property seems correct, so I wonder why the "label" JSON property has an ellipsis in place of characters such as / or ®. Do you know more?
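To make the label collision concrete, here is a minimal sketch (not part of kiwix-serve; the helper name group_by_label is hypothetical) that groups the three entries quoted above by their "label" and lists the distinct "value" strings hidden behind each one:

```python
import json
from collections import defaultdict

# The three suggestion entries quoted above, wrapped in a JSON array.
SUGGESTIONS = json.loads("""
[
  {"value": "Apple //",  "label": "<b>Apple</b>...", "kind": "path", "path": "A/Apple_//"},
  {"value": "Apple ///", "label": "<b>Apple</b>...", "kind": "path", "path": "A/Apple_///"},
  {"value": "Apple®",    "label": "<b>Apple</b>...", "kind": "path", "path": "A/Apple®"}
]
""")


def group_by_label(suggestions):
    """Map each displayed label to the list of titles shown under it."""
    groups = defaultdict(list)
    for entry in suggestions:
        groups[entry["label"]].append(entry["value"])
    return dict(groups)


# Keep only labels that stand for more than one title.
ambiguous = {
    label: values
    for label, values in group_by_label(SUGGESTIONS).items()
    if len(values) > 1
}

for label, values in ambiguous.items():
    print(f"{label!r} is shown for {len(values)} different titles: {values}")
```

Run against a full suggestion response, the same grouping would show how many of the 10 dropdown entries are visually indistinguishable.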

2. Is there any way to improve the ability to choose among the 10 dropdown choices in the screenshot above?
   When the same single English word is shown in all 10 choices above (with almost no context except for ellipsis etc!) it's suddenly now a lot harder for users to make an intelligent choice.

The suggestions are not pointing to the same article, you have this feeling only because the labels with the ellipsis are identical and because of this HTTP 403 error in IIAB. AFAIK, besides this strange ellipsis behaviour, everything works fine.

   Whereas in the past, the exact same Title Search (on the word "apple", when using kiwix-serve 3.1.2-5 from 2021-06-09 / 2021-06-10) offered 10 much more readable options — as seen in the search dropdown below:

As Maneesh pointed out, the current results are more relevant than before. To me, we just need to clarify why we get an ellipsis in place of the "real title".

holta commented 2 years ago

The suggestions are not pointing to the same article, you have this feeling

No I do not have this feeling.

I'm not sure why people are claiming this (incorrectly).

kelson42 commented 2 years ago

@maneeshpm After researching a bit, it seems:

But I see no ticket of that kind open upstream: https://trac.xapian.org/search?q=snippet&noquickjump=1&ticket=on

Or maybe we should just wait to see if things are still wrong once we generate Wikipedia ZIM files with libzim7, considering that you have massively improved the ZIM creator?

maneeshpm commented 2 years ago

Or maybe we should just wait to see if things are still wrong once we generate Wikipedia ZIM files with libzim7, considering that you have massively improved the ZIM creator?

@kelson42 our snippets are generated entirely by Xapian::MSet::snippet(), which leaves us very little control. I guess waiting and checking whether this issue persists with new ZIM files is the way forward.

PS. I would like to mention that in my limited testing with wikipedia_en_all_mini_2021-01, I was not able to find any case where useful info was omitted (replaced with ...). Only trailing parentheses or non-word characters were being omitted. I have yet to confirm this.
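One way to check this observation over more titles is sketched below. It is only a rough check and does not model Xapian's snippet logic: it assumes each label is a prefix of the title, optionally wrapped in HTML tags and followed by "...". The helper names omitted_text and drops_word_characters are hypothetical, and the "Apple Inc." pair is an invented example, not output observed from kiwix-serve.

```python
import re


def omitted_text(value, label):
    """Return the part of the title `value` that the snippet `label` leaves out.

    Assumes the label is a prefix of the title, possibly with HTML tags and a
    trailing "..."; returns None when that assumption does not hold.
    """
    visible = re.sub(r"<[^>]+>", "", label)  # drop highlighting tags such as <b>
    if visible.endswith("..."):
        visible = visible[:-3]
    if not value.startswith(visible):
        return None
    return value[len(visible):]


def drops_word_characters(value, label):
    """True if the omitted part contains at least one letter or digit."""
    rest = omitted_text(value, label)
    return rest is not None and any(ch.isalnum() for ch in rest)


pairs = [
    ("Apple //", "<b>Apple</b>..."),    # from the JSON quoted in this thread
    ("Apple ///", "<b>Apple</b>..."),   # from the JSON quoted in this thread
    ("Apple®", "<b>Apple</b>..."),      # from the JSON quoted in this thread
    ("Apple Inc.", "<b>Apple</b>..."),  # hypothetical pair, for contrast only
]

for value, label in pairs:
    print(repr(value), "omits", repr(omitted_text(value, label)),
          "word characters dropped:", drops_word_characters(value, label))
```

For the three entries quoted earlier, only spaces, slashes and the ® sign are omitted, which is consistent with the observation above. A pair flagged as dropping word characters would be a counterexample worth reporting.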

holta commented 2 years ago

That's a valid concern, for some reason the entire name is not being shown. I'll dig into the issue.

Thanks @maneeshpm.

Another valid concern is why the most insignificant article (even Wikipedia has since removed the article on Apple Inc's .apple vanity domain name: https://en.wikipedia.org/w/index.php?title=.apple&redirect=no) is placed at the top of the search dropdown list.

A child wanting to learn about real-world apples...should probably be able to do that...without too many clicks (-:

(Of course the dropdown not showing the dot on the left-side of .apple further confuses this difficult user experience.)

In Any Case: while it's likely not possible to fix this in 2021 (e.g. kiwix-tools 3.2.0 is needed by many schools in coming weeks if possible!) this extremely odd ordering[*] has room for improvement in future years ;)

[*] Presumably it's alphabetically ordered among a long list, at the moment?

kelson42 commented 2 years ago

@holta The content index does not really have a way to know what article is important or not. It can only see whether there is a word match between the article and the search pattern. For the moment we cannot expect it to know that. But we have a project to improve that, see for example https://github.com/openzim/libzim/issues/653

holta commented 2 years ago

@kelson42 thanks for explaining & thanks for opening openzim/libzim#653

The content index does not really have a way to know what article is important or not.

A Short-Term Suggestion for "2022" :

If the child searches for "apple", how about showing them the article they actually searched for?

https://en.wikipedia.org/wiki/apple

Or...the (identical after redirect) article:

https://en.wikipedia.org/wiki/Apple

Instead of accidentally/prominently advertising ~10 different Apple(TM) products to the young child!

RECAP: Consider using the search string itself — to help populate the search dropdown — when an article exists with that very same title?
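As a rough illustration of this recap (a sketch only, not how libkiwix or kiwix-serve orders suggestions), the function below moves any suggestion whose title equals the search string, ignoring case, to the front of the list. The name promote_exact_match and the sample list are hypothetical, and the sketch only reorders suggestions that were already returned; it does not look anything up in the ZIM file.

```python
def promote_exact_match(query, suggestions):
    """Move suggestions whose title equals the query (ignoring case) to the front.

    The relative order of all other suggestions is left unchanged.
    """
    wanted = query.strip().casefold()
    exact = [s for s in suggestions if s["value"].casefold() == wanted]
    others = [s for s in suggestions if s["value"].casefold() != wanted]
    return exact + others


# Hypothetical suggestion list; only the "value" field matters here.
suggestions = [
    {"value": ".apple"},
    {"value": "Apple //"},
    {"value": "Apple"},
    {"value": "Apple®"},
]

reordered = promote_exact_match("apple", suggestions)
print([s["value"] for s in reordered])
```

With this sample list, the "Apple" entry moves ahead of ".apple" while the remaining entries keep their original order.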

kelson42 commented 2 years ago

@holta We are drifting from the original bug report. I would wait until the newest WPEN ZIM files have been made with libzim7 and then see how things behave. If there is still a problem then, please open a new ticket.

kelson42 commented 2 years ago

Depends on https://github.com/openzim/mwoffliner/issues/1606