openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Introducing popularity/importance factor at indexing time #653

Open kelson42 opened 2 years ago

kelson42 commented 2 years ago

The problem is that the current indexing (both full-text and suggestion) is based only on word occurrence. Even though this works well, there is no easy way for the index to know that the Wikipedia article "Apple" is more important than ".apple".

The idea would be to be able to slightly tweak the current algorithm by supplying an external numerical factor at indexing time. That way we could apply a weighting and effectively give better search results.

In the case of Wikipedia such a number could be for example computed (and is actually already computed) within the WP1 project.

holta commented 2 years ago

Thanks for this exploratory thread.

A Short-Term Suggestion for "2022" :

If the child searches for "apple", how about showing them the article they actually searched for?

https://en.wikipedia.org/wiki/apple

Or...the (identical after redirect) article:

https://en.wikipedia.org/wiki/Apple

Instead of accidentally/prominently advertising ~10 different Apple(TM) products to the young child!

RECAP: Consider using the search string itself — to help populate the search dropdown — when an article exists with that very same title?

kelson42 commented 1 year ago

I propose an additional index-time parameter called popularity, which would be a value between 0 and 100 (100 being extremely popular).

Open questions:

@rgaudin @mgautierfr @veloman-yunkan It’s architecture time, your feedback is required.

veloman-yunkan commented 1 year ago

@kelson42 My feeling is that introducing this feature may create some trouble (at least, initially).

If the popularity value is based on page visit counts, then the popularity of some pages may be proportional to their age rather than represent the objective interest in the information on that page. For example, some old questions on Stack Overflow are very highly rated, but current interest in them is much lower since those technologies have become outdated.

rgaudin commented 1 year ago

@veloman-yunkan those are valid questions, but they are scraper-level ones… probably to be answered for each scraper. @kelson42 mentioned the WP1 data for mwoffliner.

Should this popularity information only feed the indexer or should it create a sorted entry listing as well?

mgautierfr commented 1 year ago

First of all, I disagree with the popularity naming; I would prefer some kind of sorting order or importance semantics.

Popularity may be a source of information for the scraper when setting the importance, but the sorting order doesn't have to be based on popularity, which is really Wikipedia-oriented (and, as @veloman-yunkan mentions, really calls for caution).

However, adding an "importance" field is OK for me and it is probably not technically difficult. But the questions raised by @kelson42 are important ones and not easy to answer.

If we take the example of Apple, we want to sort the "apple" results by "importance". But we don't want to promote "popular" unrelated content just because it has "apple" somewhere in its content. Simply sorting the results by popularity/importance is not a good solution.
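
To illustrate the difference, here is a minimal Python sketch, not libzim or Xapian code. The `Hit` class, the `blended_score` function, the `weight` parameter and all the numbers are hypothetical; the importance value is assumed to be on the 0..100 scale proposed earlier in this thread. It contrasts a pure sort by importance with a bounded boost applied on top of the text-match score.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    relevance: float  # text-match score from the search engine (made-up values)
    importance: int   # hypothetical external factor, 0..100


def blended_score(hit: Hit, weight: float = 0.3) -> float:
    """Scale relevance by at most (1 + weight).

    Importance can only boost a page that already matches the query;
    it cannot make a weak match outrank a strong one by itself.
    """
    return hit.relevance * (1.0 + weight * hit.importance / 100.0)


hits = [
    Hit("Apple", relevance=0.90, importance=80),
    Hit(".apple", relevance=0.95, importance=5),
    Hit("String interpolation", relevance=0.20, importance=95),
]

# Pure importance sort: ignores how well the page matches the query.
by_importance = sorted(hits, key=lambda h: h.importance, reverse=True)

# Bounded boost: relevance stays the dominant signal.
by_blend = sorted(hits, key=blended_score, reverse=True)
```

With these made-up values, the pure importance sort puts the unrelated "String interpolation" page first, while the bounded boost keeps it last and only reorders pages whose text relevance is already close.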

veloman-yunkan commented 1 year ago

@rgaudin @mgautierfr I am opposed to the idea of embedding in ZIM files transient/dynamic/target-audience-dependent information like popularity, importance or whatever else one may call it. A first idea is to package that data as an addition/overlay to a ZIM file, so that the same ZIM file can be fine-tuned for different applications or user-bases (e.g. children, teachers, hikers, scientists in Antarctica or on the International Space Station, etc).

mgautierfr commented 1 year ago

My idea is to add a "usage neutral" importance to the Xapian database, which would help Xapian to "correctly" sort the results. I'm against changing the zim format to store this information.

veloman-yunkan commented 1 year ago

I'm against changing the zim format to store this information.

I didn't mean changing the ZIM format. My proposal was to have one or more separate files (similar to external subtitles) augmenting the ZIM content with popularity information.
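
A rough sketch of what such a separate file could look like, assuming (purely for illustration) a JSON document that maps entry paths to a 0..100 score. The file layout, the function names and the default value are all hypothetical; nothing like this exists in libzim today.

```python
import json


def parse_overlay(text: str) -> dict[str, int]:
    """Parse a hypothetical sidecar document: {"entry/path": score, ...}."""
    raw = json.loads(text)
    overlay = {}
    for path, value in raw.items():
        score = int(value)
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range for {path!r}: {score}")
        overlay[path] = score
    return overlay


def importance_for(path: str, overlay: dict[str, int], default: int = 50) -> int:
    """Entries missing from the overlay get a neutral default score."""
    return overlay.get(path, default)


# One ZIM file could be paired with different overlays per audience.
children_overlay = parse_overlay('{"A/Apple": 90, "A/Apple_Inc.": 20}')
```

Because the overlay lives outside the ZIM file, swapping it changes the ranking hints without touching the archive itself.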

kelson42 commented 1 year ago

I would like to avoid discussing whether the feature request is pertinent, because there is no other ticket open proposing a solution to improve suggestion/search pertinence. If someone has a better idea, please open a ticket.

I would also like to avoid discussing whether the coefficient we put in makes sense. It is the role of the publisher, together with the scraper dev, to decide this.

mgautierfr commented 1 year ago

I would like to avoid discussing whether the feature request is pertinent, because there is no other ticket open proposing a solution to improve suggestion/search pertinence. If someone has a better idea, please open a ticket.

There is no other ticket open proposing a solution to improve suggestion/search because no issue has been opened that discusses the problem without assuming a solution.

The problem can be described as: some articles may match a search query, but the sorting order is debatable.

Proposing a solution (add a popularity value) instead of describing the issue will de facto create a situation where no other ticket proposes something else. But it doesn't mean we have to implement the first solution proposed (and it doesn't mean we must not implement it either).

I would also like to avoid discussing whether the coefficient we put in makes sense. It is the role of the publisher, together with the scraper dev, to decide this.

We can be sure that we will be excluded from that discussion, but we will still have to fix future indexing problems :) I prefer to discuss the relevance of an improvement before shipping it to users and realizing it is not the right solution.


In fact, your comment made me take a step back and look into how Wikipedia search works and into the popularity concept (yes, your comment had the opposite effect to the one intended). It seems that Wikipedia sorts results by "relevance" by default. "Relevance" is not well defined, but https://www.mediawiki.org/wiki/Help:CirrusSearch#Explicit_sort_orders describes it as "a relevance sort taking into account many features of the document". (On our side, it seems we are using some kind of just_match order.) But in my search I haven't found any information suggesting that popularity (or any static order) is used here.

About the search for Apple itself: the Wikipedia search engine (https://en.wikipedia.org/w/index.php?fulltext=1&search=apple&title=Special%3ASearch&ns0=1) returns the page Apple (the fruit) first and Apple_Inc (the company) second. Our search engine (https://library.kiwix.org/viewer#search?books.name=wikipedia_en_all_maxi_2023-02&pattern=apple) gives Apple_Inc 9th and Apple 18th, the first result being List_of_songs_recorded_by_Fiona_Apple. Obviously something is wrong on our side.

But if you look at the popularity of the first pages with https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Apple|Apple_Inc.|Apple_(disambiguation)|Apples_to_Apples|Apple_Mac_OS_X|MacOS you can see that the Wikipedia sorting doesn't follow popularity: MacOS (through Apple_Mac_OS_X) is higher than Apple_Inc, which is higher than Apple. If the purpose is not to "promote Apple products to young children" (which is debatable), popularity doesn't seem to be a good criterion.

And popularity is really dependent on the current context. Today Putin has earned a lot of views with the invasion of Ukraine (these are not my words, but the ones in https://en.wikipedia.org/wiki/Wikipedia:Popular_pages#Political_leaders). I'm not sure we want to sort content that will stay around for a long time using contextual information.

But there is another way we can improve this without adding a new feature. From https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Section_Topics/Data_Pipeline#General_architecture we can see that one step is to "Filter non-informative sections" and "Filter noisy items". We could do the same before adding new criteria we don't know how to handle. For example, scrapers may start to use getIndexData (we discussed it in https://github.com/openzim/libzim/issues/377, for example) to provide better content for indexing (which is exactly the purpose of the getIndexData feature). We have known for a long time that our indexing process is not good. We should fix it first before adding a new feature. For example, the second article we return for apple (https://library.kiwix.org/content/wikipedia_en_all_maxi_2023-02/A/String_interpolation) is not about apples at all but has a lot of examples containing "apple". If the scraper removed the code examples, that article would drop out of the apple results.
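
As a minimal sketch of that filtering step, here is how a scraper could drop code blocks from an article before handing its text to the indexer. This is plain Python using only the standard library; it does not call getIndexData or any libzim API, and the class name, the function name and the set of skipped tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class IndexTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside the skipped tags."""

    SKIPPED = {"pre", "code", "script", "style"}  # assumed "noisy" tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def text_for_indexing(html: str) -> str:
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Fed an article whose only mentions of "apple" sit inside `<pre>` or `<code>` blocks, this returns text without the word, so a word-occurrence index would no longer match that article for the query.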

We could also put the article with the exact title as the first result. That would easily put the Apple article first without changing the indexing process at all.

Another possibility would be to count how many links point to an article (which is closer to how ranking has historically worked).
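
A minimal sketch of that reordering, applied after the search engine has returned its ranked titles. The function name is hypothetical and the comparison is a simple case-insensitive match on the whole title; how redirects or normalized titles should be treated is left open.

```python
def promote_exact_title(query: str, ranked_titles: list[str]) -> list[str]:
    """Move titles equal to the query (ignoring case) to the front.

    The relative order of all other results is left untouched.
    """
    wanted = query.strip().casefold()
    exact = [t for t in ranked_titles if t.casefold() == wanted]
    others = [t for t in ranked_titles if t.casefold() != wanted]
    return exact + others
```

For the query "apple", a result list that starts with "List of songs recorded by Fiona Apple" and has "Apple" further down would come back with "Apple" first and everything else in its original order.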
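
A small sketch of the link-counting idea, assuming the scraper already knows each article's outgoing links. The input shape (a dict of title to list of linked titles) and the function name are assumptions; this only counts direct inbound links and is not an implementation of any particular ranking algorithm.

```python
from collections import Counter


def inbound_link_counts(outgoing: dict[str, list[str]]) -> Counter:
    """Count, for each article, how many distinct other articles link to it."""
    counts = Counter()
    for source, targets in outgoing.items():
        # A page linking to the same target several times counts once,
        # and self-links are ignored.
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


links = {
    "Apple Inc.": ["Apple", "MacOS"],
    "Fruit": ["Apple", "Apple"],
    "Apple": ["Fruit", "Apple"],
}
counts = inbound_link_counts(links)
```

Such a count depends only on the content being packaged, so it would be available for any scraped site and not just for sites that publish page-view data.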

Today (and tomorrow also), only wikipedia has the popularity information. It means that the popularity factor will be used for only one website. It will not help the global indexing/search engine; in fact, it will interfere with it, making debugging really difficult ("Ranking in wikipedia zim not good" -> scraper problem). I'm not saying that we should not do it, but I WANT to discuss it first.

kelson42 commented 1 year ago

Today (and tomorrow also), only wikipedia has the popularity information.

This is very very wrong, false for:

... and nobody knows the future

holta commented 1 year ago

Some educators (and librarians, and medical professionals) believe viral popularity is the problem — rather than the solution 😄 Not everyone agrees with them of course (teen-age TikTok addicts especially) 😉

Either way, these criticisms of "viral popularity" (as demagogic if not mindless mob rule) are just 1 more reminder that Quality/Relevance indicators of {offline search results, offline content, etc} are so urgently needed by all "offline" communities (perhaps our most painfully hard design challenge!?) ✅