kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0
407 stars 79 forks source link

Remove OPDS "notag" parameter #440

Open kelson42 opened 3 years ago

kelson42 commented 3 years ago

The problem with this notag parameter is that it complexifies the requests and does not solve the fullproblem. For example it is impossible to make a request with ZIM files having tag1 OR tag2... and this seems to be a pretty legitimate and useful request.

Considering that we plan to use an in-memory Xapian DB to search in description/title, see https://github.com/kiwix/kiwix-lib/issues/106, I wonder if we should not put as well the tags/categories in it and allow to make search via simple Xapian operators https://xapian.org/docs/queryparser.html?

mgautierfr commented 3 years ago

"notag" argument of a query on the opds service is independent of how we implement the library database (xml, xapian or whatever).

The "notag" query is used in kiwix-desktop to get the "other" category. kiwix-desktop doesn't know the exact list of categories available and use a static list. So when we want to filter zim file to display all zim not in the "known categories" we have no option than exclude zim files in the categories we know. If we have a way to know all categories(#318), we can do a filter to only include zim files in the categories we don't display to the user.

kelson42 commented 3 years ago

@mgautierfr Thx for clarifying the usage of notag. To me:

kelson42 commented 3 years ago

@veloman-yunkan Can we move on with this ticket? Maybe we can keep the "notag" parameter for a bit of time to keep backward compatibility, but behind the scene this should work with XAPIAN operators.

veloman-yunkan commented 3 years ago

@kelson42 I propose to open a new ticket for utilizing Xapian for all other book fields too (currently, only title and description are indexed). This ticket can be closed together with that one.

veloman-yunkan commented 3 years ago

@kelson42 I propose to open a new ticket for utilizing Xapian for all other book fields too (currently, only title and description are indexed). This ticket can be closed together with that one.

kiwix/kiwix-lib#484 was opened as proposed above

kelson42 commented 3 years ago

@veloman-yunkan my bad, forgotten about your comment. perfect.

kelson42 commented 2 years ago

@veloman-yunkan https://github.com/kiwix/libkiwix/issues/484 and https://github.com/kiwix/kiwix-tools/issues/318 have been implemented but it seems that the notag parameter is still there and this is actually not obvious to me how to achieve to get a similar restult as the following request with the tag= OR/AND category=filters (is that possible at all?!).

https://library.kiwix.org/catalog/search?lang=eng&count=-1&notag=_category:gutenberg;_category:mooc;_category:phet;_category:psiram;_category:stack_exchange;_category:ted;_category:vikidia;_category:wikibooks;_category:wikihow;_category:wikinews;_category:wikipedia;_category:wikiquote;_category:wikisource;_category:wikiversity;_category:wikivoyage;_category:wiktionary

veloman-yunkan commented 2 years ago

@kelson42 Preserving only two notag values from your example, the query should be

https://library.kiwix.org/catalog/search?q=lang:eng%20-tag:_category:gutenberg%20-tag:_category:mooc'

(the value of the q parameter with URL encoding removed is lang:eng -tag:_category:gutenberg -tag:_category:mooc).

However the : symbol in the tag values confuses the Xapian query parser, which utilizes that punctuation mark for separating the field name from the field value. For tags not containing the colon character, that query works OK.

mgautierfr commented 2 years ago

I somehow disagree with the idea of exposing the xapian request format in the API. Xapian is a implementation details. Filtering was made without xapian before and it may change in the future. If we allow the user to pass a plain xapian query, then it is part of the API and it is no more a implementation choice.

We must define a search API and provide it to the user/client. If we use xapian internally, we have will have to transform the search query (from our API) to a xapian query. If we decide to use something else, the API will not change.

kelson42 commented 2 years ago

@mgautierfr OK in principle, but pragmaticaly, this seems not realistic. I don't want us to mockup/reimplement somehow the full operator parsing... and we need these operators right?

mgautierfr commented 2 years ago

I don't know if we need them. It would be good to define what we need to support first.

The OPDS stream is mainly used by clients (program) to know the list of available zim and (potentially) filter them to display subset to the user. There is three use cases :

On the OPDS side, we apply filtering on xapian request (q), maximum size (maxsize), name (name) , category (category), language (lang), tag (tag) and excluded tag (notag)
The q parameter is plain passed to xapian, which allow to search on title, description, name, category, lang, publisher, creator and tag (IF the client developer knows how xapian db is constructed and requested). This is available because the server parse the OPDS request, and use a "local" Library and Filter to implement the filtering. The Library/Filter is what is also use in local filtering (and allow more that what is done in OPDS request)

We extend the search parameters with kiwix/libkiwix#459 and provide a way to construct a OPDS request with kiwix/libkiwix#527 but we never update the kiwix-desktop side and so we still construct the request ourselves and we still use the tag=_category:Foo instead of category=Foo

It seems to me that all those features are more a organic evolution and the combination of different features not especially design as a whole.

I would propose :

It should not be so complex to implement, just check for the ~ at the beginning of the key and split the value with |. The ~key=foo|bar is a bit more complex but we can remove it from the API (I'm not sure we need it, I cannot find a use case)

kelson42 commented 1 year ago

@mgautierfr I think your proposal covers current needs but I'm 100% supportive because:

I propose something alternative:

mgautierfr commented 9 months ago

Tends to create a lots of URL parameters (all the NOT, and multiple of the same if we want an AND operation beetween them)

Your solution only have one parameters indeed. But the only one paramater will contains all the query and will include &/AND itself and such. I don't see a real improvement here.

Understand the overal request (to review or build) is easy because to a large extend implicit and hidden (in the code)

(I suppose a not is missing here) I disagree here. The current implementation is having a lot of implicit (and inconsistent behavior) (See https://github.com/kiwix/kiwix-desktop/pull/965#issuecomment-1715316561, we have lang=foo,bar which is a OR and category=foo,bar wihch is a AND)(This has been fixed) And ?lang=en|fr&~category=wikipedia&tag=video&tag=foo|bar&~tag=baz is not less readable than ?xapian_query=lang:(en OR fr) AND NOT category:wikipedia AND tag:video AND tag:(foo OR bar) AND NOT tag:baz

Does not allow mixing of parameters with logical operator

I'm not sure do understand what is the need here. My solution provide logical operator, so you must think about something specific

Does not allow complex operation with other operators, like parenthesis for example

Indeed. Do we really need it ?

Does not really scale easily if in the future we want to allow more complex requests

Difficult to say which future complex requests I suppose. But I'm not sure we will support and need querying a subset of the available book by applying a set of filter.

We need a bit of code to fully reassemble the multiple URL query string values

You will need a bit of code to fully reassemble the multiple filter in a xapian query string.

This would be pretty future proof, because almost no logic between URL representation of the query and Xapian representation

Indeed. But I really dislike exposing an API from our dependency as our main API. If for some reason we need/want to move out of xapian, we are stuck.

The mapping from my solution to the xapian query is pretty easy :

(Of course, we may not use simple string replace and use high level object to do the conversion, we already have a Filter class which parse a query string and create a xapian query)

rgaudin commented 9 months ago

I like this proposal because it covers a lot with minimal effort (I think!), acknowledging that it has known limitations. It's a good way to bring clarity, readability and flexibility to existing use cases (should they all be met).

Maybe all current OPDS API users could list their current and expected use cases (and query format). That would help ensure there's no dead spot, and could serve to write unit tests.

If all are met with this and it's as easy as it sounds to implement, it could be a good solution for now (my opinion!)