koniu / recoll-webui

web interface for recoll desktop search

Different results in webui and recoll #2

Closed wonx closed 10 years ago

wonx commented 11 years ago

I've been testing the webui for a while, and I realised that the search results differ from what I get in recoll. I guess that recoll-webui doesn't use language stemming like recoll does.

These are the differences between the two queries. First recoll's query:

Query details: (((neuropsychological OR neuropsychology:(wqf=11) OR neuropsychologic OR neuropsychologically OR neuropsychologi OR neuropsycholog) AND (alcohol:(wqf=11) OR alcoholism OR alcoholics OR alcoholicos OR alcoholismo OR alcoholic OR alcoholica OR alcoholico OR alcoholicas OR alcohola OR alcoholic's))) 

(note that my recoll's query includes both english and spanish words, because the two languages are configured in recoll.)

And this is recoll-webui's query:

results?query=neuropsychology+alcohol&dir=<all>&after=&before=&sort=relevancyrating&ascending=0&page=1

Also, I feel that the results should be sorted by relevance by default (instead of by date).

koniu commented 11 years ago

The WebUI relies entirely on the Recoll Python API for querying and retrieving results, so the result list should be the same. However, I have a faint memory of medoc (the Recoll dev) saying that the API module has the stemming language hard-coded to English (a quick look at the source code seems to confirm that), which would be the core of this issue. I have a feeling this might have been fixed recently in Recoll and will no longer be an issue with the next release. I will investigate further whenever I get some spare time.

wonx commented 10 years ago

Any news on this issue? Has there been any change in the API during this time?

koniu commented 10 years ago

Sorry for leaving this issue so long. There have been some changes in the Python API and the stemming language is no longer hardcoded, so something can definitely be done.

medoc, reckon you can shed some light? I see that Query.execute() takes a 'stemlang' arg now, but I'm not sure what the allowed values are. Is this something that can be pulled out of recoll.conf? Looking at pyrecoll.cpp, it seems that it defaults to English. How does the native interface handle this? I see that in the GUI there's a drop-down under "GUI config > search params" that lets you pick the stemming language. The options I can see are "no stemming", "all languages" and "english". How would I get such a list, and how would I do "all languages" through the Python API?

I'd like to make webui more robust in this area, and to make its default behaviour the same as (or very similar to) the native interface, but I'm not sure how to go about it, or even what the options are.

ghost commented 10 years ago

You can get the stemming languages which are configured in the index by retrieving the "indexstemminglanguages" parameter from the configuration. The default value is "english". This can be a list of several values, like "french english german".

By the way, there is now a rclconfig.py file in the recoll source which would be a better way to access the configuration. It properly handles the defaults and is supposed to work in every way like the C++ code (if it doesn't, I'll fix it). It's included with recent recoll releases, but to work with older releases, maybe it would be better to keep a copy with the webui.

It's basically (ignoring the appropriate import statements and module names as they will depend on how you do things):

rclconf = RclConfig()
stemlangs = rclconf.getConfParam("indexstemminglanguages")

There are other methods, without much commenting, but I think it's fairly obvious what they do. If not, just ask me.

You can also retrieve the list of all stemming languages supported by the Xapian stemmer by typing "recollindex -l", but I don't think that it is of much interest for the webui, because languages have to be configured into the index to be useful.
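
A minimal sketch of turning that parameter into a list of language names. It assumes the value is a whitespace-separated string, as in the "french english german" example, and that an unset value falls back to "english". `get_stem_languages` and `StubConfig` are hypothetical names; the stub only stands in for RclConfig so the snippet runs without Recoll installed.

```python
def get_stem_languages(conf, default="english"):
    """Return the index stemming languages as a list of names.

    `conf` is any object with a getConfParam(name) method, such as
    an RclConfig instance.
    """
    value = conf.getConfParam("indexstemminglanguages")
    # Fall back to the default when nothing is configured (assumption:
    # an unset parameter comes back as None or an empty string).
    if not value or not value.strip():
        value = default
    return value.split()


class StubConfig:
    """Hypothetical stand-in for RclConfig, for illustration only."""

    def __init__(self, params):
        self._params = params

    def getConfParam(self, name):
        return self._params.get(name)


langs = get_stem_languages(
    StubConfig({"indexstemminglanguages": "french english german"})
)
```

With Recoll available, one would pass an `RclConfig()` instance instead of the stub.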

koniu commented 10 years ago

Sweet, rclconfig.py is the way forward for sure - nice one! The question is indeed where it goes.

The module definitely belongs with the rest of the Python API, but leaving it at that and adapting webui to use it would indeed break compat for a lot of distros. I could just copy the file into the webui repo, but that creates an untracked copy (not linked to upstream), which adds to maintenance and can cause issues should changes occur upstream. Maybe something in between, where I keep a copy for the time being but remove it in the future, once those newer versions have been around for a while and perhaps it's time for another recoll-version-based branch of webui.

Btw, I was trying to figure out which releases have rclconfig.py and I looked on your bitbucket but couldn't figure out how commits correspond to releases. Could do it by date but couldn't even find the dates for individual minor releases. Any insight?

ghost commented 10 years ago

koniu writes:

> Sweet, rclconfig.py is the way forward for sure - nice one! Question is indeed of where it goes.

> The module definitely belongs with the rest of the python api but leaving it at that and adapting webui to use it would indeed break compat for a lot of distros. I could just copy the file into webui repo but that creates an untracked copy (not linked to the upstream) but that adds to maintenance and can cause issues should changes occur upstream.

rclconfig.py is not going to change a lot, so the maintenance load for updating the local copy from time to time should be minimal.

I also quite believe that no conflict could arise from having a local copy, because the recoll python API currently does not use it at all (it's pure C), and the module itself is completely standalone. If you carry it in a subdirectory (something like rclut), and import it from there, I really can't see how this would cause problems. I'm not even sure that this is necessary, but it seems the safe way.

> Maybe something in between where I keep a copy for the time being but remove it in the future, whenever those newer versions have been around for a while and perhaps it's time for another recoll-version-based branch of webui.

Yes to local copy. I don't think that another branch is necessary.

> Btw, I was trying to figure out which releases have rclconfig.py and I looked on your bitbucket but couldn't figure out how commits correspond to releases. Could do it by date but couldn't even find the dates for individual minor releases. Any insight?

It appears that it has been installed since 1.19.5. It is used by the Ubuntu Lens, which tests for it and adjusts itself depending on whether it's there or not.

Trying to import the module from recoll and using the local copy if the recoll one is not available should not create too much of a performance issue, and would avoid having to maintain several branches or relying on version numbers.

Or just use the local copy, and remove it one day when all recoll versions in use have it. I would not hold my breath though, as some people still use recoll 1.16 (and even 1.13 sometimes).
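
The first option (try the installed module, then fall back to the bundled copy) could be sketched roughly as follows. `import_first_available` is a hypothetical helper, not part of Recoll or the webui, and the module names passed to it below are placeholders chosen so that the snippet runs anywhere.

```python
import importlib


def import_first_available(names):
    """Import and return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed under this name: try the next candidate.
            continue
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(names)
    )


# Placeholder names: the first does not exist, so the second is used.
mod = import_first_available(["no_such_module_for_demo", "json"])
```

For the webui the list might be something like `["recoll.rclconfig", "rclut.rclconfig"]`, installed module first and local copy second. Both paths are assumptions here; the second follows the rclut subdirectory suggested above.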

Cheers,

jf

koniu commented 10 years ago

wonx: can you try this revision to see if that's fixed now? if not feel free to re-open :)

wonx commented 10 years ago

I just tried it. It seems to work: the results from the webui are now identical to those from the recoll GUI. Thanks!

But (there's always a "but"), words that match a search term only through stemming aren't highlighted (they are in Recoll). For example, if I search for "alcohol", the word "alcoholism" should be highlighted too. It would be nice if this feature was added, although it's not really important.

ghost commented 10 years ago

In recent Recoll versions (from 1.18.2 I think), the Query class has a makedocabstract() method which can perform proper highlighting. See http://www.lesbonscomptes.com/recoll/usermanual/RCL.PROGRAM.API.html#RCL.PROGRAM.API.PYTHON

The Query method replaces and augments the Db one (which still exists).
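
As I read the manual linked above, makedocabstract() accepts an optional object whose startMatch(idx) and endMatch() methods return the markup to place around each matched term. A minimal sketch of such an object follows. The class name and CSS class are hypothetical, and the commented call at the end is an assumption about the signature that should be checked against the manual.

```python
class HtmlHighlighter:
    """Supplies the opening and closing markup for highlighted terms."""

    def __init__(self, css_class="search-result-highlight"):
        self.css_class = css_class

    def startMatch(self, idx):
        # idx identifies the matched term group; it is unused here
        # because every match gets the same markup.
        return '<span class="%s">' % self.css_class

    def endMatch(self):
        return "</span>"


highlighter = HtmlHighlighter()
# Assumed usage with a live query and one of its result documents:
#   abstract = query.makedocabstract(doc, highlighter)
```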

koniu commented 10 years ago

Highlighting should now follow the stemming thanks to medoc :)