searx / searx

Privacy-respecting metasearch engine
https://searx.github.io/searx/
GNU Affero General Public License v3.0

query negations are sometimes ignored #1506

Open ghost opened 5 years ago

ghost commented 5 years ago

Searching this instance for "macaroni and cheese recipe -oven -baked" yields in the 4th slot:

"Baked Macaroni and Cheese Recipes - Southern Living" (DuckDuckGo)

The large search engines seem to all drift in this direction of thinking they know better than the user and ignore users' explicit negations. It is wrong to do so, as it insults the user's intelligence and gives bad results. I'm not sure if Searx is simply trusting the results of the crawlers but it should not. Searx should remove the non-matching results.

atomGit commented 5 years ago

i kind of agree - this is one reason why i never enable Bing because that engine ignores various operators, including "phrase searches" and -negative keywords

dalf commented 5 years ago

It is a difficult problem to solve.

Current implementation: the raw query text is sent to the different engines. The thing is that Google may decide that negation should be ignored for a specific query, while DuckDuckGo won't.

I would recommend using only one engine at a time for complex queries (because each engine has its own syntax and its own way to "understand" complex queries).

One slippery way to fix it: detect the negation in searx, and disable engines that are known to ignore the negation.
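A rough sketch of that "slippery" approach, in Python. This is not searx code: the function names and the `ENGINES_IGNORING_NEGATION` set are hypothetical, and listing Bing there is only an assumption based on the comment above. Which engines actually honor `-term` would have to be checked and maintained by hand.

```python
# Hypothetical list of engines assumed to ignore "-term" operators.
# This is an illustration, not a verified fact about any engine.
ENGINES_IGNORING_NEGATION = {"bing"}


def negated_terms(query):
    """Return the lowercased terms that are prefixed with '-' in the query."""
    return [
        token[1:].lower()
        for token in query.split()
        if token.startswith("-") and len(token) > 1
    ]


def select_engines(query, enabled_engines):
    """Drop engines assumed to ignore negation when the query contains one."""
    if not negated_terms(query):
        return list(enabled_engines)
    return [e for e in enabled_engines if e not in ENGINES_IGNORING_NEGATION]


engines = select_engines(
    "macaroni and cheese recipe -oven -baked",
    ["duckduckgo", "bing", "google"],
)
```

Queries without a negation keep every enabled engine, so the cost is only paid when the user actually writes `-term`.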

ghost commented 5 years ago

I've not looked at the searx code so I'm not sure why it would be a difficult problem to solve. Wouldn't it be trivial to grep the results for negated words and omit the ones that match? It would miss the cases where a negated word appears in the full article but not in the abstract portion of the search result, but it might still trivially eliminate a good number of false positives.
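To make the idea concrete, here is a minimal sketch in Python. It assumes each result is a dict with `title` and `content` (abstract) keys; that shape and the function names are assumptions for illustration, not taken from the searx code.

```python
import re


def negated_terms(query):
    """Return the lowercased terms that are prefixed with '-' in the query."""
    return [
        token[1:].lower()
        for token in query.split()
        if token.startswith("-") and len(token) > 1
    ]


def mentions_negated_term(result, terms):
    """True if any negated term appears as a whole word in title or abstract."""
    text = f"{result.get('title', '')} {result.get('content', '')}".lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


def filter_results(query, results):
    """Omit results whose title or abstract contains a negated word."""
    terms = negated_terms(query)
    return [r for r in results if not mentions_negated_term(r, terms)]


results = [
    {"title": "Baked Macaroni and Cheese Recipes - Southern Living", "content": ""},
    {"title": "Stovetop Mac and Cheese", "content": "Creamy and quick"},
]
kept = filter_results("macaroni and cheese recipe -oven -baked", results)
```

The match is on whole words only, so `-baked` would not catch "bake" or "baking". It also only sees the text the engine returned, not the full page.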

If we look at this from a rule of least astonishment angle and assume that users know searx is a metasearch, then users would naturally expect searx to ignore obviously mismatched results in the data it already handles, but not necessarily to do a deep inspection of every result.

A quite advanced version of Searx could even have query mismatches be detected by the user after a full page is retrieved and report back to create reliability metrics that improve searx... but maybe that's getting carried away.

atomGit commented 5 years ago

one possible solution to this is using a tabbed interface to display the results of each engine separately as suggested here - going into settings every time the user requires something specific is not optimal

ghost commented 5 years ago

i'm not sure how a tabbed interface solves the problem of ignored negations. If you're suggesting that the user should still be able to see ignored negation results, then in terms of UI the same approach used by searxes.danwin1210.me to handle CloudFlare sites would be useful. That is, results are given but with a strikethrough. And in the onion version, the undesirable results are folded into a drop-down folder that expands at the bottom.

Or if it were to be a tabbed UI, wouldn't it be better to have two tabs: one for conforming results and the other for non-conforming results?
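Splitting the results into those two groups could look something like the sketch below. It is Python, the function name and the result shape (a dict with a `title` key) are assumptions, and it only checks whitespace-separated title words, so punctuation stuck to a word would defeat it.

```python
def partition_results(results, negated):
    """Split results into (conforming, non_conforming) by negated title words."""
    negated_set = {term.lower() for term in negated}
    conforming, non_conforming = [], []
    for result in results:
        title_words = set(result["title"].lower().split())
        if title_words & negated_set:
            non_conforming.append(result)  # would go in the second tab
        else:
            conforming.append(result)      # would go in the first tab
    return conforming, non_conforming


conforming, non_conforming = partition_results(
    [
        {"title": "Baked Macaroni and Cheese Recipes"},
        {"title": "Stovetop Macaroni and Cheese"},
    ],
    ["oven", "baked"],
)
```

Nothing is discarded here, so the user can still look at the non-conforming group if they want to.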

atomGit commented 5 years ago

> i'm not sure how a tabbed interface solves the problem of ignored negations.

it doesn't, granted, but at least it isolates them? - also this topic wasn't my sole nor primary reason for suggesting a tabbed UI, but yes, i agree, it's not a complete solution