mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

Investigate better approaches for the domain filter and url_search_string filter #832

Open pgulley opened 1 month ago

pgulley commented 1 month ago

At the very least, we discovered that "field_name:some_string" in the query string is interpreted as a "contains" expression by default, not "is equal to". So we are potentially overmatching on canonical domain (depending on how Elasticsearch's tokenizer interprets these strings), and the wildcard might be totally redundant. Or it might not be. An afternoon spent poking at it in Kibana would quickly reveal the truth. Either way, the escaping we're doing now is redundant and might be affecting search results as well.
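For reference while poking at it in Kibana, here is a minimal sketch of the two kinds of clause being compared, written as plain Python dicts in Elasticsearch query DSL. The field name `canonical_domain`, the example domain, and the helper names are assumptions for illustration, not taken from our code. Whether the `term` clause really behaves as "is equal to" depends on the field being mapped as a keyword, which still needs to be checked against the actual index mapping.

```python
import json


def query_string_clause(field: str, value: str) -> dict:
    """Lucene query-string syntax, i.e. "field_name:some_string".

    On an analyzed text field the value goes through the analyzer and is
    matched against the indexed tokens, which is the "contains" behaviour.
    """
    return {"query_string": {"query": f"{field}:{value}"}}


def term_clause(field: str, value: str) -> dict:
    """Term-level query: the value is not analyzed.

    Assumption: the field is mapped as a keyword, so this compares against
    the whole stored value ("is equal to").
    """
    return {"term": {field: {"value": value}}}


def search_body(user_query: str, filter_clause: dict) -> dict:
    """Combine the user's query with a non-scoring filter clause."""
    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": user_query}}],
                "filter": [filter_clause],
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical field and value; paste each body into Kibana and compare hit counts.
    for clause in (
        query_string_clause("canonical_domain", "nytimes.com"),
        term_clause("canonical_domain", "nytimes.com"),
    ):
        print(json.dumps(search_body("climate", clause), indent=2))
```

If the two bodies return different hit counts for the same domain, that would confirm the overmatching.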

Related: should constructing the filter search strings even happen in web-search? That feels like something we should be handling in news-search-api (NSA). At the very least, the better developer pattern would be to expect to update the NSA whenever we want to change the syntax of our queries.
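A rough sketch of what that boundary could look like: the web app sends structured filter values, and the query syntax is built in exactly one place on the API side. Everything here is hypothetical (`SearchFilters`, `build_filter_clauses`, the `canonical_domain` and `url` field names), and treating a URL search string as a prefix match is an assumption, not a description of what the current code does.

```python
from dataclasses import dataclass, field


@dataclass
class SearchFilters:
    """Structured filters as the web app would send them (hypothetical shape)."""
    domains: list[str] = field(default_factory=list)
    url_search_strings: list[str] = field(default_factory=list)


def build_filter_clauses(filters: SearchFilters) -> list[dict]:
    """Translate structured filters into Elasticsearch filter clauses.

    Living on the API side, this is the only code that would need to change
    when the query syntax changes.
    """
    clauses: list[dict] = []
    if filters.domains:
        # Match any of the listed domains (assumes a keyword-mapped field).
        clauses.append({"terms": {"canonical_domain": filters.domains}})
    if filters.url_search_strings:
        # Assumption: each URL search string is treated as a prefix match.
        clauses.append({
            "bool": {
                "should": [
                    {"wildcard": {"url": {"value": f"{s}*"}}}
                    for s in filters.url_search_strings
                ],
                "minimum_should_match": 1,
            }
        })
    return clauses
```

With this split, web-search would only ever pass lists of domains and URL strings, and never assemble or escape query syntax itself.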

philbudne commented 1 month ago

I wonder if it all belongs in mc-providers, and news-search-api should go away?

Also: do today's revelations open up the possibility that url_search_strings CAN be done on IA?