searxng / searxng

SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither tracked nor profiled.
https://docs.searxng.org
GNU Affero General Public License v3.0

Add an option to search for content within specified lists of websites #2507

Open LeoBoudet opened 1 year ago

LeoBoudet commented 1 year ago

Hello!

I have noticed that search engine results are often polluted by websites whose only quality is good SEO, with no relevant information to share.

To solve this problem, which many of us apparently face, I thought of letting users create lists of resources that they have selected in advance to be searched.

Let's say I am working on a biology article and I already know several web resources that I trust and would like to search easily. I could simply select my custom list called "biology" and type my queries, which would then be run only against the resources named on that list.

I have looked around for a while to find such a feature but didn't manage to find anything so I am proposing it here, because I think that projects like SearXNG are some of the most obvious to develop such features.

I've only found a request for a blocking list here: https://github.com/searx/searx/issues/2001

If you are interested, you can find my complete reflection on that matter in this article, that extends this idea to other features: https://synergeticdesign.substack.com/p/software-overlay-for-metasearch-engines

If you already know a tool to do so, please share it in the comments, I am sure it would help many people.

Thank you for your time,

Léo

allendema commented 1 year ago

Difficult, because SearXNG has no index, but:

Option 1:

Use an upstream engine which supports domain whitelisting/ranking, like Mojeek [1] or Brave [0]. In settings.yml, copy the base upstream engine config, rename it and add the domains you want. This step can be repeated for each list.

Working Mojeek config with whitelisted domains, e.g. for Python (base Mojeek config + added foc param and cookies):

```yaml
- name: mojeekpython
  shortcut: mjkpy
  engine: xpath
  paging: True
  search_url: https://www.mojeek.de/search?q={query}&s={pageno}&lang={lang}&lb={lang}&foc=python
  results_xpath: //ul[@class="results-standard"]/li/a[@class="ob"]
  url_xpath: ./@href
  title_xpath: ../h2/a
  content_xpath: ..//p[@class="s"]
  suggestion_xpath: //div[@class="top-info"]/p[@class="top-info spell"]/em/a
  first_page_num: 0
  page_size: 10
  disabled: True
  weight: 1.1
  display_error_messages: False
  cookies:
    foc_python: i=python-forum.de,python.org,realpython.com,stackoverflow.com&e=pinterest.com,www.w3schools.com,www.ionos.de
```

Brave config with whitelisted domains, e.g. for tech blogs (this used to work previously):

https://github.com/searxng/searxng/compare/master...allendema:searxng:brave-goggles

https://github.com/searxng/searxng/commit/12d9de269825759eb3ce15f22357f39968e2c513
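
Since the Mojeek step "can be repeated", here is a minimal sketch of how one entry per list could be generated instead of copied by hand. It only mirrors the shape of the mojeekpython entry above. The helper names (`focus_cookie`, `focus_engine`) are hypothetical, and it assumes that the `foc=<topic>` parameter and the `foc_<topic>` cookie with its `i=...&e=...` value work the same way for other topics as they do for `python`.

```python
# Fields copied from the mojeekpython entry above; only the name, shortcut,
# foc= parameter and cookie are assumed to change per topic.
BASE = {
    "engine": "xpath",
    "paging": True,
    "results_xpath": '//ul[@class="results-standard"]/li/a[@class="ob"]',
    "url_xpath": "./@href",
    "title_xpath": "../h2/a",
    "content_xpath": '..//p[@class="s"]',
    "suggestion_xpath": '//div[@class="top-info"]/p[@class="top-info spell"]/em/a',
    "first_page_num": 0,
    "page_size": 10,
    "disabled": True,
    "weight": 1.1,
    "display_error_messages": False,
}
SEARCH_URL = (
    "https://www.mojeek.de/search?q={query}&s={pageno}"
    "&lang={lang}&lb={lang}&foc="
)


def focus_cookie(include, exclude=()):
    """Build a cookie value shaped like the one above: i=<kept>&e=<dropped>."""
    value = "i=" + ",".join(include)
    if exclude:
        value += "&e=" + ",".join(exclude)
    return value


def focus_engine(topic, shortcut, include, exclude=()):
    """Return one engine entry (as a dict) for a named list of domains."""
    entry = dict(BASE)
    entry.update(
        name="mojeek" + topic,
        shortcut=shortcut,
        search_url=SEARCH_URL + topic,
        cookies={"foc_" + topic: focus_cookie(include, exclude)},
    )
    return entry


# Hypothetical "biology" list; the domains are placeholders.
biology = focus_engine(
    "biology", "mjkbio", ["example.org", "example.edu"], ["pinterest.com"]
)
```

The resulting dicts would still have to be written into the engines list of settings.yml, for example with a YAML library.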

Option 2:

Requires changes to javascript/html code:

[0] https://search.brave.com/help/goggles

[1] https://www.mojeek.com/focus/dashboard

LeoBoudet commented 1 year ago

Thank you for your insights. Option 2 is what I had in mind. I see that Mojeek and Brave Search offer some possibilities for fine-tuned searching, but it is still a bit more technical than what I was thinking of. Still, it is good news, as it means that the need has been identified and that we should see more user-friendly solutions for this feature in the future.

Would post-search filtering be difficult for SearXNG? I get that it has no index, but I was thinking of a script that narrows down the results after they come back from the upstream engines, so that only the expected ones are delivered (based on a whitelist, blacklist or weighted list).
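
To make the question concrete, here is a minimal sketch of what such post-search filtering could look like. It is not SearXNG code: the result shape (dicts with a `url` and an optional `score`) and the function names are assumptions for illustration, and where this would hook into SearXNG is left open.

```python
from urllib.parse import urlparse


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def filter_results(results, allow=(), block=(), weights=None):
    """Drop or re-rank already-fetched results by the domain of their URL.

    results: list of dicts with a "url" key and optional "score" (assumed shape)
    allow:   if non-empty, only results from these domains are kept
    block:   results from these domains are always dropped
    weights: optional {domain: multiplier} applied to a result's score
    """
    weights = weights or {}
    kept = []
    for result in results:
        host = (urlparse(result["url"]).hostname or "").lower()
        if any(host_matches(host, d) for d in block):
            continue
        if allow and not any(host_matches(host, d) for d in allow):
            continue
        factor = 1.0
        for domain, weight in weights.items():
            if host_matches(host, domain):
                factor = weight
                break
        kept.append({**result, "score": result.get("score", 1.0) * factor})
    # sorted() is stable, so results with equal scores keep their order
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Example call with made-up results
filtered = filter_results(
    [
        {"url": "https://docs.python.org/3/tutorial/", "score": 2.0},
        {"url": "https://example.com/seo-page", "score": 5.0},
    ],
    allow=["python.org"],
)
```

One limitation of this approach: because the filter runs only on what the upstream engines already returned, a strict whitelist can leave a page with very few results, or none. That is why pushing the domain list to the upstream engine, as in Option 1, tends to give fuller result pages.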

Raphencoder commented 10 months ago

I have almost the same need as @LeoBoudet! Is there any new solution or workaround? My idea is to deploy a custom searx instance for users that will only return results from specific domains (such as wikipedia.org and a few other websites), with custom filters they can select, etc. @allendema or @return42, any idea how to reach this goal? Thanks for this excellent project.