searxng / searxng

SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither tracked nor profiled.
https://docs.searxng.org
GNU Affero General Public License v3.0

Bug: archive.is engine Timeout / blocked by CAPTCHA #2643

Closed bonswouar closed 11 months ago

bonswouar commented 11 months ago

Version of SearXNG, commit number if you are using on master branch and stipulate if you forked SearXNG
Repository: https://github.com/searxng/searxng
Branch: master
Version: 2023.8.8+bcaaae699

How did you install SearXNG? searxng-docker

What happened? Tried to use the archive.is engine, but it always times out.

How To Reproduce Search for "!ai samsung.com"

Expected behavior Shouldn't time out

Additional context My SearXNG instance is on a dedicated server. But I notice I also struggle to navigate to archive.is directly: with my home connection (using a shared Starlink IP) it seems to loop infinitely on the CAPTCHA page, but there is no problem using the cell network. So I guess they might have weird IP restrictions or something?
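To tell a plain timeout apart from a CAPTCHA wall when testing from different networks, a small probe can help. The following is only a minimal sketch using the Python standard library: the function names `classify_response` and `probe`, the marker strings, and the choice of status codes 403/429 are illustrative assumptions, not confirmed archive.is behaviour and not part of SearXNG.

```python
import socket
import urllib.error
import urllib.request

# Assumption: illustrative marker strings, not confirmed archive.is page content.
CAPTCHA_MARKERS = ("captcha", "recaptcha", "hcaptcha")


def classify_response(status_code: int, body: str) -> str:
    """Classify an HTTP response as 'captcha', 'blocked', 'ok' or 'error'."""
    text = body.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    # Assumption: 403 and 429 are treated as an IP-based block.
    if status_code in (403, 429):
        return "blocked"
    if 200 <= status_code < 300:
        return "ok"
    return "error"


def probe(url: str, timeout: float = 5.0) -> str:
    """Fetch url once and report 'timeout', 'unreachable' or a classification."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
            return classify_response(response.status, body)
    except urllib.error.HTTPError as exc:
        # Error responses (4xx/5xx) still carry a body that may be a CAPTCHA page.
        body = exc.read(65536).decode("utf-8", errors="replace")
        return classify_response(exc.code, body)
    except (TimeoutError, socket.timeout):
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timeout"
        return "unreachable"
```

Running `probe("https://archive.is/")` once from the server and once from another connection would show whether the two networks are treated differently.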

Technical report

Error

unixfox commented 11 months ago

Related information: https://old.reddit.com/r/DataHoarder/comments/13g4htv/cloudflare_dns_blocking_archiveis/

bonswouar commented 11 months ago

@unixfox thanks for the link, it's interesting! Although it seems my issue isn't exactly the same: I can nslookup/ping archive.is successfully (apparently using Hetzner's DNS server). But they probably just have different types of IP restrictions.

return42 commented 11 months ago

Sadly, archive.is is blocked by a CAPTCHA, and I don't have a clue how we can avoid it.

return42 commented 11 months ago

@bonswouar I'm sorry, but we do not have a solution for this issue at hand. In #2645 I will drop the XPath configuration for this engine (it makes no sense to keep a configuration that no longer works).

The merge of #2645 will close this issue. If this search engine is very important to you, you would have to open an engine request. Maybe someone can implement a Python module that is able to bypass the CAPTCHA problem (if there is a way to bypass it). Maybe you can already make suggestions; any support is welcome.

I am sorry that we cannot do more at present.

bonswouar commented 11 months ago

@return42 No worries, I totally understand! And unfortunately, after the few tests I did on my side, it seems you're right: there is no easy way to bypass this CAPTCHA (it seems to depend too much on the IP).

But who knows, maybe at some point they'll revert some of those restrictions when they see how problematic they can be (I actually can't use the website at all from my personal browser and IP).