wabarc / wayback

An archiving tool with an IM-style interface that prioritizes privacy and accessibility, integrated with various archival services including Internet Archive, archive.today, Ghostarchive, IPFS, Telegraph, and file systems.
https://docs.wabarc.eu.org
GNU General Public License v3.0

archive.today is unavailable #92

Open hemind opened 3 years ago

hemind commented 3 years ago

Bug Report

Current Behavior: When running the wayback command to archive a web page, I got the following error message.

html to archive.today failed: archive.today is unavailable.

Environment

Possible Solution

archive.today now redirects to archive.ph. Maybe we should also use the archive.ph domain?

waybackarchiver commented 3 years ago

@hemind Thank you for reporting this. When archiving to archive.today, wayback already tries all of its domains, and it will also access the onion service if a Tor bundle is available. In this case the service may be unavailable because of a CAPTCHA, and there is currently a solution for that.

If you have other, more dependable solutions, please feel free to suggest them.

ahxxm commented 2 years ago

I got a URL like http://archive.today?url={{submitted-url}}

The message suggests that the bot failed to connect to the Tor service. Is this related?

arc_1  | [2021-10-22T02:34:20] [DEBUG] [tor.go:98:useProxy] Try to connect tor proxy failed: dial tcp 127.0.0.1:9050: connect: connection refused
arc_1  | Oct 22 02:34:20.320 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 22 02:34:20.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.

the message is from a container running docker image with

waybackarchiver commented 2 years ago

@ahxxm Thanks for your feedback!

When sending a request to archive.today, the archive.is package tries to reach archive.today's onion service through a Tor proxy, and it starts a temporary Tor proxy if it cannot connect to port 9050.
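
The "try port 9050 first, otherwise start a temporary Tor proxy" decision can be sketched roughly as below. This is only an illustration, not the actual code in tor.go: the helper name torProxyReachable and the 2-second timeout are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// torProxyReachable reports whether something accepts TCP connections at addr.
// Hypothetical helper for illustration; it is not the project's useProxy function.
func torProxyReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		// Covers "connection refused" as well as malformed addresses.
		return false
	}
	conn.Close()
	return true
}

func main() {
	addr := "127.0.0.1:9050" // default Tor SOCKS address, as seen in the logs
	if torProxyReachable(addr, 2*time.Second) {
		fmt.Println("connected:", addr)
	} else {
		fmt.Println("cannot connect to", addr, "(a temporary Tor proxy would be started here)")
	}
}
```

A successful TCP connection only shows that something is listening on the port; it does not prove that the listener is a working Tor SOCKS proxy.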

We can see from the logs that it starts a temporary Tor proxy:

arc_1  | Oct 22 02:34:20.320 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 22 02:34:20.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.

ahxxm commented 2 years ago

@waybackarchiver But I still get a URL like http://archive.today?url={{submitted-url}}, which requires manual submission (usually after dealing with an annoying Cloudflare CAPTCHA).

It seems that the current torrc disables 9050 by setting SOCKSPort 0? According to the sample torrc comments:

## Tor opens a SOCKS proxy on port 9050 by default -- even if you don't
## configure one below. Set "SOCKSPort 0" if you plan to run Tor only
## as a relay, and not make any local application connections yourself.
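
To check whether a given torrc turns the SOCKS listener off, a small scan like the following might help. It is a simplified sketch: the function name socksPortDisabled is hypothetical, and real torrc parsing has more cases (multiple SocksPort lines, included files) than this handles.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// socksPortDisabled reports whether the torrc text contains an active
// (non-comment) "SOCKSPort 0" line. Option names are matched
// case-insensitively. Hypothetical helper, for illustration only.
func socksPortDisabled(torrc string) bool {
	scanner := bufio.NewScanner(strings.NewReader(torrc))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and comments
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && strings.EqualFold(fields[0], "SOCKSPort") && fields[1] == "0" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(socksPortDisabled("SOCKSPort 0\n"))    // true
	fmt.Println(socksPortDisabled("SOCKSPort 9050\n")) // false
}
```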

ahxxm commented 2 years ago

I tried the following:

still failed to submit to archive.today, the debug log says connected to 9050, but it starts another tor anyway?

arc_1  | [2021-10-23T06:55:12] [DEBUG] [tor.go:103:useProxy] Connected: 127.0.0.1:9050
arc_1  | Oct 23 06:55:12.507 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 23 06:55:12.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.

Ah sorry, I assumed that the onion address wouldn't require a CAPTCHA. It turns out ARCHIVE_COOKIE=cf_clearance= is still needed.

How often does this cookie expire?

waybackarchiver commented 2 years ago

@ahxxm Thanks for reporting!

It seems that the current torrc disables 9050 by setting SOCKSPort 0?

Yes, SocksPort should be set to 9050. I will update it later.

still failed to submit to archive.today, the debug log says connected to 9050, but it starts another tor anyway?

Actually, this log comes from a Tor instance started by wayback/wbipfs, which is controlled by the --tor option.

It looks like this needs to be optimized.

How often does this cookie expire?

For the captcha, I can't determine its expiration time.

ahxxm commented 2 years ago

Turns out that its expiration time is quite short. How would you feel about implementing paid service APIs (along with the privacy-pass plugin in a headless browser, which can make one recognition "last" longer)?

waybackarchiver commented 2 years ago

Turns out that its expiration time is quite short. How would you feel about implementing paid service APIs (along with the privacy-pass plugin in a headless browser, which can make one recognition "last" longer)?

Privacy Pass is currently supported by Cloudflare to allow users to redeem validly signed tokens instead of completing CAPTCHA solutions. privacypass/challenge-bypass-extension

Unfortunately, it appears that only hCaptcha is currently Privacy Pass compatible, while the annoying reCAPTCHA is not. Aside from that, I'm aware of the following options.

waybackarchiver commented 2 years ago

The dessant/buster approach is possible, and we will focus next on developing a similar strategy.

hellodword commented 2 years ago

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) has a really good ecosystem for bypassing this stuff.

So how about building a tool that can convert its plugins so we can use them in Go?

waybackarchiver commented 2 years ago

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) has a really good ecosystem for bypassing this stuff.

So how about building a tool that can convert its plugins so we can use them in Go?

It's a fantastic idea, but I'm not sure how feasible it would be to implement.

waybackarchiver commented 2 years ago

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) has a really good ecosystem for bypassing this stuff. So how about building a tool that can convert its plugins so we can use them in Go?

It's a fantastic idea, but I'm not sure how feasible it would be to implement.

Running Chrome with Xvfb and then reaching a shared goal via an extension might be a possible solution.

ahxxm commented 2 years ago

https://blog.cloudflare.com/friendly-bots/

waybackarchiver commented 2 years ago

https://blog.cloudflare.com/friendly-bots/

Thank you for sharing. This is important news for us, and we will submit the form as soon as possible.

waybackarchiver commented 2 years ago

The proposal has been partially implemented, but it needs additional test scenarios, so anyone who is prepared to help is welcome to download the binaries from this workflow run for testing.

waybackarchiver commented 1 year ago

We found that even when using a proxy, the default Golang client was unable to pass CAPTCHA. After troubleshooting, we confirmed that the issue was related to TLS fingerprinting, and so we added a client using uTLS that allows for the specification of TLS fingerprints.

We have now successfully addressed the issue by using a network proxy, such as Cloudflare WARP.

To use a proxy like Cloudflare WARP, follow these steps:

  1. Sign up for a Cloudflare account.
  2. Obtain WARP credentials.
  3. Launch a Wireguard proxy.
  4. Export http_proxy and https_proxy.
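
For readers wondering how the exported variables in step 4 reach a Go program, here is a minimal sketch under stated assumptions. Go's default HTTP transport already honors http_proxy and https_proxy, so wayback itself should not need anything like this; the helper names proxyFromEnv and newProxiedClient, and the example address socks5://127.0.0.1:1080, are hypothetical and only make the lookup explicit.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
)

// proxyFromEnv returns the proxy URL named by https_proxy, falling back to
// http_proxy. getenv is injected so the lookup can be exercised without
// touching the real environment. Hypothetical helper, for illustration only.
func proxyFromEnv(getenv func(string) string) (*url.URL, error) {
	raw := getenv("https_proxy")
	if raw == "" {
		raw = getenv("http_proxy")
	}
	if raw == "" {
		return nil, fmt.Errorf("neither https_proxy nor http_proxy is set")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid proxy URL %q: %w", raw, err)
	}
	return u, nil
}

// newProxiedClient builds an HTTP client that routes requests through proxyURL.
func newProxiedClient(proxyURL *url.URL) *http.Client {
	return &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)}}
}

func main() {
	// Example (address is an assumption): export https_proxy=socks5://127.0.0.1:1080
	proxyURL, err := proxyFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("no proxy configured:", err)
		return
	}
	_ = newProxiedClient(proxyURL)
	fmt.Println("requests will be sent through", proxyURL.Host)
}
```

Routing traffic through a proxy is separate from the TLS fingerprint fix mentioned above, which the thread attributes to a uTLS-based client.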
