seanmonstar / reqwest

An easy and powerful Rust HTTP Client
https://docs.rs/reqwest
Apache License 2.0

reqwest is blocked (429) on reddit.com #1353

Closed kevincox closed 2 years ago

kevincox commented 3 years ago

I understand that this likely isn't a reqwest bug, and there may be nothing to be done on the reqwest side, but I figured at the very least it would be useful to have this show up if someone searches the issue tracker. Feel free to close if you don't want to take any action or track this problem.

Reddit denies all requests with 429. I'm fairly sure this isn't actual rate limiting unless we are falling into a strange bucket that is always empty. The error message mentions my IP, but curl requests from the same IP with the same headers (as far as I can tell) succeed. I'm having trouble nailing down how Reddit identifies reqwest: I can't seem to get low-level detail out of reqwest, and TLS makes it difficult to just use strace.

#[tokio::main]
async fn main() {
    let client = reqwest::Client::builder()
        .user_agent("some-unique-app.kevincox.ca/1")
        .connect_timeout(std::time::Duration::from_secs(60))
        .timeout(std::time::Duration::from_secs(600))
        .build().unwrap();

    let res = client.get("https://www.reddit.com")
        .header("accept", "*/*")
        .send().await.unwrap();

    dbg!(res.status());
    dbg!(res.headers());
    dbg!(res.text().await);
}
Cargo.toml

```toml
[package]
name = "tests"
version = "0.1.0"
edition = "2018"

[dependencies]
reqwest = "0.11.5"
tokio = { version = "1.12.0", features = ["rt", "rt-multi-thread", "macros"] }
```

% cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/tests`
[src/main.rs:12] res.status() = 429
[src/main.rs:13] res.headers() = {
    "connection": "close",
    "content-length": "234485",
    "retry-after": "0",
    "content-type": "text/html",
    "accept-ranges": "bytes",
    "date": "Thu, 14 Oct 2021 15:13:34 GMT",
    "via": "1.1 varnish",
    "cache-control": "private, max-age=3600",
    "strict-transport-security": "max-age=15552000; includeSubDomains; preload",
    "server": "snooserv",
    "x-clacks-overhead": "GNU Terry Pratchett",
}
[src/main.rs:14] res.text().await = Ok(
    "<!doctype html>\n<html>\n  <head>\n    <title>Too Many Requests</title>\n    <style>\n      body {\n          font: small verdana, arial, helvetica, sans-serif;\n          width: 600px;\n          margin: 0 auto;\n      }\n\n      h1 {\n          height: 40px;\n          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;\n      }\n\n      textarea.mushroom {\n          display: none\n      }\n    </style>\n  </head>\n  <body>\n    <h1>whoa there, pardner!</h1>\n\n<p>reddit's awesome and all, but you may have a bit of a\nproblem. we've seen far too many requests come from your ip address\nrecently.</p>...

Curl works.

% curl -v https://www.reddit.com -H 'user-agent: some-unique-app.kevincox.ca/1'
*   Trying 151.101.129.140:443...
* Connected to www.reddit.com (151.101.129.140) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /nix/store/vimyv3i6vgr82hhvrzxbmh3lvmn7hrmj-nss-cacert-3.66/etc/ssl/certs/ca-bundle.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=CALIFORNIA; L=SAN FRANCISCO; O=Reddit Inc.; CN=*.reddit.com
*  start date: Oct  5 00:00:00 2021 GMT
*  expire date: Apr  2 23:59:59 2022 GMT
*  subjectAltName: host "www.reddit.com" matched cert's "*.reddit.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS RSA SHA256 2020 CA1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x14399a0)
> GET / HTTP/2
> Host: www.reddit.com
> accept: */*
> user-agent: some-unique-app.kevincox.ca/1
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200 
< cache-control: private, s-maxage=0, max-age=0, must-revalidate, no-store
< content-type: text/html; charset=utf-8
< x-frame-options: SAMEORIGIN
< accept-ranges: bytes
< date: Thu, 14 Oct 2021 15:22:41 GMT
< via: 1.1 varnish
< vary: Accept-Encoding, Accept-Encoding
< set-cookie: loid=REDACTED; path=/; expires=Sat, 14 Oct 2023 15:22:41 GMT; domain=.reddit.com; secure; SameSite=None; Secure
< set-cookie: session_tracker=REDACTED; path=/; domain=.reddit.com; secure; SameSite=None; Secure
< set-cookie: token_v2=REDACTED; Path=/; Domain=reddit.com; Expires=Sat, 14 Oct 2023 15:22:41 GMT; HttpOnly; Secure
< set-cookie: csv=1; Max-Age=63072000; Domain=.reddit.com; Path=/; Secure; SameSite=None
< set-cookie: edgebucket=WI5ie9I7xqzWBU3gPo; Domain=reddit.com; Max-Age=63071999; Path=/;  secure
< strict-transport-security: max-age=15552000; includeSubDomains; preload
< server: snooserv
< x-clacks-overhead: GNU Terry Pratchett
< 

    <!DOCTYPE html>
...

I've tried multiple IP addresses and tried to match curl headers. However it seems like Reddit can always tell them apart and denies requests from reqwest.

seanmonstar commented 3 years ago

I do know Reddit does some categorizing based on several details about headers to determine if it's a client to let in, or a bot/script to block.

One difference I see is that curl is using HTTP2. You could either force curl to use HTTP1, or disable reqwest's usage of its default-tls and instead enable rustls-tls, which has ALPN support and will get it talking h2.

kevincox commented 3 years ago

I tried forcing curl to use 1.1 and 1.0 and it still succeeded. I honestly can't figure out the difference between the requests except for maybe header order? (But I can't see the header order of reqwest.) I'll give rustls-tls a try and see if it helps.
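One way to see the header order a client actually sends is to point it at a throwaway local listener over plain HTTP and print the raw request head. The following is a minimal sketch using only the Rust standard library; the function name `capture_request_head` and the stand-in client in `main` are hypothetical, not part of reqwest. It assumes the client speaks HTTP/1.1 over an unencrypted connection, so it shows header order and casing only and says nothing about TLS or ALPN behaviour.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Accept one plain-HTTP connection and return its request line and headers
/// exactly as sent (order and casing preserved), then answer with an empty 200.
fn capture_request_head(listener: &TcpListener) -> io::Result<Vec<String>> {
    let (stream, _peer) = listener.accept()?;
    let mut reader = BufReader::new(&stream);
    let mut lines = Vec::new();
    loop {
        let mut line = String::new();
        let read = reader.read_line(&mut line)?;
        let trimmed = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
        // A blank line ends the header block; 0 bytes means the peer hung up.
        if read == 0 || trimmed.is_empty() {
            break;
        }
        lines.push(trimmed.to_string());
    }
    // Minimal response so the client does not wait for a body.
    (&stream).write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 0\r\nconnection: close\r\n\r\n")?;
    Ok(lines)
}

fn main() -> io::Result<()> {
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    println!("listening on http://{}", addr);

    // Stand-in client so the sketch runs on its own (an assumption for the
    // demo); in practice, send the request from the program under test.
    let client = thread::spawn(move || -> io::Result<String> {
        let mut conn = TcpStream::connect(addr)?;
        conn.write_all(
            b"GET / HTTP/1.1\r\nhost: localhost\r\nuser-agent: demo/1\r\naccept: */*\r\n\r\n",
        )?;
        let mut response = String::new();
        conn.read_to_string(&mut response)?;
        Ok(response)
    });

    // Print the request head in the order it arrived on the wire.
    for line in capture_request_head(&listener)? {
        println!("{}", line);
    }
    client.join().expect("client thread panicked")?;
    Ok(())
}
```

To inspect reqwest itself, bind to a fixed port instead of port 0, drop the stand-in client thread, and change the URL in the failing program to `http://127.0.0.1:<port>`. Running curl against the same listener gives a second capture to compare line by line.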

kevincox commented 3 years ago

Strange. Enabling "native-tls-alpn" for HTTP2 didn't work but using rustls worked.

seanmonstar commented 3 years ago

reqwest doesn't know how to use native-tls' ALPN support.

kevincox commented 3 years ago

Well I'm not sure what I am doing wrong but my NGINX logs are saying protocol: HTTP/2.0 with use_native_tls() and features = ["native-tls-alpn", "trust-dns"] 🤷

LGUG2Z commented 3 years ago

I also came across this issue in a service I run. Upgrading from 0.10 to 0.11 and using rustls-tls seems to have fixed the issue, and I'm no longer receiving 429s from Reddit.

reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }

This is how I'm installing reqwest in my Cargo.toml. Hopefully copying and pasting this helps someone else who is stuck with the same issue.

ducaale commented 3 years ago

> Strange. Enabling "native-tls-alpn" for HTTP2 didn't work but using rustls worked.

@kevincox your code works for me if I enable native-tls-alpn:

#[tokio::main]
async fn main() {
    let client = reqwest::Client::builder()
        .user_agent("some-unique-app.kevincox.ca/1")
        .connect_timeout(std::time::Duration::from_secs(60))
        .timeout(std::time::Duration::from_secs(600))
        .build().unwrap();

    let res = client
        .get("https://www.reddit.com")
        .header("accept", "*/*")
        .send().await.unwrap();

    dbg!(res.status());
    dbg!(res.headers());
    dbg!(res.text().await);
}

Cargo.toml

[package]
name = "temp-reddit"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
reqwest = { version = "0.11.5", features = ["native-tls-alpn"] }
tokio = { version = "1.12.0", features = ["rt", "rt-multi-thread", "macros"] }
$ cargo run
[src/main.rs:17] res.status() = 200
[src/main.rs:18] res.headers() = {
    "cache-control": "private, s-maxage=0, max-age=0, must-revalidate, no-store",
    "content-type": "text/html; charset=utf-8",
    "accept-ranges": "bytes",
    "date": "Sun, 24 Oct 2021 11:21:08 GMT",
    "via": "1.1 varnish",
    "vary": "Accept-Encoding, Accept-Encoding",
    "set-cookie": "loid=0000000000ftjg9803.2.1635074468000.Z0FBQUFBQmhkVUdrblhDODlFcW0zT0FqV2RfVUdhY2ZBb1Y2S0pvNFJFeXRTbDloRFhKcWZoNW1uY3JEc0lUUTVuSVkxU2swZUV1RzVXUjUtSVNuVFhYeDBWUllacGVTMUJYME82RzJJcnRfR25lbEx5NHZIZ0JheFJFZzB1d3d2M0tSUGZTTU5ZY0w; path=/; expires=Tue, 24 Oct 2023 11:21:08 GMT; domain=.reddit.com; secure; SameSite=None; Secure",
    "set-cookie": "session_tracker=opnjeprrqggamfbfam.0.1635074468413.Z0FBQUFBQmhkVUdraVdZSGZNRl9tTlRzTFFGWDNpLWMzbHlYQTk5NlVfeUUxdFF4VWlfN2l0eC0tNUVqT0RJcEMyZWxzZ3lCel9IUHdPR1pZM3NjQXpSZVVIcEg0NnREb2lyYjJ2ODlmbUxUakFXa3ZnZGJfRUdraldjUFk2eS1xNXMxTFFHdnlKODk; path=/; domain=.reddit.com; secure; SameSite=None; Secure",
    "set-cookie": "token_v2=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2MzUwNzc5NDgsInN1YiI6Ii16RHU5UUloY2Y3RVRRMUM3TmlQT0QwWGUwWXM4VXciLCJsb2dnZWRJbiI6ZmFsc2UsInNjb3BlcyI6WyIqIiwiZW1haWwiLCJwaWkiXX0.mEdtl8jJnJCeP-w9-SQNXwUCfrognWC3MfNDQ-Qp2bI; Path=/; Domain=reddit.com; Expires=Tue, 24 Oct 2023 11:21:08 GMT; HttpOnly; Secure",
    "set-cookie": "csv=1; Max-Age=63072000; Domain=.reddit.com; Path=/; Secure; SameSite=None",
    "set-cookie": "edgebucket=tw6VtRFWwR6TEcx7oW; Domain=reddit.com; Max-Age=63071999; Path=/;  secure",
    "strict-transport-security": "max-age=31536000; includeSubdomains",
    "x-content-type-options": "nosniff",
    "x-frame-options": "SAMEORIGIN",
    "x-xss-protection": "1; mode=block",
    "server": "snooserv",
    "x-clacks-overhead": "GNU Terry Pratchett",
}
[src/main.rs:19] res.text().await = Ok(
    "\n    <!DOCTYPE html>\n    <html lang=\"en-US\">\n      <head>\n        <script>\n    var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;\n    function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };\n    var __firstPostLoaded = false;\n    function __markFirstPostVisible() {\n

// further output omitted

kevincox commented 3 years ago

Hmm, I thought that wasn't working before but it is indeed working for me now too. This also makes the requests use HTTP/2 based on logs from my nginx.