volsa / etherface

Ethereum Signature Database
https://www.etherface.io
GNU General Public License v3.0
174 stars 22 forks

Bypassing scrape protection on etherscan #20

Open kalanyuz opened 1 year ago

kalanyuz commented 1 year ago

Etherscan recently deployed a Cloudflare script that returns a 403 if you access the website from a script.

volsa commented 1 year ago

This is such a dick move from Etherscan, but it doesn't really matter as Etherscan scraping has been disabled for a while with the current Etherface deployment. That being said, I'd be happy to accept a PR for this issue if you're interested in working on this.

kalanyuz commented 1 year ago

Do you have any recommendations on where to begin, @volsa? Off the top of my head, this situation could be handled with Selenium. Not sure if there's a workaround for Rust.

volsa commented 1 year ago

Yeah, Selenium was the first solution that popped into my mind. The other was embedding Python code using PyO3 to use cloudscraper, because no such Rust libraries exist, but I'm not sure if that library is even working atm. Long-term, Selenium is probably the better solution though.

volsa commented 1 year ago

I did some quick research to see if this can be accomplished in Rust using ChromeDriver, and it kind of works. Key findings were:

  1. ChromeDriver has to be patched before it can be used, because Cloudflare otherwise blocks the request. To do that, download https://chromedriver.storage.googleapis.com/index.html?path=112.0.5615.49/ and then apply the patch from https://github.com/ultrafunkamsterdam/undetected-chromedriver/blob/bf7dcf8b5713020de7454844fb80036b8c456503/undetected_chromedriver/patcher.py#L217-L239
  2. The flags --disable-blink-features and --disable-blink-features=AutomationControlled must be set; I haven't tested whether either one alone is sufficient, but it should be.
  3. (macOS ARM only) Patching the ARM ChromeDriver results in panics, so the x86_64 version has to be used via Rosetta.
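
For anyone who wants to do the patching step from Rust instead of Python, here is a rough sketch of the general idea only. It is not a port of the linked patcher.py: it assumes that patching boils down to overwriting some marker bytes in the ChromeDriver binary while keeping the file size unchanged. The function names and the marker are hypothetical, so check the linked Python code for what actually has to be replaced.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Overwrites every occurrence of `marker` in `data` with `filler` bytes.
/// The length of `data` never changes, so offsets inside the binary stay valid.
/// Returns the number of occurrences that were overwritten.
fn blank_out_marker(data: &mut [u8], marker: &[u8], filler: u8) -> usize {
    if marker.is_empty() || marker.len() > data.len() {
        return 0;
    }
    let mut count = 0;
    let mut i = 0;
    while i + marker.len() <= data.len() {
        if &data[i..i + marker.len()] == marker {
            // Same-length overwrite: replace the marker in place.
            data[i..i + marker.len()].fill(filler);
            count += 1;
            i += marker.len();
        } else {
            i += 1;
        }
    }
    count
}

/// Reads the file at `path`, blanks out `marker` with spaces, and writes the
/// result back only if at least one occurrence was found.
/// `marker` is a hypothetical placeholder; the real bytes to replace are
/// defined by the linked patcher, not by this sketch.
fn patch_file(path: &Path, marker: &[u8]) -> io::Result<usize> {
    let mut data = fs::read(path)?;
    let count = blank_out_marker(&mut data, marker, b' ');
    if count > 0 {
        fs::write(path, &data)?;
    }
    Ok(count)
}
```

Usage would be something like patch_file(Path::new("./chromedriver"), b"some_marker") on a copy of the downloaded binary. A return value of 0 means nothing matched and the file was left untouched.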

Running the following code, which uses the fantoccini library, should then bypass the Cloudflare protection.

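use fantoccini::{ClientBuilder, Locator};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Chrome-specific options that are passed through to ChromeDriver.
    let mut caps = serde_json::map::Map::new();
    caps.insert(
        "goog:chromeOptions".to_string(),
        serde_json::json!({
            "args": [
                // "--headless=new",
                "--disable-blink-features",
                "--disable-blink-features=AutomationControlled",
            ]
        }),
    );

    // Expects the patched ChromeDriver to be listening here, e.g. started with --port=4444.
    let client = ClientBuilder::native().capabilities(caps).connect("http://localhost:4444").await?;

    client.goto("https://etherscan.io/contractsVerified").await?;
    // Wait until the main content section exists, i.e. the real page has loaded.
    let res = client.wait().for_element(Locator::Css("#content > section.container-xxl.pt-5.pb-12")).await?;

    // Print the inner HTML of that section; propagate errors instead of panicking.
    let html = res.html(true).await?;
    println!("{html}");

    // End the WebDriver session so the browser window does not linger.
    client.close().await?;

    Ok(())
}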

https://user-images.githubusercontent.com/29666622/233849890-57bd2463-0079-46d9-b945-c4101e346ca2.mov

Ideally this can be merged with https://github.com/volsa/etherface/blob/master/etherface-lib/src/api/etherscan.rs