Open kalanyuz opened 1 year ago
This is such a dick move from Etherscan, but it doesn't really matter as Etherscan scraping has been disabled for a while with the current Etherface deployment. That being said, I'd be happy to accept a PR for this issue if you're interested in working on this.
Do you have any recommendations on where to begin, @volsa? Off the top of my head, this situation could be handled with Selenium. Not sure if there's a workaround for Rust.
Yeah, Selenium was the first solution that popped into my mind. The other was embedding Python code using PyO3 to use cloudscraper because no such Rust libraries exist, but I'm not sure if the library is even working atm. Long-term, Selenium is probably the better solution though.
I did some quick research to see if this can be accomplished in Rust using ChromeDriver, and it kind of works. Key findings were:
--disable-blink-features and --disable-blink-features=AutomationControlled must be set; I haven't tested whether either one alone is sufficient, but it should be?
Calling the following code using the fantoccini library should then bypass the CF protection.
use fantoccini::{ClientBuilder, Locator};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Chrome-specific options are passed via the "goog:chromeOptions" capability.
    let mut caps = serde_json::map::Map::new();
    caps.insert(
        "goog:chromeOptions".to_string(),
        serde_json::json!({
            "args": [
                // "--headless=new",
                "--disable-blink-features",
                "--disable-blink-features=AutomationControlled",
            ]
        }),
    );

    // Assumes a WebDriver server (e.g. `chromedriver --port=4444`) is already running.
    let client = ClientBuilder::native()
        .capabilities(caps)
        .connect("http://localhost:4444")
        .await?;

    client.goto("https://etherscan.io/contractsVerified").await?;

    // Wait until the main content section has rendered, then dump its inner HTML.
    let res = client
        .wait()
        .for_element(Locator::Css("#content > section.container-xxl.pt-5.pb-12"))
        .await?;
    let html = res.html(true).await?;
    println!("{html}");

    // End the WebDriver session so the browser does not linger.
    client.close().await?;
    Ok(())
}
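The snippet above passes both flags, and it is still untested whether one of them alone is enough. A minimal sketch for enumerating the combinations to try, using only the standard library; `FLAGS` and `flag_variants` are hypothetical names, not part of Etherface or fantoccini:

```rust
/// The two Chrome flags under test, copied from the snippet above.
const FLAGS: [&str; 2] = [
    "--disable-blink-features",
    "--disable-blink-features=AutomationControlled",
];

/// Returns every non-empty subset of `FLAGS` as an argument list.
/// Each list is meant to be used as the "args" array of "goog:chromeOptions".
pub fn flag_variants() -> Vec<Vec<&'static str>> {
    let mut variants = Vec::new();
    // Each bit of `mask` decides whether the flag at that index is included.
    for mask in 1..(1u32 << FLAGS.len()) {
        let args: Vec<&'static str> = FLAGS
            .iter()
            .enumerate()
            .filter(|(i, _)| (mask & (1u32 << *i)) != 0)
            .map(|(_, flag)| *flag)
            .collect();
        variants.push(args);
    }
    variants
}
```

Each returned list could be substituted for the hard-coded `"args"` array, running the page load once per variant to see which ones still get past the CF check.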
Ideally this can be merged with https://github.com/volsa/etherface/blob/master/etherface-lib/src/api/etherscan.rs
Recently, they deployed a Cloudflare script that returns 403 if you access the website from a script.
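To tell such a block apart from an ordinary 403, a scraper could inspect the response headers before parsing. The following is a rough heuristic, not something Etherface does today: `looks_like_cloudflare_block` is a hypothetical helper, and it assumes the block response carries either a `Server: cloudflare` header or Cloudflare's `cf-mitigated: challenge` header.

```rust
/// Heuristic check for a Cloudflare block or challenge response.
/// `headers` holds (name, value) pairs as returned by whichever HTTP client is in use.
pub fn looks_like_cloudflare_block(status: u16, headers: &[(&str, &str)]) -> bool {
    // The issue reports a 403, so any other status is treated as not blocked.
    if status != 403 {
        return false;
    }
    headers.iter().any(|(name, value)| {
        // Header names are case-insensitive, so compare in lowercase.
        let name = name.to_ascii_lowercase();
        let value = value.to_ascii_lowercase();
        (name == "server" && value.contains("cloudflare"))
            || (name == "cf-mitigated" && value == "challenge")
    })
}
```

A caller could use this to skip the page, or to fall back to a browser-driven fetch, instead of trying to parse a challenge page as contract data.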