skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License
815 stars 59 forks

[BUG] Unable to crawling the mvnrepository site #151

Closed chachako closed 3 years ago

chachako commented 3 years ago

Crawling this website with skrape.it returns the wrong HTML; requesting it directly with jsoup returns a 403, but via OkHttp everything works. Can this be solved?

My current solution is:

// Hedged: wildcard imports assuming skrape.it 1.x and okhttp3 on the
// classpath; the exact package layout may differ between versions.
import it.skrape.fetcher.*
import okhttp3.OkHttpClient

skrape(OkHttpFetcher) {
  request { url = "https://mvnrepository.com/artifact/kotlin" }
  println(scrape().responseBody)
}

// Custom fetcher that delegates the actual HTTP call to OkHttp.
object OkHttpFetcher : NonBlockingFetcher<Request> {
  override val requestBuilder: Request get() = Request()

  @Suppress("BlockingMethodInNonBlockingContext")
  override suspend fun fetch(request: Request): Result = OkHttpClient().newCall(
    okhttp3.Request.Builder()
      .url(request.url)
      .build()
  ).execute().let {
    val body = it.body!!
    Result(
      responseBody = body.string(),
      responseStatus = Result.Status(it.code, it.message),
      contentType = body.contentType()?.toString()?.replace(" ", ""),
      headers = it.headers.toMap(),
      cookies = emptyList(),
      baseUri = it.request.url.toString()
    )
  }
}

christian-draeger commented 3 years ago

hey, it is simply because you are building a request object but then not calling the correct function to run the scraper :) you can try something like this:

fun main() {
    val responseBody = skrape(HttpFetcher) {
        request {
            url = "https://mvnrepository.com/artifact/kotlin"
        }
        extract { // <--- calling extract (or expect) takes the request object, runs the scrape
                  // function of the passed fetcher, and makes the result available inside the lambda
            responseBody
        }
    }

    println(responseBody)
}

similar to what is documented in the README.md --> https://github.com/skrapeit/skrape.it#testing-html-responses

you can use a custom implementation like in your given example if you need special behavior, or just use an already implemented one like the HttpFetcher by simply passing it :)

just let me know if it helped or if you have further questions :)

chachako commented 3 years ago

@christian-draeger No, the response contains the same error. If you compare the returned HTML with the actual page, you can see the content is wrong.

chachako commented 3 years ago

Error response: (screenshot)

Correct response: (screenshot)

christian-draeger commented 3 years ago

Ok now I get it :) Looks like mvnrepository is trying to block crawlers by adding a captcha to their site.

My first guess would be that you could try to add a user-agent header to the request to "hide" that the request hasn't been made by a real browser.

To do so you can add a userAgent parameter to the request:

const val CHROME_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"

skrape(HttpFetcher) {
    request {
        url = "your.url"
        userAgent = CHROME_UA
    }
    extract { ... }
}

Sorry for bad code example formatting. I'm currently answering from my phone 😅

chachako commented 3 years ago

@christian-draeger Yes, I tried this, but unfortunately it doesn't work. I suspect this is a problem with the Apache engine, because when I access the site with Ktor it also reports a 403, but with the OkHttp engine the response content is correct.

Maybe you can help me figure out this problem? I would be grateful 🙂 Otherwise, skrape.it should perhaps add an OkHttp engine extension.

christian-draeger commented 3 years ago

Interesting would be how the requests made with apache-http (HttpFetcher) differ from the ones made with okhttp. In the end it's all just a standardized HTTP request, but it seems that okhttp uses some defaults that keep mvnrepository.com's server from detecting these calls as a "bot".

We could try to send a request with both HttpFetcher and your fetcher implementation with okhttp to https://httpbin.org/anything

Httpbin just echoes the complete HTTP request back, so we can compare what differs between the fetchers.
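Once both httpbin responses are in hand, the two `headers` objects can also be diffed mechanically instead of eyeballing them. A minimal sketch in plain Kotlin (no skrape.it dependency; `diffHeaders` is a hypothetical helper, and the sample maps are purely illustrative):

```kotlin
// Returns the header keys whose values differ between the two maps
// (including keys present on only one side), paired with both values.
fun diffHeaders(
    a: Map<String, String>,
    b: Map<String, String>
): Map<String, Pair<String?, String?>> =
    (a.keys + b.keys)
        .filter { a[it] != b[it] }
        .associateWith { a[it] to b[it] }

fun main() {
    val apacheHeaders = mapOf("Accept" to "*/*", "User-Agent" to "Mozilla/5.0 skrape.it")
    val okhttpHeaders = mapOf("Accept" to "*/*", "User-Agent" to "okhttp/4.9.0")
    // Only the differing entries are reported.
    println(diffHeaders(apacheHeaders, okhttpHeaders))
}
```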

if i do

skrape(HttpFetcher) {
    request {
        url = "http://httpbin.org/anything"
    }
    extract {
        println(responseBody)
    }
}

i am getting:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Charset": "UTF-8", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 skrape.it", 
    "X-Amzn-Trace-Id": "Root=1-60dadfc5-276952652aa8228c187e5304"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "87.174.82.109", 
  "url": "http://httpbin.org/anything"
}

can you try with your okhttp version and post the response?

Another interesting fact: according to its robots.txt, mvnrepository.com wants to be crawled by Google only :D https://mvnrepository.com/robots.txt

I mean, technically that will not prevent anyone from scraping them, but the response looks like they have a Cloudflare firewall in between that maybe enforces these settings. All the more interesting what okhttp sends in its request so as not to be detected as a bot.
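The robots.txt claim is easy to check programmatically. A deliberately simplified sketch in plain Kotlin (not a full Robots Exclusion Protocol parser, and the sample rules are illustrative rather than mvnrepository's exact file):

```kotlin
// Minimal robots.txt check: parses User-agent / Disallow directives into
// per-agent groups and reports whether the path is disallowed for the agent.
fun isDisallowed(robotsTxt: String, agent: String, path: String): Boolean {
    val rules = mutableMapOf<String, MutableList<String>>()
    var current: String? = null
    for (raw in robotsTxt.lines()) {
        val line = raw.substringBefore('#').trim()
        when {
            line.startsWith("User-agent:", ignoreCase = true) ->
                current = line.substringAfter(":").trim()
                    .also { rules.getOrPut(it) { mutableListOf() } }
            line.startsWith("Disallow:", ignoreCase = true) ->
                current?.let { rules.getValue(it).add(line.substringAfter(":").trim()) }
        }
    }
    // An exact agent match takes precedence over the wildcard group;
    // an empty Disallow value means "everything is allowed".
    val group = rules[agent] ?: rules["*"] ?: return false
    return group.any { it.isNotEmpty() && path.startsWith(it) }
}

fun main() {
    // Illustrative rules resembling a "Google only" policy.
    val sample = """
        User-agent: Googlebot
        Disallow:

        User-agent: *
        Disallow: /
    """.trimIndent()
    println(isDisallowed(sample, "Googlebot", "/artifact/kotlin")) // false
    println(isDisallowed(sample, "skrape.it", "/artifact/kotlin")) // true
}
```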

Here I found a really good article on avoiding detection while scraping: https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
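One of the core tricks from such articles, rotating between several realistic user-agent strings, maps directly onto the request's userAgent parameter shown earlier. A hedged sketch in plain Kotlin (the UA strings are just examples, and pickUserAgent is a hypothetical helper, not part of skrape.it):

```kotlin
import kotlin.random.Random

// Example pool of realistic desktop browser user agents.
val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)

// Pick a different user agent per request; Random is injectable for testing.
fun pickUserAgent(random: Random = Random.Default): String =
    userAgents[random.nextInt(userAgents.size)]

// Inside a skrape request it would then be used as:
// request {
//     url = "https://mvnrepository.com/artifact/kotlin"
//     userAgent = pickUserAgent()
// }
```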

chachako commented 3 years ago

@christian-draeger for okhttp:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Charset": "UTF-8", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 skrape.it", 
    "X-Amzn-Trace-Id": "Root=1-60db15d7-4389581030369c1b24e77689"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "120.84.9.222", 
  "url": "http://httpbin.org/anything"
}

christian-draeger commented 3 years ago

Looks exactly the same as the apache-http call to me 🤔 Technically it makes absolutely no sense that one setup works and the other doesn't 😅

christian-draeger commented 3 years ago

Since this is really specific to the bot detection of the target website you want to scrape, I think I will close this issue. From what I could investigate it is not a bug in the library, and you have a workaround.