Closed — chachako closed this issue 3 years ago
hey, it is simply because you are building a request object but then not calling the correct function to actually run the scraper :) you can try something like this:
fun main() {
    val responseBody = skrape(HttpFetcher) {
        request {
            url = "https://mvnrepository.com/artifact/kotlin"
        }
        extract { // <-- calling extract (or expect) takes the request object and executes the scrape function of the passed fetcher; the result object is then available inside the lambda
            responseBody
        }
    }
    println(responseBody)
}
similar to what is documented in the README.md --> https://github.com/skrapeit/skrape.it#testing-html-responses
you can use a custom fetcher implementation like in your given example if you need something special, or just use an already implemented one like the HttpFetcher by simply passing it :)
just let me know if it helped or if you have further questions :)
@christian-draeger No, the response still contains the same error. If you compare the returned HTML with the actual web page, you can see the content is wrong.
Error response:
Correct response:
Ok now I get it :) Looks like mvnrepository is trying to block crawlers by adding a captcha to their site.
My first guess would be to add a user-agent header to the request to "hide" that it wasn't made by a real browser.
To do so you can set the userAgent parameter on the request:
const val CHROME_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"

skrape(HttpFetcher) {
    request {
        url = "your.url"
        userAgent = CHROME_UA
    }
    extract { ... }
}
Sorry for bad code example formatting. I'm currently answering from my phone 😅
@christian-draeger Yes, I tried this, but unfortunately it doesn't work. I suspect the problem is the Apache engine, because when I use ktor to visit it, it reports 403, but if I use the okhttp engine, the response content is correct.
Maybe you can help me figure out this problem? I would be grateful 🙂 . Otherwise, skrape.it should perhaps add an okhttp engine extension.
Interesting would be how the requests made with apache-http (HttpFetcher) differ from the ones made with okhttp. In the end it's all just a standardized HTTP request, but it seems okhttp uses some defaults that keep mvnrepository.com's server from detecting these calls as a "bot".
We could try to send a request with both HttpFetcher and your fetcher implementation with okhttp to https://httpbin.org/anything
httpbin just echoes back the complete HTTP request, so we can compare what differs between the fetchers.
if i do
skrape(HttpFetcher) {
    request {
        url = "http://httpbin.org/anything"
    }
    extract {
        println(responseBody)
    }
}
i am getting:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Charset": "UTF-8",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 skrape.it",
    "X-Amzn-Trace-Id": "Root=1-60dadfc5-276952652aa8228c187e5304"
  },
  "json": null,
  "method": "GET",
  "origin": "87.174.82.109",
  "url": "http://httpbin.org/anything"
}
can you try with your okhttp version and post the response?
Another interesting fact: according to their robots.txt, mvnrepository.com wants to be crawled by Google only :D https://mvnrepository.com/robots.txt
Technically that won't stop anyone from scraping them, but the response looks like they have a Cloudflare firewall in between that may be enforcing these settings. All the more interesting what okhttp sends in its requests to not be detected as a bot.
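As a side note, a policy like that can be spotted programmatically before scraping. Here is a minimal sketch in plain Kotlin (deliberately simplified, not a spec-compliant robots.txt parser; the `parseRobots` and `isBlanketDisallowed` helpers are illustrative, not part of skrape.it):

```kotlin
// Minimal, simplified robots.txt inspection (not a full spec-compliant parser).
// Groups consecutive "User-agent" lines with the "Disallow" rules that follow them.
fun parseRobots(txt: String): List<Pair<Set<String>, List<String>>> {
    val groups = mutableListOf<Pair<MutableSet<String>, MutableList<String>>>()
    var lastWasAgent = false
    for (raw in txt.lines()) {
        val line = raw.substringBefore('#').trim()
        if (line.isEmpty() || ':' !in line) continue
        val key = line.substringBefore(':').trim().lowercase()
        val value = line.substringAfter(':').trim()
        when (key) {
            "user-agent" -> {
                // a new group starts when a user-agent line follows a rule line
                if (!lastWasAgent || groups.isEmpty()) groups.add(mutableSetOf<String>() to mutableListOf())
                groups.last().first.add(value.lowercase())
                lastWasAgent = true
            }
            "disallow" -> {
                lastWasAgent = false
                groups.lastOrNull()?.second?.add(value)
            }
            else -> lastWasAgent = false
        }
    }
    return groups
}

// True if the group matching this agent (or the "*" group, when no named
// group exists) contains a blanket "Disallow: /".
fun isBlanketDisallowed(txt: String, agent: String): Boolean {
    val groups = parseRobots(txt)
    val named = groups.filter { agent.lowercase() in it.first }
    val chosen = named.ifEmpty { groups.filter { "*" in it.first } }
    return chosen.any { "/" in it.second }
}
```

With a robots.txt that disallows everyone except a named bot (roughly the shape of mvnrepository's, not a verbatim copy), `isBlanketDisallowed` returns true for an unnamed crawler and false for the named one.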
Here I found a really good article on avoiding detection while scraping: https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
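One of the techniques from that article, rotating user agents, could be sketched like this in plain Kotlin (the pool entries and the `pickUserAgent` helper are illustrative; the skrape.it usage is shown in comments only, following the request shape used earlier in this thread):

```kotlin
// Illustrative pool of real-browser User-Agent strings; extend as needed.
val userAgentPool = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15"
)

// Hypothetical helper: pick a (possibly different) UA for each request.
fun pickUserAgent(): String = userAgentPool.random()

// Plugged into a skrape.it request it would look roughly like:
// skrape(HttpFetcher) {
//     request {
//         url = "your.url"
//         userAgent = pickUserAgent()
//     }
//     extract { responseBody }
// }
```

Whether rotation actually helps depends entirely on the target's bot detection; it's just one of several signals a firewall may look at.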
@christian-draeger for okhttp:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Charset": "UTF-8",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 skrape.it",
    "X-Amzn-Trace-Id": "Root=1-60db15d7-4389581030369c1b24e77689"
  },
  "json": null,
  "method": "GET",
  "origin": "120.84.9.222",
  "url": "http://httpbin.org/anything"
}
Looks to be the exact same as the apache-http call to me 🤔 Technically it makes absolutely no sense that one setup works and the other doesn't 😅
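For completeness, the two header sets can be diffed programmatically instead of by eye. A small sketch (`diffHeaders` is a hypothetical helper; the maps are copied from the two httpbin outputs above):

```kotlin
// Returns each header key whose value differs, mapped to its (first, second) values.
fun diffHeaders(a: Map<String, String>, b: Map<String, String>): Map<String, Pair<String?, String?>> =
    (a.keys + b.keys).filter { a[it] != b[it] }.associateWith { a[it] to b[it] }

// Headers from the apache-http (HttpFetcher) httpbin response above.
val apacheHeaders = mapOf(
    "Accept" to "*/*",
    "Accept-Charset" to "UTF-8",
    "Host" to "httpbin.org",
    "User-Agent" to "Mozilla/5.0 skrape.it",
    "X-Amzn-Trace-Id" to "Root=1-60dadfc5-276952652aa8228c187e5304"
)

// Headers from the okhttp httpbin response above.
val okhttpHeaders = mapOf(
    "Accept" to "*/*",
    "Accept-Charset" to "UTF-8",
    "Host" to "httpbin.org",
    "User-Agent" to "Mozilla/5.0 skrape.it",
    "X-Amzn-Trace-Id" to "Root=1-60db15d7-4389581030369c1b24e77689"
)
```

Running `diffHeaders(apacheHeaders, okhttpHeaders)` on these maps leaves only `X-Amzn-Trace-Id`, which httpbin's infrastructure generates per request anyway, so the two fetchers really do send identical headers.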
Since this is really specific to the bot detection of the target website you want to scrape, I think I will close this issue. From what I could investigate it is not a bug in the library, and you have a workaround.
Crawling this website with skrape.it gets the wrong HTML, using jsoup directly gets a 403, but via okhttp everything is normal. Can this be solved? My current solution is: