skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[BUG] parsing long htmls tree with js execution fails #134

Closed christian-draeger closed 3 years ago

christian-draeger commented 3 years ago

Describe the bug

Copied over from the skrape{it} Slack channel:

Hello! For the most part I'm loving doing idiomatic scraping with skrape{it}! However, I'm having trouble getting the JS-rendered functionality working. The example at https://docs.skrape.it/docs/dsl/extract-client-side-rendered-data looks like it's for a previous version, since mode = Mode.DOM is no longer available. According to the docs on GitHub, it looks like all I should need to do is pass BrowserFetcher to the skrape function as an argument, but that doesn't seem to do the trick. I tried setting jsExecution to true, e.g. something like this:

Code Sample

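// Fetch the page with the browser-based fetcher, then re-parse the body with JS execution enabled.
skrape(BrowserFetcher) {
    request { url = urlToScrape }
    // extract must sit inside the skrape block so that responseBody and baseUri are in scope
    extract {
        htmlDocument(html = responseBody, baseUri = baseUri, jsExecution = true) {
            val t = title { findFirst { text } }
            i { "Got title:$t" }
        }
    }
}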

Expected behavior

It should be possible to render large HTML trees.

Additional context

FWIW, after some investigation, the issue seems to lie in the Parser object. Specifically, the toUriScheme() method was producing a massive URL which, when later added as a Referer header, carried around 44 KB of content. That was too large for the server to accept, hence the HTTP 400 response. Truncating the URL to 200 bytes made the errors go away, but the generated DOM content was then incomplete, so at this point I don't see a way of using skrape{it} for JS-based sites such as this one. If any fixes are made that make parsing this page viable with skrape{it}, I will definitely try again!
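To make the size problem concrete, here is a minimal sketch of why a URL built from the page's HTML grows with the page, and why truncating it loses content. It assumes the conversion embeds the whole document in a data: URI; that is an assumption for illustration only, not a copy of what toUriScheme() actually does. The helpers toDataUri and buildLargeHtml are hypothetical names and use only the JVM standard library.

```kotlin
import java.net.URLEncoder

// Hypothetical helper (an assumption, not skrape{it}'s code):
// embeds an HTML string in a data: URI by URL-encoding it.
fun toDataUri(html: String): String =
    "data:text/html;charset=UTF-8," + URLEncoder.encode(html, "UTF-8")

// Hypothetical helper: builds a synthetic document with `rows` list items
// to stand in for a large, real-world page.
fun buildLargeHtml(rows: Int): String =
    buildString {
        append("<html><body><ul>")
        repeat(rows) { append("<li>item $it</li>") }
        append("</ul></body></html>")
    }

fun main() {
    val html = buildLargeHtml(2_000)
    val uri = toDataUri(html)

    // The URI is at least as long as the document itself, so any header
    // carrying it (such as Referer) grows with the page size.
    println("HTML length: ${html.length}, data URI length: ${uri.length}")

    // Cutting the URI to 200 characters keeps the header small, but the
    // end of the document (including the closing </html> tag) is gone.
    val truncated = uri.take(200)
    val closingTag = URLEncoder.encode("</html>", "UTF-8")
    println("Truncated URI contains closing tag: ${truncated.contains(closingTag)}")
}
```

Under that assumption, the two observations in the report follow directly: the full URI is too large to send as a header, and the truncated one no longer describes the whole page.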