skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[QUESTION] Are there cases where BrowserFetcher does not fully support CSR? #224

Open pistolcaffe opened 1 year ago

pistolcaffe commented 1 year ago

Describe what you want to achieve

I am creating a user guide page for my app, and I need to crawl that page from within the app. (I need to crawl certain URLs in the app, as well as Notion pages.) https://fundevstudio.notion.site/524eafbfa8f2414898d6d8d79f222c05?pvs=4

However, even when I use BrowserFetcher with its default settings, I cannot get the title of the loaded page.

Please let me know if there is another way to do this.

Code Sample

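import it.skrape.core.htmlDocument
import it.skrape.fetcher.BrowserFetcher
import it.skrape.fetcher.response
import it.skrape.fetcher.skrape

fun main(args: Array<String>) {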
    skrape(BrowserFetcher) {
        request {
            url = "https://fundevstudio.notion.site/524eafbfa8f2414898d6d8d79f222c05?pvs=4"
        }

        response {
            htmlDocument {
                println("title: $titleText")
            }
        }
    }
}

[expect] title: 인사이트 플로우 가이드
[but] title: Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

If this is not possible at the moment, please consider providing a waitUntil option similar to the one in Playwright and Puppeteer, with values such as load, networkidle, and domcontentloaded.
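To make the waitUntil idea above concrete, here is a minimal sketch of the underlying "poll until ready" pattern in plain Kotlin. It is not part of skrape.it: pollUntil, its parameters, and the stand-in fetch are hypothetical names made up for illustration, and a real version would need fetch to re-read the title from a page that is still live in the browser.

```kotlin
// Hypothetical helper, NOT a skrape.it API: calls `fetch` repeatedly until
// `isReady` accepts the result, or returns null once the timeout has elapsed.
fun <T> pollUntil(
    timeoutMillis: Long = 10_000,
    intervalMillis: Long = 500,
    isReady: (T) -> Boolean,
    fetch: () -> T
): T? {
    val deadline = System.nanoTime() + timeoutMillis * 1_000_000
    while (true) {
        val result = fetch()
        if (isReady(result)) return result
        if (System.nanoTime() >= deadline) return null
        Thread.sleep(intervalMillis)
    }
}

fun main() {
    // Stand-in for reading the title of a live page: the first two reads
    // return the placeholder title, the third returns the rendered one.
    val titles = listOf("Notion placeholder", "Notion placeholder", "User guide")
    var calls = 0
    val title = pollUntil(
        timeoutMillis = 2_000,
        intervalMillis = 10,
        isReady = { t: String -> !t.startsWith("Notion") },
        fetch = { titles[minOf(calls++, titles.lastIndex)] }
    )
    println("title: $title")
}
```

With a built-in option, the fetcher would do this kind of waiting internally, so the response block would only run once the client-side rendering has finished.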

pistolcaffe commented 1 year ago

When using HtmlUnit directly, I got the following exception:

net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: identifier is a reserved word: class (https://fundevstudio.notion.site/8402-8521e6e24e557272e4c0.js#1)

Since HtmlUnit uses an outdated Rhino-based JavaScript engine that apparently cannot parse newer syntax such as class, I think we may need to consider moving to a V8-based engine or something similar.

Of course, it is only speculation that this engine exception is the direct cause. If I find any additional information, I will add a comment.