skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

Add support for integration of custom clients #96

Closed gregorbg closed 4 years ago

gregorbg commented 4 years ago

This is a proposal for an open API that allows routing any HTTP traffic through your own custom client implementation.

The idea is that any external client has to implement the remodeled fetcher interface (Fetcher below). This interface is now generic over the request class:

interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}
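To make the contract concrete, here is a minimal sketch of what an implementation could look like. It is not code from this PR: Result is a simplified stand-in for skrape{it}'s own result type, and SimpleRequest and InMemoryFetcher are hypothetical names invented for illustration.

```kotlin
// Simplified stand-in for skrape{it}'s result type (fields are assumptions).
data class Result(val statusCode: Int, val body: String)

// The interface as proposed above.
interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}

// Hypothetical request type owned entirely by the custom client.
data class SimpleRequest(
    var url: String = "http://localhost",
    var method: String = "GET"
)

// Hypothetical fetcher that serves pages from a map instead of the network.
class InMemoryFetcher(private val pages: Map<String, String>) : Fetcher<SimpleRequest> {

    // A getter returns a fresh default request on every access, so
    // modifications made by one caller never leak into the next.
    override val requestBuilder: SimpleRequest
        get() = SimpleRequest()

    override fun fetch(request: SimpleRequest): Result {
        val body = pages[request.url]
        return if (body != null) Result(200, body) else Result(404, "")
    }
}

fun main() {
    val fetcher = InMemoryFetcher(mapOf("http://example.test" to "<h1>hi</h1>"))
    // Take the default request, adjust it, then hand it back for execution.
    val request = fetcher.requestBuilder.apply { url = "http://example.test" }
    println(fetcher.fetch(request))
}
```

Because the request type is mutable here, backing requestBuilder with a getter rather than a stored instance seems like the safer choice, but that is a design assumption and not something the proposal prescribes.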

With so many different paradigms for building requests, skrape{it} would quickly paint itself into a corner when trying to "unify" them all under one common "request" interface. Instead, we make a virtue of that diversity and let the user decide.

The typical skrape DSL will change as follows:

data class MyOwnFancyRequest(var amazingUrl: String, var astonishingMethod: HttpWowMethod, var whoNeedsProxiesAnyways: DumbProxy? = null)

class MyCoolCustomFetcherAdapter(val foo: FooConfig, val bar: BarSettings): Fetcher<MyOwnFancyRequest> // implementation here...

val fetcher: Fetcher<MyOwnFancyRequest> = MyCoolCustomFetcherAdapter(foo, bar)

val scraped = skrape(fetcher) {
    request {
        // you're operating in the scope of "MyOwnFancyRequest" here
        amazingUrl = "omg://wow.lol.xd"
    }

    extract {
        // same as before
    }
}

Essentially, this reduces the logic down to "dear fetcher, please give me a default request (val requestBuilder) that I may or may not modify in the DSL" and "dear fetcher, take this (optionally modified) request and execute it".
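The two-step handshake described above can be sketched as a tiny DSL. This is only an illustration of the flow under stated assumptions, not the actual skrape{it} implementation: Scraper, EchoRequest, and EchoFetcher are hypothetical names, and Result is again a simplified stand-in.

```kotlin
// Simplified stand-in for skrape{it}'s result type (fields are assumptions).
data class Result(val statusCode: Int, val body: String)

interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}

// Hypothetical DSL scope tying a fetcher to one prepared request.
class Scraper<T>(private val fetcher: Fetcher<T>) {
    // Step 1: "dear fetcher, please give me a default request".
    private val preparedRequest: T = fetcher.requestBuilder

    // Optional step: modify the request in the scope of its own type.
    fun request(init: T.() -> Unit) {
        preparedRequest.init()
    }

    // Step 2: "dear fetcher, take this request and execute it".
    fun <R> extract(extractor: Result.() -> R): R =
        fetcher.fetch(preparedRequest).extractor()
}

// Entry point mirroring the shape of skrape(fetcher) { ... } shown earlier.
fun <T, R> skrape(fetcher: Fetcher<T>, init: Scraper<T>.() -> R): R =
    Scraper(fetcher).init()

// Hypothetical request and fetcher used only to exercise the sketch.
data class EchoRequest(var url: String = "http://localhost")

class EchoFetcher : Fetcher<EchoRequest> {
    override val requestBuilder: EchoRequest
        get() = EchoRequest()

    override fun fetch(request: EchoRequest) = Result(200, "fetched ${request.url}")
}

fun main() {
    val scraped = skrape(EchoFetcher()) {
        request { url = "http://example.test" }
        extract { body }
    }
    println(scraped)
}
```

Note that the DSL layer never inspects the request type. It only passes the value from requestBuilder through the user's lambda and back into fetch, which is what lets each client keep its own request model.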

The existing BrowserFetcher and HttpFetcher implementations have been adapted to the suggested structure.

BREAKING CHANGES

Any thoughts appreciated! :D