A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server- and client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It is primarily intended as a testing library, but can also be used to scrape websites in a convenient fashion.
This is a proposal for an open API that allows routing any HTTP traffic through your own custom client implementation.
The idea is that any external client has to fulfill the (remodeled) `HttpFetcher` interface. In particular, this interface is now typed based on the request class, as follows:
```kotlin
interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}
```
The idea is that, with so many different paradigms of how to build requests, skrape{it} would quickly paint itself into a corner when trying to "unify" them all under one common "request" interface. Instead, we make a virtue out of diversity and let the user decide.
The typical `skrape` DSL will change as follows:
```kotlin
data class MyOwnFancyRequest(
    var amazingUrl: String,
    var astonishingMethod: HttpWowMethod,
    var whoNeedsProxiesAnyways: DumbProxy? = null
)

class MyCoolCustomFetcherAdapter(val foo: FooConfig, val bar: BarSettings) : Fetcher<MyOwnFancyRequest> {
    // implementation here...
}

val fetcher: Fetcher<MyOwnFancyRequest> = MyCoolCustomFetcherAdapter(foo, bar)

val scraped = skrape(fetcher) {
    request {
        // you're operating in the scope of "MyOwnFancyRequest" here
        amazingUrl = "omg://wow.lol.xd"
    }
    extract {
        // same as before
    }
}
```
Essentially, this reduces the logic down to "dear fetcher, please give me a default request (`val requestBuilder`) that I may or may not modify in the DSL" and "dear fetcher, take this (optionally modified) request and execute it".
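To make that two-step contract concrete, here is a minimal, self-contained sketch of how an entry point could wire `requestBuilder` and `fetch` together. `Result`, `Scraper`, and this particular `skrape()` signature are illustrative stubs under these assumptions, not the actual skrape{it} implementation:

```kotlin
data class Result(val body: String) // stand-in for skrape{it}'s real Result type

interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}

class Scraper<T>(private val fetcher: Fetcher<T>, private val preparedRequest: T) {
    // "give me a default request that I may or may not modify"
    fun request(init: T.() -> Unit) = preparedRequest.init()

    // "take this (optionally modified) request and execute it"
    fun <R> extract(extractor: Result.() -> R): R =
        fetcher.fetch(preparedRequest).extractor()
}

fun <T, R> skrape(fetcher: Fetcher<T>, init: Scraper<T>.() -> R): R =
    Scraper(fetcher, fetcher.requestBuilder).init()
```

Because `skrape {}` asks the fetcher itself for the default request, the `request {}` block can be typed to whatever request class the fetcher declares, with no shared "request" superinterface needed.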
The existing `BrowserFetcher` and `HttpFetcher` implementations have been adapted to the suggested structure.
BREAKING CHANGES
`mode` obviously doesn't exist anymore.
All request configuration calls have to be wrapped in a `request {}` block (otherwise it would be impossible to infer the fetcher's request type at the top level of `skrape {}`).
Every `skrape {}` call has to be passed its own client.
ToDo
Try to optimise the existing implementations based on this new infrastructure. Maybe distinguish between client-global configuration (`sslRelaxed`, proxy) and per-request information (URL, HTTP verb, etc.). This way, we wouldn't have to spin up a new client for each request.
Create separate Maven artifacts for the custom implementations
Update readme (uaaah....)
More tests??
Coroutines? `fun fetch(request: T)` could be `suspend`, but then we would force the user to execute the top-level `skrape {}` DSL from within a coroutine context.
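For illustration, here is a sketch of the suspend variant from the last ToDo item. `SuspendFetcher`, this `skrape()` shape, and the stub `Result` are invented for the example; the point is only that the `suspend` modifier propagates all the way up to the DSL entry point:

```kotlin
import kotlinx.coroutines.runBlocking

data class Result(val body: String) // stand-in for the real Result type

interface SuspendFetcher<T> {
    suspend fun fetch(request: T): Result
    val requestBuilder: T
}

// skrape itself must now be suspend, passing the requirement on to the caller.
suspend fun <T, R> skrape(
    fetcher: SuspendFetcher<T>,
    request: T.() -> Unit,
    extract: Result.() -> R
): R = fetcher.fetch(fetcher.requestBuilder.apply(request)).extract()
```

A caller that is not already inside a coroutine would have to write `runBlocking { skrape(...) }`, which is exactly the ergonomic cost weighed above.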
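Going back to the first ToDo item (reusing one client across requests), the split could look roughly like the following self-contained sketch. `ClientConfig`, `Request`, and `ReusableFetcher` are invented names, and the string-valued `client` stands in for a real HTTP client; the point is only that the expensive client is built once per fetcher instead of once per request:

```kotlin
data class ClientConfig(val sslRelaxed: Boolean = false, val proxy: String? = null)
data class Request(var url: String = "", var method: String = "GET")
data class Result(val body: String)

interface Fetcher<T> {
    fun fetch(request: T): Result
    val requestBuilder: T
}

class ReusableFetcher(config: ClientConfig) : Fetcher<Request> {
    // Stand-in for a real HTTP client; constructed once from the client-global config.
    private val client = "client(sslRelaxed=${config.sslRelaxed}, proxy=${config.proxy})"

    override val requestBuilder get() = Request()

    // Only per-request data varies between calls; the underlying client is reused.
    override fun fetch(request: Request) = Result("$client -> ${request.method} ${request.url}")
}
```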
Any thoughts appreciated! :D