skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License
813 stars 59 forks source link

Multiplatform #209

Open McDjuady opened 1 year ago

McDjuady commented 1 year ago

I've been working over the last few months to create a multiplatform version for skrape.it It's somewhat in line with #196 but I'm a bit further along in some areas, which is why i wanted to get this out. So far I've converted the buildscripts to multiplatform and implemented some modules in JS. This is still pretty much WIP and I intend to keep working on it. I'll update the pull request as I get further along and improve the code

What's done so far:

What still needs to be done:

Other notable changes:

christian-draeger commented 1 year ago

Wow this is great news 🤩 Really looking forward to this.

Let me know if there are any questions that can maybe help to understand internals of the Lib if needed.

I personally have no experience so far with creating multiplattform Libs in kotlin. In my daily life I am very much into using kotlin on the jvm/server-side. But I think skrape{it} would be very benificial in the kotlin js world 🙂

I am really glad if skrape{it} could become a multiplatform Lib 🥳

McDjuady commented 1 year ago

My main question right now would be if multiple fetchers are really necessary. To keep things simple I'd like to use ktor as a client for any fetch operations. However this implies, that fetching can only be done via a suspend function on non JVM platforms, since JVM is the only target to support runBlocking. This would remove HttpFetcher since it would become redundant. JsExecution is a feature of http-parser, which falls back onto BrowserFetcher on JVM, in JS it's rendered directly using JSDOM. It would make sense to move the functionality for JVM to http-praser as well, which would deprecate BrowserFetcher, leaving only AsyncFetcher, which could then be migrated into Base Fetcher to leave only one fetcher implementation. My Plan would be to create on AsyncFetcher supporting jsExecution and with JVM specific extension functions providing blocking calls if needed.

I geuss the question is if that is okay with you and if that's a direction you feel comfortable with?

christian-draeger commented 1 year ago

Hey, sorry for late response.

My main question right now would be if multiple fetchers are really necessary.

the different fetchers (beside BrowserFetcher which has a little more special purpose) only exists because of convenience for the users to either run blocking or non-blocking.

To keep things simple I'd like to use ktor as a client for any fetch operations. However this implies, that fetching can only be done via a suspend function on non JVM platforms, since JVM is the only target to support runBlocking. This would remove HttpFetcher since it would become redundant.

if this could be simplified and as you mentioned "fetching can only be done via a suspend function on non JVM platforms" anyway i would apprechiate simplification (reducing it to 1) of the fetchers a lot as long as we provide an intuitive way to still make blocking fetches on the JVM possible. either by at least documenting that wrapping calls in a runBlocking scope is necessary or preferred provide something like a skrapeBlocking method to the DSL.

just a quick idea somewhat inspired by what kohttp dsl is doing here, but i am open for other ideas as well: Synchronous calls: https://kohttp.gitbook.io/docs/core/synchronous-calls/get Asynchronous calls: https://kohttp.gitbook.io/docs/core/asynchronous-calls/async-get

To keep things simple I'd like to use ktor as a client for any fetch operations

i am totally fine to switch to ktor client completely as a build-in/default as long if we still provide a way for users to implement custom fetchers to keep the lib flexible and not completely dependent on ktor. i am not deep into ktor client, if it is possible to implement custom ktor clients anyway that would be good enough.

JsExecution is a feature of http-parser, which falls back onto BrowserFetcher on JVM, in JS it's rendered directly using JSDOM. It would make sense to move the functionality for JVM to http-praser as well, which would deprecate BrowserFetcher, leaving only AsyncFetcher, which could then be migrated into Base Fetcher to leave only one fetcher implementation.

That sounds resonable to me and would again simplify the code base a lot =) i like!

Last but not least the resaon why Browserfetcher is an own fetcher but also part of the html-parser

Browserfetcher uses html-unit to act as a JVM based headless browser in order to be able to render pages that include JS correctly. it makes the actual request using html-unit as well as rendering the response in this embedded headless browser. That in turn means that if you would fetch a page that includes conditional rendering on some really browser specific things as request headers like authorization, set-cookie or user-agent etc just works as you would expect using a "real" browser. BrowserFetcher.render() that is currently used in the html-parser on the other hand just renders a html string without request information available. calling WebClient.withOptions(request: Request) extension function on the com.gargoylesoftware.htmlunit.WebClient inside of BrowserFetchers .render() function could probably fix align the behaviour of BrowserFetcher.render and BrowserFetcher.renderparse. But since skrape{it} supports parsing html from string you don't have a request when just parsing a html string. If that can be reworked somehow to maybe add the request options to html-parsers jsExecution mode if available that should be fine. on he other hand i have to admit that it is only there for a super specific use-case that is maybe not worth to maintain. from that point of view i would be fine with a simplistic solution that drops BrowserFetcher and let html-parser render html with or without js execution :) If we could still support adding request data to that rendering this would be nice but not a must. Is the described point JVM related only or also a deal with JSDOM?

I geuss the question is if that is okay with you and if that's a direction you feel comfortable with?

overall absolutely =)

McDjuady commented 1 year ago

Thanks for clearing all that up! I think most of what you are describing should be possible using just the Ktor client. The hardest part will probably be to get both Ktor and HtmlUnit to use the same client for requesting any additional resources that might come up when running JS on a website (I'll probably write some wrapper classes so HtmlUnit uses the ktor client). As for JSDOM it's pretty similar. With a few wrappers it'll probably be possible to have a single client to make all the requests.

It might even be possible to replace Skrape.it's own Request class with the Ktor Request builder, since they both serve the same purpose.

Could you maybe provide some tests in which the BrowserFetcher specific features are used so I can make sure it still fulfills those requirements?

codecov[bot] commented 1 year ago

Codecov Report

Merging #209 (53ccb08) into master (956e21d) will decrease coverage by 83.18%. The diff coverage is 0.00%.

:exclamation: Current head 53ccb08 differs from pull request most recent head 90ae18d. Consider uploading reports for the commit 90ae18d to get more accurate results

@@            Coverage Diff             @@
##           master    #209       +/-   ##
==========================================
- Coverage   83.18%   0.00%   -83.17%     
==========================================
  Files          39      40        +1     
  Lines        1141    1173       +32     
  Branches      173     145       -28     
==========================================
- Hits          949       0      -949     
- Misses        131    1173     +1042     
+ Partials       61       0       -61     
Impacted Files Coverage Δ
...commonMain/kotlin/it/skrape/matchers/Assertions.kt 0.00% <ø> (ø)
...in/it/skrape/matchers/ContentTypeHeaderMatchers.kt 0.00% <0.00%> (ø)
...c/commonMain/kotlin/it/skrape/matchers/Matchers.kt 0.00% <ø> (ø)
...onMain/kotlin/it/skrape/matchers/StatusMatchers.kt 0.00% <0.00%> (ø)
...c/jvmMain/kotlin/it/skrape/fetcher/AsyncFetcher.kt 0.00% <0.00%> (ø)
...src/jvmMain/kotlin/it/skrape/fetcher/extensions.kt 0.00% <0.00%> (ø)
...monMain/kotlin/it/skrape/fetcher/Authentication.kt 0.00% <0.00%> (ø)
.../src/commonMain/kotlin/it/skrape/fetcher/Cookie.kt 0.00% <0.00%> (ø)
...src/commonMain/kotlin/it/skrape/fetcher/Request.kt 0.00% <0.00%> (ø)
.../src/commonMain/kotlin/it/skrape/fetcher/Result.kt 0.00% <ø> (ø)
... and 66 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

christian-draeger commented 1 year ago

Hey, Again, sorry for the late reply. We are currently working towards a deadline at my job and this means that I unfortunately don't have as much time for open source as I would like to have 🥲

But anyway I took a week off in mid November and will try to provide some tests with the browserfetcher specific features :)

With a few wrappers it'll probably be possible to have a single client to make all the requests.

That sounds absolutely fantastic 🙂

It might even be possible to replace Skrape.it's own Request class with the Ktor Request builder, since they both serve the same purpose.

Tbh I like the look and feel of skrape{it}s request dsl more than ktors. But we could map skrape{it} request dsl to ktor request object internally

McDjuady commented 1 year ago

Hi!

So I've been working along on the multiplatform/JS implementation of skrape.it. It's fully functional on JS now and I'd say I'm quite close to completion. The major missing feature right now is Proxy support on JS.

Because of some of the (breaking) changes I've made this would probably be better suited for a major realease. Here is a (non inclusive) list of what I've changed since the last time:

Breaking Changes

Notable Changes

Known issues and limitations

TODOs

If you have the time I'd be very happy if you could look the changes over and give me some feedback if you are happy with the direction this is going. Also there seems to be an issue with the skrape.it documentation site, where it's impossible to fetch it from JS. Cloudflare just returns 500 with an error page. You can try this yourself with postman or simply try to fetch the page in node. Maybe you could look into that since there might be some information in the server logs as to why it's happening.

Cheers Max

christian-draeger commented 1 year ago

hey @McDjuady this is such awesome news. a really really big thank you for all the your effort so far. you rock!

first of all don't worry because of breaking changes. we will release a new major version :) if smooth migration is possible thats good, if not also ok as long as we write good docs.

i will have a look at the code at latest until monday. job is much more chilled again currently^^

PS: skrape.it docs are hosted via gitbook.com, can not see or do much over there since its SaaS. side-note: i thought about having a better documentation site. i played around with kotlin-playground) in the past. the idea was to have kind of an interactive docu like the official kotlin docs where you can try out code snippets directly. i wanted to work on that again soon, therefore i wouldn't like to invest to much time in the old docs setup / want only add documentation there.

christian-draeger commented 1 year ago

again sorry for late response. i just reviewed what you did so far. looks good to me 👍

russellbanks commented 1 year ago

Any update on this?

christian-draeger commented 1 year ago

Hey @McDjuady, I know it's been a while. But still I want to kindly ask if you would be open to finish the PR? 🙂