skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License
789 stars, 57 forks

Add a fetcher that uses a real Chrome browser to download the html #237

Open johanoskarsson opened 4 months ago

johanoskarsson commented 4 months ago

Adds a new Fetcher that uses a real Chrome browser to fetch the HTML. This solved a problem where I was unable to fetch a page that was partially generated by JavaScript using any of the existing fetchers. (I assume the page required a modern, real browser, but I did not investigate why.)

This change uses the cdt-java-client library (https://github.com/kklisura/chrome-devtools-java-client) to launch and communicate with a Chrome browser. However, due to a breaking change in Chrome that has not yet been fixed in this library, I am using a fork with that one patch applied: io.fluidsonic.mirror:cdt-java-client:4.0.0-fluidsonic-1. Hopefully the change gets merged back into the main library.
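For anyone wanting to try the fork locally, the dependency could be declared roughly as follows in a Gradle Kotlin DSL build file. This is only a sketch: the coordinates are the ones given above, but the repository the fork is published to is an assumption (Maven Central is used here as a guess).

```kotlin
// build.gradle.kts (sketch)
repositories {
    // Assumption: the fork is resolvable from Maven Central; adjust if it lives elsewhere.
    mavenCentral()
}

dependencies {
    // Fork of cdt-java-client with the Chrome compatibility patch, as named in this PR.
    implementation("io.fluidsonic.mirror:cdt-java-client:4.0.0-fluidsonic-1")
}
```

Once the fix lands upstream, the coordinates would presumably be switched back to the main library.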

WIP warning: I figured I would publish this PR in its current state in case it helps anyone else. However, it does not yet fulfill all the expectations of a fetcher: it returns only the body, not the correct HTTP status and other response metadata. There is a Network class that can probably be used to extract those.
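One possible shape for filling in the missing status is sketched below. It is library-independent: `FetchedPage`, `ObservedResponse` and `ResponseCollector` are hypothetical names invented for illustration, not types from skrape.it or cdt-java-client. The assumption is that a listener on the Network class could report each response to a small collector, which then pairs the document's status and headers with the rendered body.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Hypothetical holder for what a fetcher is expected to return;
// skrape.it's own result type may look different.
data class FetchedPage(
    val status: Int?,                 // null when no document response was observed
    val headers: Map<String, String>,
    val body: String,
)

// Hypothetical snapshot of one network response reported by the browser.
data class ObservedResponse(
    val url: String,
    val status: Int,
    val headers: Map<String, String>,
    val isDocument: Boolean,          // true for the page itself, false for scripts, images, etc.
)

// Collects response events and remembers the most recent document response.
class ResponseCollector {
    private val document = AtomicReference<ObservedResponse?>(null)

    // Intended to be called from a network listener, possibly on another thread.
    fun onResponse(response: ObservedResponse) {
        if (response.isDocument) document.set(response)
    }

    // Combines the rendered HTML with the recorded status and headers.
    fun toFetchedPage(body: String): FetchedPage {
        val doc = document.get()
        return FetchedPage(doc?.status, doc?.headers ?: emptyMap(), body)
    }
}

fun main() {
    val collector = ResponseCollector()
    collector.onResponse(
        ObservedResponse("https://example.com/", 404, mapOf("Content-Type" to "text/html"), isDocument = true)
    )
    collector.onResponse(
        ObservedResponse("https://example.com/app.js", 200, emptyMap(), isDocument = false)
    )

    val page = collector.toFetchedPage("<html><body>Not found</body></html>")
    println(page.status) // 404
}
```

Keeping the collector separate from the browser wiring would let the status handling be unit-tested without launching Chrome.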

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.56%. Comparing base (382f21b) to head (475065d).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master     #237   +/-   ##
=======================================
  Coverage   89.56%   89.56%
=======================================
  Files          38       38
  Lines         986      986
  Branches       69       69
=======================================
  Hits          883      883
  Misses         81       81
  Partials       22       22
```