skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License
815 stars 61 forks source link

Unify `selects` package document hierarchy to allow for nested queries #88

Closed gregorbg closed 4 years ago

gregorbg commented 4 years ago

[EDIT] Removed the DocElement hack, still need to debate the underlying question though.

Fixes #86, #79

This PR aims to introduce a compact class hierarchy allowing for nested/recursive CSS selections. The outwards API should not change and all the currently existing tests pass without modification.

General note: The names of the newly introduced classes are totally open for debate. Especially for DomTreeElement I'm struggling to find anything better.

Class design

JSoup / DOM

The former Doc and DocElement classes are now both inheriting from a new DomTreeElement class based on the fact that both classes expose JSoup Elements.

The common DomTreeElement parent class exposes the relevant fraction of these JSoup APIs, whereas CSS selection methods have been refactored into another base class (see below).

CSSSelector

The CSS selector should be able to select children without caring about whether this is the root document node or some nested child we're selecting from. Consequently, a new abstract CssSelectable class was introduced to express "anything that can yield result nodes based on a CSS selector query".

Interestingly enough, CssSelector itself is now a CssSelectable allowing quick, type-safe nesting in the DSL without immediate findFirst or findAll wrappers. Therefor, all selectors in the html5 package have also been changed to extend CssSelectable, regardless of whether it's actually a Doc, DocElement or even CssSelector in reality.

Neat side effects

  1. CssSelector does not need to hold a reference to the root document and construct one big query based on the root that selects nested files. It only holds the small part of the document that is within its current scope -- without knowing how deep in the scope that actually is. As a consequence, the full CSS selector needs to be compiled recursively on demand by asking the CssSelector's subject for its own "parent CSS selector" and appending its own CSS selector portion to that. Not particulary resource-intensive anyways. but important to mention.
  2. The findAll / findFirst / findSecond / etc. methods are now unified under CssSelectable! :tada:

Open questions

  1. In order to satisfy current testing setup, we have to "short-circuit" css selector creation for the root document node (i.e. have it return an empty string). JSoup would however normally try to compose a "meaningful" CSS selector even on root level -- compare skrape.it Doc#toCssSelector and JSoup Element#cssSelector. Which path should this project go down moving forward?
  2. findAll and findFirst are used as infix functions in some parts of the code. Kotlin only allows that with single-parameter methods that do not have a default value. For user convenience, would it make sense to ever want to skip these methods' parameters with a default value?
  3. DocElement currently has some "not-clearly-defined" behaviour that we need to discuss. Say you have two nested, identical tags <div> and you perform a simple div CSS selector lookup on the outer one. JSoup then returns results with the outer and the inner one -- this conflicts with skrape.it's idea of nesting that requires only "true" child nodes (i.e. without reflexive self-reference) to be returned. Which definition should take preference here?

Happy reviewing! :D

gregorbg commented 4 years ago

Also, ExperimentalDslTest fails on my machine due to not being able to fetch the Docker image. Is this a known issue?

gregorbg commented 4 years ago

I've done a minor refactor to better underline my point. As you can see, the DomTreeElement now queries all CSS selectors by children() (meaning without reflexive self-reference) and the tests do check out.

However, executing the query that way is a conscious choice and I think before merging you should give your explicit OK that this is a path you want to follow.

christian-draeger commented 4 years ago

I like the way how the connection to JSoup classes is separated more clearly than before. the type-safe nesting of elements without immediate findFirst or findAll wrappers is just awesome 🥇 :) That was an ache that bothered me for a long time, so I'm glad that's solved.

Regarding the names of the newly introduced classes im totally fine with that I in my opinion it's a good and concise naming.

Question#1: tbh I think it would be fine to return an empty string as root selector and for my understanding it would be more intuitive but I'm totally open for a discussion.

Question#2: What meaningful defaults would you want to put here?

Question#3: I would like to stick to the idea of nesting that as you said requires only "true" child nodes (i.e. without reflexive self-reference) to be returned. I think this is the most idiomatic way and how people would expect a DSL to work.

I'm going to merge this and want to say big big thanks for your contribution. you rock! 👍 if you are interested I would love to have you in the boat to discus further decisions regarding future features or architectural decisions

I also added you to the list of contributers on the skrape{it} docs page: https://docs.skrape.it/docs/about-skrape-it