Closed gregorbg closed 4 years ago
Also, ExperimentalDslTest
fails on my machine due to not being able to fetch the Docker image. Is this a known issue?
I've done a minor refactor to better underline my point. As you can see, the DomTreeElement
now queries all CSS selectors by children()
(meaning without reflexive self-reference) and the tests do check out.
However, executing the query that way is a conscious choice and I think before merging you should give your explicit OK that this is a path you want to follow.
I like the way how the connection to JSoup classes is separated more clearly than before. the type-safe nesting of elements without immediate findFirst or findAll wrappers is just awesome 🥇 :) That was an ache that bothered me for a long time, so I'm glad that's solved.
Regarding the names of the newly introduced classes im totally fine with that I in my opinion it's a good and concise naming.
Question#1: tbh I think it would be fine to return an empty string as root selector and for my understanding it would be more intuitive but I'm totally open for a discussion.
Question#2: What meaningful defaults would you want to put here?
Question#3: I would like to stick to the idea of nesting that as you said requires only "true" child nodes (i.e. without reflexive self-reference) to be returned. I think this is the most idiomatic way and how people would expect a DSL to work.
I'm going to merge this and want to say big big thanks for your contribution. you rock! 👍 if you are interested I would love to have you in the boat to discus further decisions regarding future features or architectural decisions
I also added you to the list of contributers on the skrape{it} docs page: https://docs.skrape.it/docs/about-skrape-it
[EDIT] Removed the
DocElement
hack, still need to debate the underlying question though.Fixes #86, #79
This PR aims to introduce a compact class hierarchy allowing for nested/recursive CSS selections. The outwards API should not change and all the currently existing tests pass without modification.
General note: The names of the newly introduced classes are totally open for debate. Especially for
DomTreeElement
I'm struggling to find anything better.Class design
JSoup / DOM
The former
Doc
andDocElement
classes are now both inheriting from a newDomTreeElement
class based on the fact that both classes exposeJSoup
Element
s.The common
DomTreeElement
parent class exposes the relevant fraction of theseJSoup
APIs, whereas CSS selection methods have been refactored into another base class (see below).CSSSelector
The CSS selector should be able to select children without caring about whether this is the root document node or some nested child we're selecting from. Consequently, a new abstract
CssSelectable
class was introduced to express "anything that can yield result nodes based on a CSS selector query".Interestingly enough,
CssSelector
itself is now aCssSelectable
allowing quick, type-safe nesting in the DSL without immediatefindFirst
orfindAll
wrappers. Therefor, all selectors in thehtml5
package have also been changed to extendCssSelectable
, regardless of whether it's actually aDoc
,DocElement
or evenCssSelector
in reality.Neat side effects
CssSelector
's subject for its own "parent CSS selector" and appending its own CSS selector portion to that. Not particulary resource-intensive anyways. but important to mention.findAll
/findFirst
/findSecond
/ etc. methods are now unified underCssSelectable
! :tada:Open questions
Doc#toCssSelector
and JSoupElement#cssSelector
. Which path should this project go down moving forward?findAll
andfindFirst
are used asinfix
functions in some parts of the code. Kotlin only allows that with single-parameter methods that do not have a default value. For user convenience, would it make sense to ever want to skip these methods' parameters with a default value?DocElement
currently has some "not-clearly-defined" behaviour that we need to discuss. Say you have two nested, identical tags<div>
and you perform a simplediv
CSS selector lookup on the outer one. JSoup then returns results with the outer and the inner one -- this conflicts with skrape.it's idea of nesting that requires only "true" child nodes (i.e. without reflexive self-reference) to be returned. Which definition should take preference here?Happy reviewing! :D