Discussion on implementing selectolax support

scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

BSD 3-Clause "New" or "Revised" License

1.12k stars 144 forks source link

Discussion on implementing selectolax support #239

Open deepakdinesh1123 opened 2 years ago

deepakdinesh1123 commented 2 years ago

Here are some of the changes I thought of implementing

High level changes -

Selector class takes a new argument "parser" which indicates which parser backend to use (lxml or selectolax).
Selectolax itself provides two backends Lexbor and Modest by default it uses the Modest backend. Should additional support for lexbor be added? We could use modest by default and have the users pass an argument if they want to use lexbor
If the "parser" argument is not provided lxml will be used by default, since I thought it preserves the current behavior and allows backward support. It also allows the test suite to be used without changes to all the existing methods.
If the xpath method is called on a selector instantiated with selectolax as parser raise NotImplementedError.

Low level changes -

Add selectolax to the list of parsers in _ctgroup and modify create_root_node to instantiate the selected parser with the provided data.
Modify the xpath and css methods behavior to use both selectolax and lxml or write separate methods or classes to handle them.
Utilize HTMLParser class in Selectolax and its css method to apply the css expression specified and return the data collected.
Create a Selectorlist with Selector objects created with the type and parser specified.

This is still a work in progress and I will make a lot of changes, Please suggest the changes that need to made to the current list

Gallaecio commented 2 years ago

Your plan so far is looking good to me.

Some thoughts:

I assume the supported CSS Selector syntax supported by different parsers will be different. I think we will need to document these differences in as much detail as possible. Maybe those differences could guide further improvements to https://github.com/scrapy/cssselect.
I think changes in https://github.com/scrapy/parsel/pull/181 may be interesting to take into account. https://github.com/scrapy/parsel/pull/181 is a pull request to add JSON support to Parsel, and when designing your approach to multi-parser support you should also contemplate the possibility that in the future (possibly even before selectolax support is implemented) Parsel may get this other feature.
Will all Selector methods other than xpath be implementable with the new parser, or will we need to make more compromises?
What should be the behavior when XML input is passed and the selected parser is selectolax? Does it support arbitrary XML, or only XHTML5?

deepakdinesh1123 commented 2 years ago

JMESPath and JSON support -

since #237 is also open should I account for it too and modify the API to allow selection of jsonpath or jmespath, or should I just just consider jmespath for now?
If both parsers are to be supported, it might worth looking into a better way of implementing parser specific behavior than just adding conditions to the existing methods as these parsers are very different than lxml and selectolax.
Also additional changes need to be made to both #181 and #237 if both of them are considered.
If only jmespath is to be considered for now, then support for it can be added by raising an Exception if a "parser" argument is provided for a Selector of type json.
Currently the text argument in Selector only takes string arguments but since JSON data can also be stored in dictionaries, should users be allowed to pass a dictionary as an argument to Selector?

The following methods may not be supported while using selectolax as parser -

register_namespace, remove_namespace - since selectolax explicitly specifies that it is a HTML5 parser it does not support any XML or XHTML5 features so registering and removing namespace is not supported
remove - selectolax does offer support to remove particular tags in the provided data but it removes all the instances of the specified tags so deleting just one instance of a particular tag is not possible.

XML and XHTML5:

I experimented with most of the css queries with sample XML and XHTML5 data and most of the queries work as they are supposed to.
But namespaced XML does not work and selectolax does not have any methods to remove them.
Should parsing xml and xhtml5 still be supported indicating to the user that namespaced XML or XHTML5 should not be passed? I think this could be avoided as selectolax explicitly specifies that it is a HTML5 parser.

Things that I want to add I to the list of changes:

The existing behavior of non standard CSS selectors will remain the same, since selectolax does not support them required code shall written to do so.
attrib method should be modified fetch all the attributes in a selected object returned by selectolax.

@Gallaecio Can you please review this and suggest the changes that I have to make?

Gallaecio commented 2 years ago

I did not mean for you to incorporate support for JSON to your project. I meant for you to review their pull requests to understand how they are implemented, in case that can inform your implementation of Selectolax support, i.e. so that you do not plan on implementing Selectolax in a way that will make it hard to support JSON in the future without backward-incompatible API changes.
Good analysis of methods with different behaviors. How do you plan to handle them? I am thinking maybe NotImplementedError for the namespace ones. For remove, I am not sure if I understand the differences between lxml and selectolax completely, but it sounds like maybe remove should work with SelectorList but not with Selector.
Regarding XML and XHTML5, as long as things “fail as expected” (e.g. raise an error is content and parser are not a good fit, instead of accepting the input but selectors returning no match), I think we are good.
The existing behavior of non standard CSS selectors will remain the same, since selectolax does not support them required code shall written to do so.

Have you checked how this can be implemented? Does selectolax support extending their CSS syntax in any way? Does that mean that we could technically add XPath 1.0 support to selectolax (even though that would probably be out of the scope of your project)?
You talk about attrib, but seeing selectolax has node.attributes I imagine that part will be trivial to implement. However, I see now that selectolax has node.text(), whereas in Parsel we use text() in XPath and non-standard ::text in CSS. I wonder what approach we can take to allow text extraction with selectolax as parser. And text extraction is probably the single most important aspect of extraction in a web scraping context.

deepakdinesh1123 commented 2 years ago

I assume the supported CSS Selector syntax supported by different parsers will be different. I think we will need to document these differences in as much detail as possible. Maybe those differences could guide further improvements to https://github.com/scrapy/cssselect.

I experimented with both lxml and selectolax and got the following results

Selector LXML selectolax

::first-letter x x

::first-line x x

:lang(language) ✓ x

:optional x ✓

::placeholder x x

:read-only x ✓

:read-write x ✓

:required x ✓
maybe remove should work with SelectorList but not with Selector

In my previous comment I had written that

remove - selectolax does offer support to remove particular tags

But this was an error on my part and I should have checked the documentation thoroughly, Selectolax's Node class provides two methods Node.decompose() and Node.remove() ( Node.remove() is an alias to Node.decompose() ) which remove a particular node from the HTML tree. So remove can be supported in both SelectorList and Selector.
I looked into adding support for non-standard CSS selectors in selectolax but Selectolax mostly uses cython and I have no experience with it, so I thought of implementing the support by manually processing the query for a Selector of type Selectolax. Regarding the addition of support for XPath 1.0, I do not know if it can be added.
what approach we can take to allow text extraction with selectolax as the parser

Since only CSS selectors are supported when selectolax is used, supporting ::text for text extraction would allow similar behavior to exist between different parsers. Internally Node.html can be used to extract HTML data when ::text is not used and Node.text() can be used to extract text from selected node when ::text is used.

Selector	LXML	selectolax
::first-letter	x	x
::first-line	x	x
:lang(language)	✓	x
:optional	x	✓
::placeholder	x	x
:read-only	x	✓
:read-write	x	✓
:required	x	✓

Gallaecio commented 2 years ago

I looked into adding support for non-standard CSS selectors in selectolax [and] I thought of implementing the support by manually processing the query for a Selector of type Selectolax.

So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax? That may work for unsupported syntax that is syntax sugar, but I do not see how that would work for things that cannot be otherwise expressed with supported syntax (::text is a good example).

Since only CSS selectors are supported when selectolax is used, supporting ::text for text extraction would allow similar behavior to exist between different parsers. Internally Node.html can be used to extract HTML data when ::text is not used and Node.text() can be used to extract text from selected node when ::text is used.

I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle ::attr(href), ::text (a single expression to match any href attribute and any text)?

deepakdinesh1123 commented 2 years ago

So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax?

I was thinking of applying the CSS expression by removing the ::text in the expression and extracting the text using Node.text() on the obtained node, for ::attr(name), apply the expression without ::attr(name) and extracting the attribute using Node.attributes[name] on the obtained node.

Gallaecio commented 2 years ago

I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle ::attr(href), ::text (a single expression to match any href attribute and any text)?