Open deepakdinesh1123 opened 2 years ago
Your plan so far is looking good to me.
Some thoughts:
- Will `Selector` methods other than `xpath` be implementable with the new parser, or will we need to make more compromises?

JMESPath and JSON support -
- Raise an `Exception` if a "parser" argument is provided for a `Selector` of type `json`.
- `Selector` only takes string arguments, but since JSON data can also be stored in dictionaries, should users be allowed to pass a dictionary as an argument to `Selector`?

The following methods may not be supported while using selectolax as parser -
- `register_namespace`, `remove_namespaces` - since selectolax explicitly specifies that it is an HTML5 parser, it does not support any XML or XHTML5 features, so registering and removing namespaces is not supported.
- `remove` - selectolax does offer support to remove particular tags in the provided data, but it removes all instances of the specified tags, so deleting just one instance of a particular tag is not possible.

XML and XHTML5:
Things that I want to add to the list of changes:
- The `attrib` method should be modified to fetch all the attributes in a selected object returned by selectolax.

@Gallaecio Can you please review this and suggest the changes that I have to make?
I did not mean for you to incorporate support for JSON to your project. I meant for you to review their pull requests to understand how they are implemented, in case that can inform your implementation of Selectolax support, i.e. so that you do not plan on implementing Selectolax in a way that will make it hard to support JSON in the future without backward-incompatible API changes.
Good analysis of methods with different behaviors. How do you plan to handle them? I am thinking maybe `NotImplementedError` for the namespace ones. For `remove`, I am not sure if I understand the differences between lxml and selectolax completely, but it sounds like maybe `remove` should work with `SelectorList` but not with `Selector`.
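The `NotImplementedError` idea for the namespace methods could look roughly like the sketch below. `SelectolaxBackedSelector` is a hypothetical stand-in, not a Parsel class; only the method names mirror Parsel's `register_namespace` and `remove_namespaces`.

```python
class SelectolaxBackedSelector:
    """Hypothetical stand-in for a Selector whose parser is selectolax."""

    def __init__(self, text: str) -> None:
        self.text = text

    def _unsupported(self, method: str) -> None:
        # HTML5-only parser: there is no namespace handling to delegate to,
        # so fail loudly instead of silently doing nothing.
        raise NotImplementedError(
            f"{method}() is not supported when selectolax is the parser"
        )

    def register_namespace(self, prefix: str, uri: str) -> None:
        self._unsupported("register_namespace")

    def remove_namespaces(self) -> None:
        self._unsupported("remove_namespaces")
```

Raising explicitly keeps the failure visible to users, rather than letting a namespace call appear to succeed and then produce no matches.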
Regarding XML and XHTML5, as long as things “fail as expected” (e.g. raise an error if content and parser are not a good fit, instead of accepting the input but having selectors return no match), I think we are good.
> The existing behavior of non-standard CSS selectors will remain the same; since selectolax does not support them, the required code shall be written to do so.

Have you checked how this can be implemented? Does selectolax support extending its CSS syntax in any way? Does that mean that we could technically add XPath 1.0 support to selectolax (even though that would probably be out of the scope of your project)?
You talk about `attrib`, but seeing selectolax has `node.attributes` I imagine that part will be trivial to implement. However, I see now that selectolax has `node.text()`, whereas in Parsel we use `text()` in XPath and non-standard `::text` in CSS. I wonder what approach we can take to allow text extraction with selectolax as parser. And text extraction is probably the single most important aspect of extraction in a web scraping context.
I assume the CSS Selector syntax supported by different parsers will be different. I think we will need to document these differences in as much detail as possible. Maybe those differences could guide further improvements to https://github.com/scrapy/cssselect.
I experimented with both lxml and selectolax and got the following results:
Selector | LXML | selectolax |
---|---|---|
::first-letter | x | x |
::first-line | x | x |
:lang(language) | ✓ | x |
:optional | x | ✓ |
::placeholder | x | x |
:read-only | x | ✓ |
:read-write | x | ✓ |
:required | x | ✓ |
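One way to use these results, sketched below under assumptions, is to encode the table as a lookup and report the pseudo-selectors in a query that a given parser is not expected to handle. The `SUPPORT` dictionary is transcribed from the table; `unsupported_pseudos` is a hypothetical helper, not an existing Parsel or cssselect function.

```python
import re

# Support matrix transcribed from the table above (True = supported).
SUPPORT = {
    "::first-letter": {"lxml": False, "selectolax": False},
    "::first-line": {"lxml": False, "selectolax": False},
    ":lang": {"lxml": True, "selectolax": False},
    ":optional": {"lxml": False, "selectolax": True},
    "::placeholder": {"lxml": False, "selectolax": False},
    ":read-only": {"lxml": False, "selectolax": True},
    ":read-write": {"lxml": False, "selectolax": True},
    ":required": {"lxml": False, "selectolax": True},
}

# Matches ":name" or "::name"; any "(...)" argument is ignored.
_PSEUDO_RE = re.compile(r"::?[a-zA-Z-]+")


def unsupported_pseudos(query: str, parser: str) -> list:
    """Return pseudo-selectors in `query` that the table marks unsupported."""
    found = _PSEUDO_RE.findall(query)
    return [p for p in found if p in SUPPORT and not SUPPORT[p][parser]]
```

A check like this could back a warning or a documentation table; it only knows about the selectors listed above and says nothing about any others.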
> maybe `remove` should work with `SelectorList` but not with `Selector`

In my previous comment I had written that

> `remove` - selectolax does offer support to remove particular tags

But this was an error on my part; I should have checked the documentation thoroughly. Selectolax's `Node` class provides two methods, `Node.decompose()` and `Node.remove()` (`Node.remove()` is an alias of `Node.decompose()`), which remove a particular node from the HTML tree. So `remove` can be supported in both `SelectorList` and `Selector`.
I looked into adding support for non-standard CSS selectors in selectolax, but selectolax mostly uses Cython and I have no experience with it, so I thought of implementing the support by manually processing the query for a `Selector` of type Selectolax. Regarding the addition of support for XPath 1.0, I do not know if it can be added.
> what approach we can take to allow text extraction with selectolax as the parser

Since only CSS selectors are supported when selectolax is used, supporting `::text` for text extraction would allow similar behavior to exist between different parsers. Internally, `Node.html` can be used to extract HTML data when `::text` is not used, and `Node.text()` can be used to extract text from the selected node when `::text` is used.
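A minimal sketch of that per-node dispatch, assuming the matched nodes expose `html`, `text()` and `attributes` as selectolax's `Node` does. `FakeNode` is a stand-in used only so the example runs without selectolax installed, and `extract` and its `kind` values are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeNode:
    """Stand-in exposing the selectolax Node members used below."""

    html: str
    inner_text: str
    attributes: Dict[str, Optional[str]] = field(default_factory=dict)

    def text(self) -> str:
        return self.inner_text


def extract(node, kind: str = "node", attr: Optional[str] = None) -> Optional[str]:
    """Pick the value to return for one matched node."""
    if kind == "text":
        # Query ended in ::text
        return node.text()
    if kind == "attr":
        # Query ended in ::attr(name); a missing attribute yields None
        return node.attributes.get(attr)
    # No pseudo-element: return the node's HTML
    return node.html
```

One caveat: in Parsel, `::text` yields each direct text node as its own match, so a single `Node.text()` call per node may not give identical results, and that difference would need checking.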
> I looked into adding support for non-standard CSS selectors in selectolax [and] I thought of implementing the support by manually processing the query for a Selector of type Selectolax.

So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax? That may work for unsupported syntax that is syntax sugar, but I do not see how that would work for things that cannot be otherwise expressed with supported syntax (`::text` is a good example).
> Since only CSS selectors are supported when selectolax is used, supporting `::text` for text extraction would allow similar behavior to exist between different parsers. Internally `Node.html` can be used to extract HTML data when `::text` is not used and `Node.text()` can be used to extract text from selected node when `::text` is used.

I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle `::attr(href), ::text` (a single expression to match any `href` attribute and any text)?
> So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax?

I was thinking of applying the CSS expression with the `::text` removed and extracting the text using `Node.text()` on the obtained node; for `::attr(name)`, applying the expression without `::attr(name)` and extracting the attribute using `Node.attributes[name]` on the obtained node.
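The stripping step described above could be sketched as follows. All names here (`Extraction`, `split_groups`, `parse_group`, `plan`) are hypothetical, and the sketch only covers a trailing `::text` or `::attr(name)` in each comma-separated group. It produces a plan per group; merging matches from several groups back into document order is left open.

```python
import re
from typing import List, NamedTuple, Optional

# Trailing ::text or ::attr(name) at the end of one selector group.
_PSEUDO_RE = re.compile(r"::(text|attr\(\s*([\w-]+)\s*\))\s*$")


class Extraction(NamedTuple):
    css: str             # selector with the pseudo-element removed
    kind: str            # "node", "text" or "attr"
    attr: Optional[str]  # attribute name when kind == "attr"


def split_groups(query: str) -> List[str]:
    """Split a selector list on top-level commas only."""
    groups, current, depth = [], [], 0
    for ch in query:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            groups.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    groups.append("".join(current).strip())
    return [g for g in groups if g]


def parse_group(group: str) -> Extraction:
    """Strip a trailing pseudo-element and record what to extract."""
    match = _PSEUDO_RE.search(group)
    if match is None:
        return Extraction(group, "node", None)
    # A bare "::text" has no element part, so match any element.
    css = group[: match.start()].strip() or "*"
    if match.group(1) == "text":
        return Extraction(css, "text", None)
    return Extraction(css, "attr", match.group(2))


def plan(query: str) -> List[Extraction]:
    return [parse_group(g) for g in split_groups(query)]
```

For example, `plan("::attr(href), ::text")` gives one `attr` plan and one `text` plan, both with the selector `*`, which shows that a single expression can require different extraction rules for different groups.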
> I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle `::attr(href), ::text` (a single expression to match any `href` attribute and any text)?

Here are some of the changes I thought of implementing.

High level changes -
- The `Selector` class takes a new argument "parser" which indicates which parser backend to use (lxml or selectolax).
- If the `xpath` method is called on a selector instantiated with selectolax as parser, raise `NotImplementedError`.

Low level changes -
- `_ctgroup` and modify `create_root_node` to instantiate the selected parser with the provided data.
- Modify the behavior of the `xpath` and `css` methods to use both selectolax and lxml, or write separate methods or classes to handle them.
- Use the `HTMLParser` class in Selectolax and its `css` method to apply the CSS expression specified and return the data collected.
- Return a `SelectorList` with `Selector` objects created with the type and parser specified.

This is still a work in progress and I will make a lot of changes. Please suggest the changes that need to be made to the current list.
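The high-level changes could be sketched roughly as below. Everything here is an assumption for illustration: `SelectorSketch` and `create_root_node_sketch` are hypothetical names whose signatures do not match Parsel's real `Selector` or `create_root_node`, and the two root factories are placeholders for building an lxml tree or a selectolax `HTMLParser`.

```python
from typing import Callable, Dict


# Placeholder factories; a real implementation would parse `text` here.
def _lxml_root(text: str):
    return ("lxml-root", text)


def _selectolax_root(text: str):
    return ("selectolax-root", text)


_ROOT_FACTORIES: Dict[str, Callable] = {
    "lxml": _lxml_root,
    "selectolax": _selectolax_root,
}


def create_root_node_sketch(text: str, parser: str = "lxml"):
    """Instantiate the selected parser backend with the provided data."""
    try:
        factory = _ROOT_FACTORIES[parser]
    except KeyError:
        raise ValueError(f"Unknown parser {parser!r}") from None
    return factory(text)


class SelectorSketch:
    def __init__(self, text: str, type: str = "html", parser: str = "lxml"):
        # Fail early when content type and parser are not a good fit.
        if parser == "selectolax" and type != "html":
            raise ValueError("selectolax can only be used with type='html'")
        self.type = type
        self.parser = parser
        self.root = create_root_node_sketch(text, parser)

    def xpath(self, query: str):
        if self.parser == "selectolax":
            raise NotImplementedError(
                "xpath() is not available with the selectolax parser"
            )
        # Placeholder result; the lxml code path is omitted from this sketch.
        return ("xpath", query)
```

Validating the `type` and `parser` combination in the constructor means a mismatch raises an error up front instead of producing selectors that silently return no match.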