scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Discussion on implementing selectolax support #239

Open deepakdinesh1123 opened 2 years ago

deepakdinesh1123 commented 2 years ago

Here are some of the changes I thought of implementing

High level changes -

  1. The Selector class takes a new argument, "parser", which indicates which parser backend to use (lxml or selectolax).
  2. Selectolax itself provides two backends, Lexbor and Modest, and uses the Modest backend by default. Should support for Lexbor be added as well? We could use Modest by default and let users pass an argument if they want to use Lexbor.
  3. If the "parser" argument is not provided, lxml will be used by default, since this preserves the current behavior and keeps backward compatibility. It also lets the existing test suite run without changes to any of the existing methods.
  4. If the xpath method is called on a selector instantiated with selectolax as the parser, raise NotImplementedError.
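
A minimal sketch of how points 1, 3 and 4 could fit together. SketchSelector and SUPPORTED_PARSERS are hypothetical names used only for illustration; this is not parsel's actual Selector, and no real parsing is done here.

```python
# Hypothetical illustration of the proposed "parser" argument.
# Nothing here is part of parsel's real API.
SUPPORTED_PARSERS = ("lxml", "selectolax")


class SketchSelector:
    """Stand-in for parsel.Selector that only models backend selection."""

    def __init__(self, text, parser="lxml"):
        # Defaulting to lxml keeps existing callers working unchanged.
        if parser not in SUPPORTED_PARSERS:
            raise ValueError(
                f"Unknown parser {parser!r}; expected one of {SUPPORTED_PARSERS}"
            )
        self.text = text
        self.parser = parser

    def xpath(self, query):
        # selectolax only understands CSS, so XPath is rejected up front.
        if self.parser == "selectolax":
            raise NotImplementedError(
                "xpath() is not available with the selectolax backend"
            )
        # Placeholder: a real implementation would delegate to lxml here.
        return f"lxml would evaluate {query!r}"
```

With this shape, SketchSelector("<p>hi</p>") behaves as before, while SketchSelector("<p>hi</p>", parser="selectolax").xpath("//p") raises NotImplementedError.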

Low level changes -

  1. Add selectolax to the list of parsers in _ctgroup and modify create_root_node to instantiate the selected parser with the provided data.
  2. Modify the xpath and css methods to work with both selectolax and lxml, or write separate methods or classes to handle each backend.
  3. Use the HTMLParser class in selectolax and its css method to apply the specified CSS expression and return the data collected.
  4. Create a SelectorList of Selector objects created with the specified type and parser.

This is still a work in progress and I will make a lot of changes. Please suggest any changes that need to be made to the current list.

Gallaecio commented 2 years ago

Your plan so far is looking good to me.

Some thoughts:

deepakdinesh1123 commented 2 years ago

JMESPath and JSON support -

The following methods may not be supported while using selectolax as the parser -

  1. register_namespace, remove_namespace - selectolax explicitly states that it is an HTML5 parser and does not support any XML or XHTML5 features, so registering and removing namespaces is not supported.
  2. remove - selectolax does support removing particular tags from the provided data, but it removes all instances of the specified tags, so deleting just one instance of a particular tag is not possible.

XML and XHTML5:

Things that I want to add to the list of changes:

  1. The existing behavior of non-standard CSS selectors will remain the same. Since selectolax does not support them, the required code will have to be written to do so.
  2. The attrib method should be modified to fetch all the attributes of a selected object returned by selectolax.

@Gallaecio Can you please review this and suggest the changes that I have to make?

Gallaecio commented 2 years ago
deepakdinesh1123 commented 2 years ago
Gallaecio commented 2 years ago

I looked into adding support for non-standard CSS selectors in selectolax [and] I thought of implementing the support by manually processing the query for a Selector of type Selectolax.

So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax? That may work for unsupported syntax that is syntax sugar, but I do not see how that would work for things that cannot be otherwise expressed with supported syntax (::text is a good example).

Since only CSS selectors are supported when selectolax is used, supporting ::text for text extraction would allow similar behavior to exist between different parsers. Internally Node.html can be used to extract HTML data when ::text is not used and Node.text() can be used to extract text from selected node when ::text is used.

I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle ::attr(href), ::text (a single expression to match any href attribute and any text)?

deepakdinesh1123 commented 2 years ago

So you are thinking of transforming an input expression with unsupported syntax into an equivalent expression with supported syntax before you pass it to Selectolax?

I was thinking of removing ::text from the expression, applying the rest, and extracting the text using Node.text() on the obtained node. For ::attr(name), I would apply the expression without ::attr(name) and extract the attribute using Node.attributes[name] on the obtained node.
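
The stripping step described here could look roughly like the sketch below. split_pseudo_element is a hypothetical helper, not something parsel or selectolax provides, and it only handles a single selector with one trailing pseudo-element.

```python
import re

# Matches a trailing ::text or ::attr(name), the non-standard
# pseudo-elements that parsel accepts in CSS expressions.
_PSEUDO_RE = re.compile(r"::(?:text|attr\(\s*([\w-]+)\s*\))\s*$")


def split_pseudo_element(query):
    """Return (base_selector, kind, attr_name) for one selector.

    kind is "text", "attr" or "node" (no pseudo-element found).
    """
    match = _PSEUDO_RE.search(query)
    if match is None:
        return query.strip(), "node", None
    base = query[: match.start()].strip()
    if match.group(1) is None:
        return base, "text", None
    return base, "attr", match.group(1)


print(split_pseudo_element("a.title::text"))
print(split_pseudo_element("a::attr(href)"))
```

The base selector would then be passed to the selectolax css method, and kind would decide whether to read Node.text(), Node.attributes[name] or Node.html. One caveat, as far as I understand the two libraries: in parsel ::text selects an element's direct text nodes, while Node.text() can also include text from descendants depending on its arguments, so the two may not return identical results.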

Gallaecio commented 2 years ago

I think you may be overestimating how complex CSS Selector expressions can be. For example, how do you handle ::attr(href), ::text (a single expression to match any href attribute and any text)?
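
To make the concern concrete: a selector group such as "a::attr(href), p::text" asks for a different kind of extraction in each comma-separated part, so the pseudo-element cannot be stripped from the expression as a whole. The sketch below uses a hypothetical helper, split_selector_group, which splits on top-level commas only. It assumes no backslash escapes in the expression.

```python
def split_selector_group(query):
    """Split a CSS selector group on top-level commas only.

    Commas inside (), [] or quoted strings are kept, so that
    ":is(a, b)" or '[title="x, y"]' are not broken apart.
    """
    parts, current = [], []
    depth = 0      # nesting level of () and []
    quote = None   # the quote character currently open, if any
    for char in query:
        if quote:
            if char == quote:
                quote = None
        elif char in "\"'":
            quote = char
        elif char in "([":
            depth += 1
        elif char in ")]":
            depth -= 1
        elif char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(char)
    parts.append("".join(current).strip())
    return parts


print(split_selector_group("a::attr(href), p::text"))
print(split_selector_group(":is(a, b) > span::text, img::attr(src)"))
```

Each part would then need its own query and its own extraction rule, and the per-part results would presumably have to be merged back into a single ordered list, which is the kind of complexity the question above points at.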