Support Zorba as an alternative XML/HTML processing engine

scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

BSD 3-Clause "New" or "Revised" License

Support Zorba as an alternative XML/HTML processing engine #29

Closed: gerosalesc closed this issue 5 years ago

gerosalesc commented 8 years ago

This has been on my mind for some time: I would like this project to support a more powerful XML/HTML processing engine as an alternative to lxml. The only contender to lxml in Python is Zorba. But why?

Zorba supports XQuery technology as well as JSONiq.
Zorba has Python bindings. I know they are not exactly the best bindings ever, but at least they exist.
I think XPath 1.0 is very limited for more complex structures.
The lxml extensions are OK, but they fall short of what XQuery offers out of the box.
Zorba can be hosted as a service.

Ideally, we should be able to use selectors with Zorba in this way:

Selector(response=response).xquery('...').extract() or response.selector.xquery('...').extract()
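
To make the proposed call chain concrete, here is a minimal sketch of how an optional engine could sit behind an xquery() method. Everything in it is hypothetical: XQuerySelector, XQueryResult, the XQueryEngine callable shape, and echo_engine are illustration names, not part of parsel or of any Zorba binding, and no real XQuery evaluation happens.

```python
from typing import Callable, List

# Hypothetical engine interface: any callable taking (document_text, query)
# and returning the matched values as strings. A real XQuery binding would
# need an adapter with this shape; none is shown here.
XQueryEngine = Callable[[str, str], List[str]]


class XQueryResult:
    """Holds query results and mimics the .extract() call from the proposal."""

    def __init__(self, values: List[str]) -> None:
        self._values = list(values)

    def extract(self) -> List[str]:
        return list(self._values)


class XQuerySelector:
    """Hypothetical selector that forwards XQuery strings to a pluggable engine."""

    def __init__(self, text: str, engine: XQueryEngine) -> None:
        self._text = text
        self._engine = engine

    def xquery(self, query: str) -> XQueryResult:
        return XQueryResult(self._engine(self._text, query))


def echo_engine(text: str, query: str) -> List[str]:
    """Stand-in engine for demonstration only: it evaluates nothing and
    just reports what it was asked to run."""
    return [f"query={query!r} on {len(text)} chars"]


result = XQuerySelector("<html></html>", echo_engine).xquery("//title").extract()
```

Keeping the engine behind a plain callable would leave lxml as the default and make the XQuery backend an optional dependency, which matches the "alternative engine" framing above.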

eliasdorneles commented 8 years ago

Hello @gerosalesc !

So, to be fair I don't see lxml going away anytime soon, but this looks like a nice optional addition.

I'm not really familiar with Zorba or its bindings, but this seems worth a proof-of-concept. Could you please point me to some use cases where supporting XQuery would give the biggest benefits?

Thank you!

gerosalesc commented 8 years ago

@eliasdorneles Hi there buddy. I have found myself needing some XQuery features when doing serious work to extract values from highly complex HTML pages.

Take the FLWOR syntax, for example. That alone would allow us to sort the values of a list of elements. On top of that, you can return more complex structures and perform some interesting data comparisons and transformations with XPath 2.0 functions, which XQuery supports by default.
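
To illustrate the sorting point, here is a rough sketch of what has to happen today when the query language cannot sort: select the elements, then sort them in Python. It uses the standard library's xml.etree.ElementTree rather than parsel so that it runs without extra dependencies, and the sample document and the data-price attribute are made up for the example. The FLWOR expression in the comment shows how the same ordering could be expressed inside an XQuery query.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
DOC = """
<ul>
  <li data-price="19.99">Keyboard</li>
  <li data-price="4.50">Cable</li>
  <li data-price="120.00">Monitor</li>
</ul>
"""

# With XQuery, the ordering could live in the query itself:
#   for $item in //li
#   order by xs:decimal($item/@data-price)
#   return string($item)


def names_sorted_by_price(xml_text):
    """Select all <li> elements, then sort them by price in Python."""
    root = ET.fromstring(xml_text)
    items = root.findall(".//li")
    # The sort step happens outside the query language.
    items.sort(key=lambda el: float(el.get("data-price")))
    return [el.text for el in items]


sorted_names = names_sorted_by_price(DOC)
```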

I understand that we are tightly coupled to lxml, but I think this change would open up a whole world of new possibilities for this library.

For a PoC I see myself using the XQilla bindings because it seems to be easier. BTW you guys should consider XQilla as well as Zorba.

Gallaecio commented 5 years ago

https://github.com/28msec/zorba seems dead, should we close this?