shepmaster / sxd-xpath

An XPath library in Rust
Apache License 2.0

Implement XPath on traits instead of concrete types #120

Open vandenoever opened 6 years ago

vandenoever commented 6 years ago

sxd-xpath works with sxd-document. Data needs to be converted to an sxd-document DOM before an XPath can be run on it.

If sxd-xpath worked on traits, it could be used on any data structure that implements those traits.

The traits might look something like this:

pub trait Node<'a> {
    fn as_attribute(&self) -> Option<&Self>
    where
        Self: Attribute<'a>,
    {
        None
    }
    fn as_element(&self) -> Option<&Self>
    where
        Self: Element<'a>,
    {
        None
    }
    fn as_text(&self) -> Option<&Self>
    where
        Self: Text<'a>,
    {
        None
    }
}

pub trait QName {
    fn namespace_uri(&self) -> Option<&str>;
    fn local_part(&self) -> &str;
}

pub trait NamedNode<'a>: Node<'a> {
    type QName: QName;
    fn name(&self) -> &Self::QName;
}

pub trait Attribute<'a>: NamedNode<'a> {
    type AttributeValue: Into<String>;
    fn value(&self) -> &Self::AttributeValue;
}

pub trait Element<'a>: NamedNode<'a> {
    type Attribute: Attribute<'a> + 'a;
    type AttributeIter: Iterator<Item = &'a Self::Attribute>;
    type Child: Node<'a> + 'a;
    type ChildIter: Iterator<Item = &'a Self::Child>;

    fn attributes(&'a self) -> Self::AttributeIter;
    fn children(&'a self) -> Self::ChildIter;
}

pub trait Text<'a>: Node<'a> {
    fn data(&self) -> &str;
}
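To illustrate the idea of running a query over "any data structure", here is a minimal sketch. It does not use the traits above verbatim: `XmlNode`, `select`, and `Entry` are hypothetical names invented for this example, and the "query" is only a chain of child steps matched by name, not real XPath and not sxd-xpath's API.

```rust
/// Hypothetical, much-simplified stand-in for the traits proposed above.
pub trait XmlNode {
    /// Element-like name of this node, or None for unnamed nodes.
    fn name(&self) -> Option<&str>;
    /// Child nodes in document order.
    fn children(&self) -> Vec<&Self>;
    /// Text payload of this node, if any.
    fn text(&self) -> Option<&str>;
}

/// Follow a chain of child steps (roughly `a/b/c`) on any backend.
pub fn select<'a, N: XmlNode>(root: &'a N, steps: &[&str]) -> Vec<&'a N> {
    let mut current = vec![root];
    for step in steps {
        let mut next = Vec::new();
        for node in current {
            for child in node.children() {
                if child.name() == Some(*step) {
                    next.push(child);
                }
            }
        }
        current = next;
    }
    current
}

/// A non-XML data structure (assumed for illustration): a file-tree entry.
pub struct Entry {
    pub kind: &'static str,
    pub label: String,
    pub entries: Vec<Entry>,
}

impl XmlNode for Entry {
    fn name(&self) -> Option<&str> {
        Some(self.kind)
    }
    fn children(&self) -> Vec<&Self> {
        self.entries.iter().collect()
    }
    fn text(&self) -> Option<&str> {
        Some(&self.label)
    }
}
```

The evaluator only sees the trait, so the same `select` function would work for a DOM, a file tree, or a spreadsheet model, provided each one implements `XmlNode`.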

leoschwarz commented 6 years ago

I guess the biggest motivation for doing this would be decoupling the XPath parser and evaluator from sxd-document so that it could be used with other backends too. I suspect the main motivation would be to be able to parse huge documents not fitting into memory? I don't think the use case of running XPath against arbitrary data is common enough to justify such a change by itself (abstraction like this makes code more complex), and I think having anyone who wants something like that create a document manually is the most reasonable decision.

I wonder if this is the right approach here though, since I don't know how feasible it is to evaluate XPath without a DOM. I can see a lot of complications, as it's possible to have both forward and backward dependency in XPath queries. If there is already an example of a library providing this or a specification of how this would have to be done properly, that would be really valuable.

shepmaster commented 6 years ago

so that it could be used with other backends too

The biggest I've heard of would be html5ever, which is indeed a DOM structure.

huge documents not fitting into memory

Having a "streaming XPath" is a truly interesting idea, but I'm not sure how one would go about it. As you mention:

it's possible to have both forward and backward dependency in XPath queries

It's definitely not possible for an arbitrary XPath to be applied in such a manner, so we'd have to either limit the input or determine if a given XPath is "streamable".
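As a rough sketch of what "determine if a given XPath is streamable" could mean, the check below accepts a path only if every step uses an axis that stays at or below the context node. The `Axis` enum and `is_streamable` function are hypothetical names for this example, not part of sxd-xpath. The rule is deliberately conservative and incomplete: it ignores predicates and function calls, which can also force look-ahead or look-behind.

```rust
/// The XPath 1.0 axes, minus `namespace` (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Axis {
    Child,
    Descendant,
    DescendantOrSelf,
    Attribute,
    SelfAxis,
    Parent,
    Ancestor,
    AncestorOrSelf,
    Following,
    FollowingSibling,
    Preceding,
    PrecedingSibling,
}

impl Axis {
    /// True when the axis selects only the context node, its attributes,
    /// or nodes inside its subtree, so a single forward pass can see them.
    pub fn is_downward(self) -> bool {
        matches!(
            self,
            Axis::Child
                | Axis::Descendant
                | Axis::DescendantOrSelf
                | Axis::Attribute
                | Axis::SelfAxis
        )
    }
}

/// Conservative check: every step must use a downward axis.
pub fn is_streamable(steps: &[Axis]) -> bool {
    steps.iter().all(|axis| axis.is_downward())
}
```

Under this assumed rule, `/a/b//c` (child, child, descendant-or-self, child) would pass, while `/a/b/../c` would be rejected because of the parent step.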

an example of a library providing this or a specification of how this would have to be done properly, that would be really valuable.

Agreed.

shepmaster commented 6 years ago

If someone really did want to apply these against html5ever, I think the strongest path would be to spin up a branch that just wildly hacks this crate to work against those nodes. That would give very concrete ideas about what kind of abstraction is needed.

vandenoever commented 6 years ago

The C++ library Qt supports XQuery (and XPath) on classes that derive from QAbstractXmlNodeModel.

http://doc.qt.io/qt-5/qabstractxmlnodemodel.html

http://doc.qt.io/qt-5/xquery-introduction.html

I suspect the main motivation would be to be able to parse huge documents not fitting into memory?

A backend that can place cursors in enormous documents would allow this. This might have indexes on nodes. XML databases do this.
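A cursor-style backend could be sketched roughly as follows. All names here (`Cursor`, `Store`, `Record`, `StoreCursor`, `child_names`) are hypothetical and only illustrate the shape. The "store" is an in-memory vector standing in for an indexed on-disk structure, and the cursor is a small handle that navigates by index instead of holding a whole tree.

```rust
/// Hypothetical navigation interface: a cursor is a cheap handle to one node.
pub trait Cursor: Sized {
    fn first_child(&self) -> Option<Self>;
    fn next_sibling(&self) -> Option<Self>;
    fn parent(&self) -> Option<Self>;
    fn name(&self) -> &str;
}

/// One indexed record; relatives are referenced by position in the store.
pub struct Record {
    pub name: String,
    pub parent: Option<usize>,
    pub first_child: Option<usize>,
    pub next_sibling: Option<usize>,
}

/// Toy stand-in for an indexed document store (assumed, not a real database).
pub struct Store {
    pub records: Vec<Record>,
}

#[derive(Clone, Copy)]
pub struct StoreCursor<'a> {
    pub store: &'a Store,
    pub index: usize,
}

impl<'a> StoreCursor<'a> {
    /// Look up the record this cursor currently points at.
    fn record(&self) -> &'a Record {
        &self.store.records[self.index]
    }
    /// Build a cursor for another record in the same store, if it exists.
    fn at(&self, index: Option<usize>) -> Option<Self> {
        index.map(|index| StoreCursor { store: self.store, index })
    }
}

impl<'a> Cursor for StoreCursor<'a> {
    fn first_child(&self) -> Option<Self> {
        self.at(self.record().first_child)
    }
    fn next_sibling(&self) -> Option<Self> {
        self.at(self.record().next_sibling)
    }
    fn parent(&self) -> Option<Self> {
        self.at(self.record().parent)
    }
    fn name(&self) -> &str {
        &self.record().name
    }
}

/// Walk the child axis using only cursor moves.
pub fn child_names<C: Cursor>(node: &C) -> Vec<String> {
    let mut names = Vec::new();
    let mut next = node.first_child();
    while let Some(child) = next {
        names.push(child.name().to_string());
        next = child.next_sibling();
    }
    names
}
```

Because `parent` is just another index lookup, backward steps do not require the whole document to be materialized, which is the property an index-backed store would rely on.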

shepmaster commented 6 years ago

on classes that derive from QAbstractXmlNodeModel.

Do you know of any other concrete implementations of that base model? I see QSimpleXmlNodeModel, but is there a way to tell if this is used anywhere else?

vandenoever commented 6 years ago

There is one for HTML documents: https://github.com/jgehring/qhtmlnodemodel

Qt comes with an example for file trees: https://code.woboq.org/qt5/qtxmlpatterns/examples/xmlpatterns/filetree/filetree.cpp.html

Here's a blog with the rationale for the use of an abstract node model: https://englich.wordpress.com/2007/11/15/query-your-toaster/

shepmaster commented 6 years ago

Cool, thank you! What was the specific use case that made you originally open this issue?

vandenoever commented 6 years ago

KDE has a few uses of it. One maps binary MS Office documents to a QAbstractXmlNodeModel.

https://lxr.kde.org/ident?_i=QAbstractXmlNodeModel

https://lxr.kde.org/source/playground/libs/binschema/cpp/msoxmlnodemodel.cpp

vandenoever commented 6 years ago

I was thinking of writing some XPath code and noticed quite a few XML implementations in Rust. Quite a few developers have started XML parsers and DOMs with different trade-offs. For each of them, adding XPath is quite a task. For developers who want to use XPath in Rust code, there's not much choice.

My concrete use case at the time was working with gigabyte spreadsheets. I ended up parsing into a special struct and had to forgo the convenience of sxd-xpath.