w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

XPath Selector #95

Closed azaroth42 closed 8 years ago

azaroth42 commented 8 years ago

Also discussed at TPAC: a range selector using start and end XPaths. A single XPath can already be handled via the FragmentSelector, but it was determined that, even if all useful ranges could be expressed in XPointer, that functionality is not readily available; two XPaths in a selector would be more useful and add minimal complexity to the specification.

Example:

{
  "type": "XPathRangeSelector",
  "startPath": "/html/body/div[4]/p[1]/b",
  "endPath": "/html/body/div[4]/p[3]"
}
BigBlueHat commented 8 years ago

@tilgovi pointed out that when these hit the DOM Range API, there's also a commonAncestorContainer Node involved (and available to that API).

As such, we'll need to define how the text content found between startPath and endPath is defined, such that we can use a TextPositionSelector accurately (and consistently across implementations) as a subSelector (see #93).

We'll face a similar scenario with CSS Selector in #94.

Not impossible, but needs defining. :smile:

tilgovi commented 8 years ago

More importantly, startPath and endPath may each be insufficient to select the boundary points. I brought up (on the mailing list) the need for a more generic range selector that combines a start selector and an end selector (each could be of any type, and have their own subSelector if that becomes a thing) but no one has responded.

azaroth42 commented 8 years ago

I'd like to see use cases for those situations before adding in that level of complexity.

tilgovi commented 8 years ago

Hypothesis exemplifies this use case every day. Its XPath Range selector has boundaries defined by both XPath to a node and text offset within that node. That the XPath selects an Element and the offset a text position accounts for the unpredictability of the slicing of Text Nodes and the possible presence of other highlight spans. Arguably, an even better implementation would ignore all inline phrasing Elements and measure text offset from the beginning of the block.

Not letting each boundary be specified further than an XPath string effectively makes this selector unsuitable for use in Hypothesis.

azaroth42 commented 8 years ago

That's a particular implementation, not a use case. In order to determine the value vs complexity equation, it's necessary to understand the situations in which it is necessary and why the proposed solution is inadequate for those situations. Thanks Randall!

tilgovi commented 8 years ago

"That the XPath selects an Element and the offset a text position accounts for the unpredictability of the slicing of Text Nodes and the possible presence of other highlight spans. Arguably, an even better implementation would ignore all inline phrasing Elements and measure text offset from the beginning of the block."

Is that not a use case? Selecting text within a block such that differences in Text Node slicing and inline Elements don't break the selector?

azaroth42 commented 8 years ago

I don't see how text node slicing or inline elements affect TextPositionSelector, as the text MUST be normalized. The use case can be fulfilled without the additional XPath selector. The selector makes it significantly more likely to be accurate, without adding significant complexity. Specifying it as a separate selector means we can alternatively use CssSelector to find the block. What does the proposal add? We need to know that so that, as a working group, we can evaluate the complexity of the proposal against the added value.

tilgovi commented 8 years ago

I grant that you can insist the selector be evaluated against normalized text Nodes. That's a good point.

But my understanding of XPath doesn't include any way to measure text offsets from Elements, only within Text Nodes, though I may be missing something.

The selector can be more robust against formatting changes by ignoring inline content. This requires measuring from some ancestor Element, a block element in this example. One might get more particular, measuring from the start of an article or main tag, or a p tag ancestor.

Making Range generic and letting XPath stand alone would mean that you could describe the boundaries as CSS or XPath and optionally offsets therefrom.

liamquin commented 8 years ago

On 2015-11-03 20:04, Randall Leeds wrote: [...]

But my understanding of XPath doesn't include any way to measure text offsets from Elements, only within Text Nodes, though I may be missing something.

(Sorry if I'm chiming in without enough context here)

I am not sure what you mean by measuring text offsets from elements. I'll guess that, in

The happy boy jumped for joy when he saw the cheesecake!

if you're annotating "boy" you want to count the number of characters in "the happy " beforehand? In which case yes, XPath 1 can do that:

string-length(substring-before(., "boy")) for example.

You can use string-length() and substring() and (for more robustness perhaps) substring-before() on the string value of any node, including the entire subtree.
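
As a rough illustration of the point about string values (a minimal sketch, not part of the original discussion): the snippet below uses Python's standard xml.etree.ElementTree instead of a real XPath engine, and the helper names string_value and offset_of are hypothetical. An inline em element is added to the example sentence, as an assumption, to show that the offset is counted over the concatenated text of the whole subtree.

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate all descendant text, similar to XPath string() on an element."""
    return "".join(element.itertext())


def offset_of(element, needle):
    """Roughly string-length(substring-before(., needle)).

    Unlike XPath, which yields 0 when the needle is absent, this returns -1.
    """
    return string_value(element).find(needle)


# Hypothetical markup: the inline <em> does not change the character count.
p = ET.fromstring("<p>The <em>happy</em> boy jumped for joy!</p>")
print(string_value(p))
print(offset_of(p, "boy"))
```

The offset is the same whether or not "happy" is wrapped in inline markup, which is the robustness property being discussed.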

The selector can be more robust against formatting changes by ignoring inline content.

That makes it robust against text changes that preserve markup, but not against markup changes. It's a tradeoff. Using @id values can help with robustness in many (not all) environments.

This requires measuring from some ancestor Element, a block element in this example. One might get more particular, measuring from the start of an article or main tag, or a p tag ancestor.

A traditional way to do this in hypertext, structured editors and elsewhere is with tumblers; there's considerable implementation experience, some of which was reflected in XPointer. What does it mean to "measure" a tree? Remember that (at least in theory) text/html may get line endings rewritten by proxies, so using normalized text is essential.

Making Range generic and letting XPath stand alone would mean that you could describe the boundaries as CSS or XPath and optionally offsets therefrom.

OK, that makes sense I think,

Liam

Liam Quin, W3C XML Activity Lead; Digital publishing; HTML Accessibility

fhirsch commented 8 years ago

Should I be getting nervous about XPath and the potential need for text normalization and canonicalization? Are there performance costs associated with normalization/canonicalization, and can they be avoided?

Does findtext eliminate the need for XPath in our use cases? Can it? Should it?

http://w3c.github.io/findtext/

tilgovi commented 8 years ago

What I was missing was the string function and the fact that the string-value of an element includes the concatenation of the string-value of each of its children.

Using a construction like string(/html/body/article/p[1]) would fulfill my use case.

So, for instance, the range that encompasses the text between the 10th character of the first paragraph and the 8th character of the third paragraph of the article:

{
  "type": "XPathRangeSelector",
  "startPath": "substring(string(//article/p[1]), 10)",
  "endPath": "substring(string(//article/p[3]), 0, 8)"
}

The only problem with this in practice for DOM selections is that the path expressions would have to be tokenized manually first since the standard XPath APIs would give you a string result for each of these.

For example, to process this example for the use case of highlighting the resulting range, one would need to write code to parse out the //article/p[n] references and the substring function offsets, use the DOM XPath APIs to look up the node references and then iterate text nodes to find the container/offset suitable for constructing a DOM Range.

That's doable, but the first step could be avoided if the startPath and endPath are each selectors that point to a node with sub-selectors that specify the text offsets within. It avoids the need to tokenize the paths.

I could live with it, though.
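
The tokenizing step described above might look something like the following. This is only a sketch under stated assumptions: the regular expression and the function name split_expression are hypothetical, and the pattern recognizes nothing beyond the exact substring(string(PATH), START[, LENGTH]) shape used in the example.

```python
import re

# Hypothetical pattern: accepts only substring(string(PATH), START[, LENGTH]).
_SUBSTRING_RE = re.compile(
    r"^\s*substring\(\s*string\(\s*(?P<path>.+?)\s*\)\s*,"
    r"\s*(?P<start>\d+)\s*(?:,\s*(?P<length>\d+)\s*)?\)\s*$"
)


def split_expression(expression):
    """Split the expression into (path, start, length); length may be None."""
    match = _SUBSTRING_RE.match(expression)
    if match is None:
        raise ValueError("unsupported expression: " + expression)
    length = match.group("length")
    return (
        match.group("path"),
        int(match.group("start")),
        int(length) if length is not None else None,
    )


print(split_expression("substring(string(//article/p[1]), 10)"))
print(split_expression("substring(string(//article/p[3]), 0, 8)"))
```

The path part could then be handed to an XPath API to obtain a node, while the numbers are kept aside as text offsets. Needing this kind of ad hoc parser is the cost that separate sub-selectors would avoid.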

tilgovi commented 8 years ago

I would drop the Range from the name, though, since we're not calling it TextPositionRange or DataPositionRange even though both of those also describe a range of characters or octets.

azaroth42 commented 8 years ago

:+1: to dropping "Range"

tkanai commented 8 years ago

@tilgovi I'm uncertain whether the pointed objects would be included or excluded in/from the range. Could you tell me how to select "I (love)" words, or both "I" and the heart mark Image, from the html text below with the XPathSelector? I also would like to make sure how to select "I" only. <p>I <img src="love.png" /> New York</p>

As I frequently encounter such paragraphs while I'm reading Japanese eBooks, although images are "Kanji" characters, I am looking for an appropriate selector which can be applicable for non-normalized HTML documents.
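
To make the difficulty concrete, here is a small sketch (using Python's standard xml.etree.ElementTree as a stand-in for a DOM, which is an assumption) showing that the image is a child node of the paragraph but contributes no characters to its text. A selector that only counts characters therefore has nothing to point at for the image.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>I <img src="love.png" /> New York</p>')

# The text content skips the image entirely: only "I " and " New York" remain.
text = "".join(p.itertext())
children = [child.tag for child in p]

print(repr(text))
print(children)
```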

tbdinesh commented 8 years ago

"startPath": "substring(string(//article/p[1]), 10)"

substring(string(//article/p[1]), 10) is not a path, right? So we need to have both the path and the offset - something like

"startPath": "//article/p[1], offset(string,10)",


tilgovi commented 8 years ago

You're right.

I had a similar concern: these are valid XPath expressions, but they are not location paths and they evaluate to strings, not node sets.

I don't think there is any path expression that selects characters. That's outside of the XPath model, and into XPointer territory.

A generic range with two boundaries, each an XPath subSelector TextPosition, still seems like a correct description of a text selection to me.

BigBlueHat commented 8 years ago

@tilgovi so...something like:

{
  "selector": {
    "type": "Range",
    "startSelector": {
      "type": "XPathSelector",
      "path": "//article/p[1]",
      "subSelector": {
        "type": "TextPositionSelector",
        "start": 10,
        "end": 10
      }
    },
    "endSelector": {
      "type": "XPathSelector",
      "path": "//article/p[3]",
      "subSelector": {
        "type": "TextPositionSelector",
        "start": 0,
        "end": 8
      }
    }
  }
}

That seems to actually specify a node and a point within it and another node and a range within it.

Personally, I found this construction to make more sense

{
  "selector": {
    "type": "XPathSelector",
    "startPath": "//article/p[1]",
    "endPath": "//article/p[3]",
    "subSelector": {
      "type": "TextPositionSelector",
      "start": 10,
      "end": "...whatever the end # would be within the normalized output of the text between p[1] & p[3]..."
    }
  }
}

Obviously things fall down (currently) for the value of end in the TextPositionSelector...but that seems definable.

Do those examples present both sides accurately? In either case, what am I missing (which I'm sure is something :smile: )?

tilgovi commented 8 years ago

As can often be the case with selections and anchoring, it's hard to know what the most "semantic" selection description is. Should it be character offsets within the text of two elements and all those between them? Or should you measure offsets from (either after or before) the end elements?

Your first one gets closest to what I would use for describing DOM Ranges. I would change both TextPositionSelectors to be zero-width points. The selection is the range of text between (left inclusive) in document order.

That's just my pick for the most faithful representation of the DOM object, not the most semantic description of the user intent, which may be to select from the start of one phrase to the end of another.

And if we really wanted a faithful representation of DOM Range we would select the container node for each boundary point any way we please (fragment, css, xpath) and then describe the offset either in terms of a text position or an n-th child sort of construction in css or xpath.

Your second example is reasonable, but describes things differently. It would also not be precluded by the start and end being their own selectors. The start and end being selectors and sub-selecting from the range are two different needs.

A DOM Range, boundary points in the text.

{
  "selector": {
    "type": "Range",
    "startSelector": {
      "type": "XPathSelector",
      "path": "//article/p[1]/text()",
      "subSelector": {
        "type": "TextPositionSelector",
        "start": 10,
        "end": 10
      }
    },
    "endSelector": {
      "type": "XPathSelector",
      "path": "//article/p[3]/text()",
      "subSelector": {
        "type": "TextPositionSelector",
        "start": 8,
        "end": 8
      }
    }
  }
}
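
For the variant where the path selects an element and a zero-width TextPositionSelector gives an offset into its text, the boundary point still has to be mapped to a text node and a local offset. A minimal sketch of that walk follows. It uses Python's xml.dom.minidom as a stand-in for a browser DOM, the function names text_nodes and boundary_point are hypothetical, and the sample markup is invented for illustration.

```python
from xml.dom import Node, minidom


def text_nodes(node):
    """Yield descendant Text nodes in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def boundary_point(element, offset):
    """Map a character offset in the element's text to (text_node, local_offset)."""
    remaining = offset
    last = None
    for text in text_nodes(element):
        length = len(text.data)
        if remaining < length:
            return text, remaining
        remaining -= length
        last = text
    # An offset equal to the total length points just past the last character.
    if last is not None and remaining == 0:
        return last, len(last.data)
    raise ValueError("offset is past the end of the text")


doc = minidom.parseString("<p>And so <b>it is</b> with kittens.</p>")
node, local = boundary_point(doc.documentElement, 10)
print(repr(node.data), local)
```

Two such boundary points, one from each of startSelector and endSelector, are what a DOM Range needs as its start and end container/offset pairs.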

A more "human" quote range, just as an example:

{
  "selector": {
    "type": "Range",
    "startSelector": {
      "type": "CssSelector",
      "cssSelector": "article",
      "subSelector": {
        "type": "TextQuoteSelector",
        "exact": "And so it is with"
      }
    },
    "endSelector": {
      "type": "CssSelector",
      "cssSelector": "article",
      "subSelector": {
        "type": "TextQuoteSelector",
        "exact": "the cuteness of kittens."
      }
    }
  }
}

A generic Range is really the flexible thing. It allows some new constructions that could turn out to be quite useful.

BigBlueHat commented 8 years ago

@tilgovi the last example feels very odd to me, but I can certainly see the case for the generic Range class.

Perhaps it needs its own issue at this point. :smile:

tilgovi commented 8 years ago

You're right. It is odd. It would probably be better constructed as CssSelector subSelector RangeSelector(TextQuoteSelector, TextQuoteSelector) rather than repeating the CSS bit.

BigBlueHat commented 8 years ago

Also, I don't want us to overlook @tkanai's point about "looking for an appropriate selector which can be applicable for non-normalized HTML documents."

We don't have one of those in the model yet...and we SHOULD...or perhaps that's a MUST. :wink:

tilgovi commented 8 years ago

I don't think it detracts from my point, though, which was to show how using TextQuoteSelector (or any other selector) for each boundary point may be a useful, semantic way to describe a range in some use cases.

To summarize my feelings about the current proposal for XPathSelector:

The last path component would be /node()[x] where x is the context position for the boundary offset.

Since the closest correct expression would evaluate to a string and what is needed is the Node context an implementation for DOM would need to tokenize the XPath expression into a context part and a string offset part. The former could be evaluated with document.evaluate and the latter parsed to determine the offsets. I think using XPath to select a node and using subSelector to select its character data would be easier on implementors.

I feel as though it would be useful for other media types, too, but not sure how to feel about the resulting overlap with other selectors that already describe ranges for certain media types like TextPositionSelector and DataPositionSelector.

BigBlueHat commented 8 years ago

@tilgovi after reading DOM Range and your work on xpath-range (and specifically the toRange() method), I'm beginning to understand your desire for a Range object.

Having flexibility in how we express the Range.startContainer and Range.endContainer as well as the ability to state that the selection should include more than the "stringification" (to address #107) seems key to solving the "DOM-based" use cases, simplifying the construction of selector-to-range code, and (possibly) making the whole thing more sensible to the watching world. :smile:

So :+1: to you creating an issue for a Range object in which we can stuff various selectors. :smile_cat:

BigBlueHat commented 8 years ago

@tilgovi do you have an interest (and/or time) to craft a Range proposal as a separate issue?

I'm going to drop Range from the name of this issue, and (hopefully!) refocus it on XPath Selector bits.

BigBlueHat commented 8 years ago

@tilgovi you might want to have a look at range() which is in the XPointer Framework Registry and works on XML docs. /cc @iherman

azaroth42 commented 8 years ago

Can we get a proposal for this for the 2016-01-27 call please? Otherwise postpone?

tilgovi commented 8 years ago

I think there are at least two proposals:

Here's a run-down of a summary and pros and cons for each.

XPointer

Use the XPointer fragment syntax with the fragment selector.

Pros:

Cons:

XPath selector

Pros:

Cons:

Range selector

Pros:

Cons:

tilgovi commented 8 years ago

I clearly rushed through that a little bit. That's three proposals (not two) and I only summarized the first. If anything needs more explaining, please let me know or ask some questions. Thanks!

azaroth42 commented 8 years ago

Thanks @tilgovi!

Given that XPointer is possible today, I don't think we need to discuss it? We're certainly not going to take it out of the model :) And then I think it would be good to split XPath and Range into separate issues, with example JSON for how they would be represented. Then we could close this issue and #107.

Agree?

BigBlueHat commented 8 years ago

:+1: works for me.

tilgovi commented 8 years ago

:+1:

azaroth42 commented 8 years ago

Ping. Can the XPath and Range issues be split out before the call on Friday please? Otherwise, can we postpone?

tilgovi commented 8 years ago

Opened #153 for the Range Selector.

tilgovi commented 8 years ago

I'll leave this open as the issue to represent an XPath Selector. Consider this and #153 to be the split discussed above, please!

azaroth42 commented 8 years ago

Given the separation of Range, and to also make this issue more concrete with a proposal by example:

{
  "type": "XPathSelector",
  "value": "/html/body/p[1]"
}

Thus a new class (XPathSelector, subClassOf Selector) and reuse rdfs:value to hold the value of the selector.
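
A consumer of this selector might resolve it roughly as follows. This is a sketch under several assumptions: it uses Python's xml.etree.ElementTree, which supports only a small XPath subset and does not accept absolute paths on elements, so the hypothetical helper select strips the leading root step by hand. Splitting the path on "/" is naive and would break on predicates that themselves contain a slash.

```python
import xml.etree.ElementTree as ET


def select(root, selector):
    """Resolve a simple absolute XPathSelector value against an ElementTree root."""
    if selector.get("type") != "XPathSelector":
        raise ValueError("not an XPathSelector")
    steps = selector["value"].strip("/").split("/")
    # ElementTree paths are relative to the root element, so check it by hand.
    if steps[0] != root.tag:
        return None
    if len(steps) == 1:
        return root
    return root.find("./" + "/".join(steps[1:]))


root = ET.fromstring("<html><body><p>first</p><p>second</p></body></html>")
selector = {"type": "XPathSelector", "value": "/html/body/p[1]"}
print(select(root, selector).text)
```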

iherman commented 8 years ago

+1

iherman commented 8 years ago

While I am in favor of having an XPath selector, there are some issues we should be aware of if the WG accepts this proposal. These are all sub-issues that must be reflected, somehow, in the final document.

XPath and DOM

Formally, XPath is defined through a separate XPath datamodel document. That document, essentially, says that it relies on the (XML) infoset specification. That is an XML document, whereas HTML5 is not. I have asked our staff colleague (Carine), and this is what she said:

The Web Annotation WG can use the XPath/XQuery data model if they need to, as long as they carefully study compatibility with the constructs to which they want to apply it. We used to have such a document for DOM Level 3, https://www.w3.org/TR/DOM-Level-3-XPath/xpath.html

That could be a good starting point to evaluate whether DOM4 has departed too much from the original tree model. (I doubt it has)

I think the only thing we can/should do is to add a note in the spec, referring to the DOM 3 document so that authors/implementers should be aware of how the XPath is used and defined. (Note that there is no reference to XPath in the DOM4 spec.)

XPath and HTML5

In any case, what this means is that XPath works on top of the DOM and not on top of the original HTML source. This is important to be emphasized in the spec, because the HTML5 parser may slightly rearrange the original HTML code, which may affect the validity of an XPath expression. A possible reference is:

https://www.w3.org/TR/html5/syntax.html

which describes the parser (and is therefore hell to read...). However, there are some important internal references to that section. One is:

https://www.w3.org/TR/html5/syntax.html#optional-tags

which lists the tags that may be missing in the HTML but will be added in the DOM (e.g., the tbody element if missing). Anywhere that says a start tag can be omitted, it means the parser is going to add the element to the DOM, e.g., html, head, body, colgroup, or tbody.
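
The practical consequence can be sketched as below. Python's standard library has no HTML5 tree builder, so the "as parsed" tree here is written by hand, as an assumption about what a conforming parser produces for a table whose source omits head and tbody. The paths use ElementTree's relative syntax.

```python
import xml.etree.ElementTree as ET

# The markup as an author might have written it (no head, no tbody).
as_written = ET.fromstring(
    "<html><body><table><tr><td>cell</td></tr></table></body></html>"
)

# Hand-written stand-in for the DOM after HTML5 parsing (head and tbody added).
as_parsed = ET.fromstring(
    "<html><head /><body><table><tbody><tr><td>cell</td></tr></tbody>"
    "</table></body></html>"
)

source_path = "./body/table/tr/td"       # derived from reading the source
dom_path = "./body/table/tbody/tr/td"    # derived from the resulting DOM

print(as_written.find(source_path) is not None)  # matches the source shape
print(as_parsed.find(source_path) is None)       # fails against the DOM shape
print(as_parsed.find(dom_path).text)
```

A path derived from the source text silently stops matching once the parser has inserted the implied elements, which is why the selector has to be defined against the DOM.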

Another one is:

https://www.w3.org/TR/html5/syntax.html#an-introduction-to-error-handling-and-strange-cases-in-the-parser

with all kinds of nasty situations that the parser has to take care of (and which lead to DOM modifications).

Again, what we can/should do is to add a note in the document drawing attention to these types of problems.

Normative reference issue

Another problem is the status of the XPath documents (I mean the latest, 3.1 versions). At the moment, all documents are in CR, meaning that they would be inappropriate as normative references from a Rec. Some in the reference chain have been in CR for more than a year… However, here is the info I got from Carine:

… it's expected to go to PR along with the other ones in the near future […] Working closely with the developer community, we expect to show evidence of implementations by approximately 1 March 2016. […] It should be in PR before autumn 2016.

If that happens, then we may be fine. But we will have to keep an eye on this to see if there are delays...

azaroth42 commented 8 years ago

Added note with the HTML5 reference, and changed the XPath and DOM ref to [[DOM-Level-3-XPath]].

http://w3c.github.io/web-annotation/model/wd2/#xpath-selector

tilgovi commented 8 years ago

Super helpful, thanks @iherman.

iherman commented 8 years ago

On 25 Feb 2016, at 19:09, Randall Leeds notifications@github.com wrote:

Super helpful, thanks @iherman.

I think it would be useful to add the references I cited on what changes an HTML parser does when the DOM is created. It took me a while, and help from colleagues, to find those:-(, let us save the energy and the time of our readers...

tilgovi commented 8 years ago

Yes, agree. And we should maybe make a recommendation that clients implement the parsing and normalization of the DOM first and that the XPath Selector is intended to operate on the DOM and not the bytes.

tilgovi commented 8 years ago

It may go without saying, but explicit is always better.

iherman commented 8 years ago

Oops, either I missed the reference in the note yesterday evening, or @azaroth has added it later… In any case, what is there now is fine with me, my note below is moot.

Thanks

On 25 Feb 2016, at 20:09, Ivan Herman ivan@w3.org wrote:

On 25 Feb 2016, at 19:09, Randall Leeds <notifications@github.com> wrote:

Super helpful, thanks @iherman.

I think it would be useful to add the references I cited on what changes an HTML parser does when the DOM is created. It took me a while, and help from colleagues, to find those:-(, let us save the energy and the time of our readers...

iherman commented 8 years ago

Accept the proposal, telco 2016-02-26

See http://www.w3.org/2016/02/26-annotation-irc#T16-22-32