mischov / meeseeks

An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
MIT License

Unable to select attributes via XPath #85

OldhamMade closed this 4 years ago

OldhamMade commented 4 years ago
iex> tree = "<p class=\"foo\">asd</p>" |> Meeseeks.parse()
#Meeseeks.Document<{...}>
iex> tree |> Meeseeks.one(xpath("//p")) |> Meeseeks.text()
"asd"
iex> tree |> Meeseeks.one(xpath("//p/text()")) |> Meeseeks.text()
"asd"
iex> tree |> Meeseeks.one(xpath("//p/@class")) |> Meeseeks.text()
** (ArgumentError) you attempted to apply :id on {"class", "foo"}. If you are using apply/3, make sure the module is
an atom. If you are using the dot syntax, such as map.field or module.function, make sure the left side of the dot is
 an atom or a map
    :erlang.apply({"class", "foo"}, :id, [])

One would expect @class to work the same as text() since the result is always text.

mischov commented 4 years ago

@OldhamMade There does seem to be something odd going on here. I'll try to figure out what.

I don't know that selecting attributes via /@attribute is implemented.

mischov commented 4 years ago

@OldhamMade To follow up a little, no, selecting an attribute with /@attribute is not supported currently.

There are several reasons this is the case.

Firstly, in the XML worldview everything is a node, so attributes are a kind of node that can be selected like any other. In Meeseeks, attributes aren't represented as nodes; they're stored in element nodes.

Secondly, both CSS and XPath selectors are used for selecting nodes from a Meeseeks document, not for extracting data from those nodes. The selectors find nodes that match, then return a result, which is a pointer to a node in the document. You can then extract data from the returned result.
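To illustrate the select-then-extract split, here is a minimal sketch. The HTML snippet and variable names are made up for illustration, and it assumes the :meeseeks dependency is already available in the project.

```elixir
# Assumes :meeseeks is already a dependency (e.g. listed in mix.exs).
import Meeseeks.CSS

# Hypothetical input, used only for this illustration.
document = Meeseeks.parse("<ul><li>one</li><li>two</li></ul>")

# Step 1: selection returns results (pointers to nodes), not strings.
results = Meeseeks.all(document, css("li"))

# Step 2: extraction turns each result into data.
texts = Enum.map(results, &Meeseeks.text/1)
# texts == ["one", "two"]
```

The selector only decides which nodes come back; what data you get out of them depends on the extraction function you call afterwards.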

In the above example you are able to select a text node (which Meeseeks does represent as a node), then use Meeseeks.text to extract the text from that text node. Because attributes aren't nodes, all you can do is select the element containing the attribute (i.e. //p[@class]), then extract the attribute you want with Meeseeks.attr.
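Applied to the document from the original report, that workaround might look like the following sketch. It again assumes :meeseeks is already available; the variable names are illustrative.

```elixir
# Assumes :meeseeks is already a dependency (e.g. listed in mix.exs).
import Meeseeks.XPath

document = Meeseeks.parse("<p class=\"foo\">asd</p>")

# Select the element that carries the attribute...
result = Meeseeks.one(document, xpath("//p[@class]"))

# ...then extract the attribute value from that result.
class_value = Meeseeks.attr(result, "class")
# class_value == "foo"
```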

Given the fundamental difference in whether attributes are represented as nodes or not, I don't know if it would make sense to try and support that functionality, and I'm not really sure how that could even be achieved without fundamentally changing Meeseeks's design.

Unless some other reasonable solution can be determined, what I will do is:

  1. Note in the XPath documentation that /@attribute isn't supported
  2. Return an error when parsing such an XPath selector

Thank you for your report, OldhamMade.

OldhamMade commented 4 years ago

Understood. Thanks for looking into it!

mischov commented 4 years ago

Fixed in v0.15.0.