mischov / meeseeks

An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
MIT License

Provide an XML parser #12

Closed mischov closed 7 years ago

mischov commented 7 years ago

There are people out there trying to use Meeseeks or Floki with XML files, and that can lead to some confusing results (particularly in the case of meeseeks_html5ever and html5ever_elixir, which are HTML5 spec compliant).

Rather than have people reach for the wrong tool because it's the one readily at hand, I'd prefer to provide an XML parser.

Note: This problem is unrelated to issue #11, which involves HTML documents, not XML documents.

Solution

The html5ever project also has a permissive XML parser, xml5ever, and it should not be too complicated for meeseeks_html5ever to expose functions for parsing XML.

Meeseeks.parse

The least intrusive solution is to have Meeseeks.parse/1 default to parsing HTML and to provide a Meeseeks.parse/2 that takes a second argument specifying how to parse:

Meeseeks.parse(...) # Parses as HTML
Meeseeks.parse(..., :html) # Parses as HTML
Meeseeks.parse(..., :xml) # Parses as XML

Otherwise I would need to make a breaking change and add Meeseeks.parse_html/1 and Meeseeks.parse_xml/1 in place of Meeseeks.parse/1:

Meeseeks.parse_html(...) # Parses as HTML
Meeseeks.parse_xml(...) # Parses as XML

I would prefer not to make a breaking change to such a core concern, but maybe the explicitness of parse_html and parse_xml is worth it? Thoughts?
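To make the trade-off concrete, here is a minimal sketch of both API shapes side by side. ParseSketch and its tagged-tuple return values are hypothetical stand-ins used only for illustration; this is not how Meeseeks implements parsing.

```elixir
defmodule ParseSketch do
  # Hypothetical stand-in: returns {type, source} instead of a real document.

  # Option 1 (non-breaking): parse/1 defaults to HTML, parse/2 takes the type.
  def parse(source, type \\ :html)
  def parse(source, :html) when is_binary(source), do: {:html, source}
  def parse(source, :xml) when is_binary(source), do: {:xml, source}

  # Option 2 (explicit names): thin wrappers around the same dispatch.
  def parse_html(source), do: parse(source, :html)
  def parse_xml(source), do: parse(source, :xml)
end

IO.inspect(ParseSketch.parse("<p>hi</p>"))        # parsed as HTML by default
IO.inspect(ParseSketch.parse("<Item/>", :xml))    # parsed as XML on request
IO.inspect(ParseSketch.parse_xml("<Item/>"))      # same result, explicit name
```

With a default argument, existing callers of parse/1 keep working unchanged, which is the main appeal of the first option.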

Meeseeks.all and Meeseeks.one

Currently Meeseeks.all and Meeseeks.one accept a string as a source and then attempt to parse that string as HTML.

My plan is to keep this behavior as is because it provides simple usage for the most common use case (selecting from HTML).

I can also see arguments for disallowing string input and requiring the input to be parsed first (so that parsing is not hidden under the hood), or for defaulting to parsing as HTML while allowing some key in the context to indicate the document type, but again I would prefer not to make breaking changes. Does anybody have a strong feeling on the subject?
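As a rough illustration of the behavior I plan to keep, the sketch below shows a selection function that implicitly parses string sources as HTML, so XML has to be parsed explicitly beforehand. SelectSketch, its Doc struct, and the tuple it returns are all hypothetical; the real Meeseeks document and selection internals look different.

```elixir
defmodule SelectSketch do
  # Hypothetical document struct that only remembers how the source was parsed.
  defmodule Doc do
    defstruct [:type, :source]
  end

  # Stand-in for parsing: wraps the source together with its document type.
  def parse(source, type \\ :html) when is_binary(source) and type in [:html, :xml] do
    %Doc{type: type, source: source}
  end

  # A string source is parsed as HTML under the hood (the common case).
  def all(source, selector) when is_binary(source), do: all(parse(source), selector)

  # An already-parsed document is used as is, whatever its type.
  def all(%Doc{} = doc, selector), do: {:select, doc.type, selector}
end

# Strings are treated as HTML.
IO.inspect(SelectSketch.all("<p>hi</p>", "p"))

# XML must be parsed explicitly before selecting.
doc = SelectSketch.parse("<Item/>", :xml)
IO.inspect(SelectSketch.all(doc, "Item"))
```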

mischov commented 7 years ago

With regard to the questions posed above, I added Meeseeks.parse/2 and continued to parse strings as HTML in the selection functions.

yanshiyason commented 4 years ago

I was running into this issue! I am parsing an XML file with PascalCase properties, but they were being returned in lowercase, and certain tags like

<Image>
   <Small>...</Small>
   <Medium>...</Medium>
</Image>

were only selectable by doing

# works
Meeseeks.one(doc, "//small")
Meeseeks.one(doc, "//medium")

# does not work
Meeseeks.one(doc, "//image/small")
Meeseeks.one(doc, "//image/medium")

I will try again with the parse/2 function, specifying :xml.

mischov commented 4 years ago

@yanshiyason I am seeing that the XML parser appears to fix the lowercase issue, but not the issue with the XPath (/Image/Small works for me, but not //Image/Small).

If that is what you are seeing as well, please open a new issue for visibility.

yanshiyason commented 4 years ago

I just tried it out and everything is working as expected so far!! Thank you @mischov!