mischov closed this issue 7 years ago
Regarding the questions posed above, I added Meeseeks.parse/2 and continued to parse strings as HTML in the selection functions.
I was running into this issue! I am parsing an XML file with PascalCase properties, but they were being returned in lowercase, and certain tags like
<Image>
<Small>...</Small>
<Medium>...</Medium>
</Image>
were only selectable by doing
# works
Meeseeks.one(doc, "//small")
Meeseeks.one(doc, "//medium")
# does not work
Meeseeks.one(doc, "//image/small")
Meeseeks.one(doc, "//image/medium")
I will try again with the parse/2 function specifying xml.
@yanshiyason I am seeing that the XML parser appears to fix the lowercase issue, but not the issue with the XPath (/Image/Small works for me, but not //Image/Small).
If that is what you are seeing as well please open a new issue for visibility.
I just tried it out and everything is working as expected so far!! Thank you @mischov!
There are people out there trying to use Meeseeks or Floki with XML files, and that can lead to some confusing results (particularly in the case of meeseeks_html5ever and html5ever_elixir, which are HTML5 spec compliant).
Rather than have people reach for the wrong tool because it's the one readily at hand, I'd prefer to provide an XML parser.
Note: This problem is unrelated to issue #11, which involves HTML documents, not XML documents.
Solution
The html5ever project also has a permissive XML parser, xml5ever, and it should not be too complicated for meeseeks_html5ever to expose functions for parsing XML.
Meeseeks.parse
The least intrusive solution is to have Meeseeks.parse/1 default to parsing HTML and to provide a Meeseeks.parse/2 that additionally takes a keyword specifying how to parse.
Otherwise I would need to make a breaking change and add Meeseeks.parse_html/1 and Meeseeks.parse_xml/1 in the place of Meeseeks.parse/1.
I would prefer not making a breaking change on such a core concern, but maybe the explicitness of parse_html and parse_xml is worth it? Thoughts?
Meeseeks.all and Meeseeks.one
Currently Meeseeks.all and Meeseeks.one accept a string as a source and then attempt to parse that string as HTML.
My plan is to keep this behavior as is because it provides simple usage for the most common use case (selecting from HTML).
I can alternatively see arguments for disallowing string input and forcing the input to be parsed explicitly (because that doesn't hide parsing under the hood), or for defaulting to parsing as HTML but allowing some key in the context to indicate the document type, but again I would prefer not to make breaking changes. Anybody have a strong feeling on the subject?
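For illustration, here is a rough sketch of how the non-breaking parse/1 plus parse/2 option could look from the caller's side. It makes several assumptions that are not settled in this thread: that the second argument is the atom :xml, that selectors are built with the xpath macro from Meeseeks.XPath, and that the Hex version shown is suitable. ParseSketch and the sample file names are hypothetical.

```elixir
# Assumption: meeseeks is fetched from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

defmodule ParseSketch do
  # Hypothetical helper module, used only for this example.
  import Meeseeks.XPath

  # parse/1 keeps its current meaning: the string is parsed as HTML.
  def parse_default(source), do: Meeseeks.parse(source)

  # parse/2 names the parser explicitly (assumed here to be the atom :xml).
  def parse_as_xml(source), do: Meeseeks.parse(source, :xml)

  # Select a PascalCase element from an already parsed XML document.
  # Returns {tag, text}, or nil when nothing matches.
  def small(doc) do
    case Meeseeks.one(doc, xpath("/Image/Small")) do
      nil -> nil
      result -> {Meeseeks.tag(result), Meeseeks.text(result)}
    end
  end
end

xml = """
<Image>
  <Small>small.png</Small>
  <Medium>medium.png</Medium>
</Image>
"""

xml_doc = ParseSketch.parse_as_xml(xml)
IO.inspect(ParseSketch.small(xml_doc), label: "selected from XML")
```

Parsing up front and passing the document to the selection functions keeps the choice of parser visible at the call site, while passing a raw string to Meeseeks.all or Meeseeks.one would continue to mean "parse as HTML".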