the-real-blackh / hexpat

A general purpose Haskell XML library using Expat to do its parsing
BSD 3-Clause "New" or "Revised" License

Support for all characters #6

Open PierreVDL opened 7 years ago

PierreVDL commented 7 years ago

According to the XML standard (e.g. http://www.w3.org/TR/xml/#sec-references), any character, including non-printable ones, can be referenced using &#\d+; or &#x\h+;. However, hexpat seems to ignore these characters.

For example, the HUnit test case

import Test.HUnit
import Text.XML.Expat.Tree
import Data.ByteString.Lazy (pack)
import Data.ByteString.Internal (c2w)

testSingleEscapedTextNode :: Test
testSingleEscapedTextNode = TestCase $
    let nodeName = "singleNode"
        nodeText = "a text with escaped characters &amp; &lt; &gt; &#x12; &#x07; &#x1B; is not correctly handled"
        doc = "<" ++ nodeName ++ ">" ++ nodeText ++ "</" ++ nodeName ++ ">"
        (xml, _mErr) = parse defaultParseOptions (pack $ map c2w doc) :: (UNode String, Maybe XMLParseError)
    in assertEqual "Single Node" xml (Element nodeName [] [Text nodeText])  -- parse result is passed first, so HUnit labels it "expected"

yields as output

### Failure in: 0:Library Tests:1:Text.XML.Expat.Tree:3:Single Escaped Text Node
Single Node
expected: Element "singleNode" [] [Text "a text with escaped ",Text "characters ",Text "&",Text " ",Text "<",Text " ",Text ">",Text " "]
 but got: Element "singleNode" [] [Text "a text with escaped characters &amp; &lt; &gt; &#x12; &#x07; &#x1B; is not correctly handled"]

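The parsed list (labelled "expected" above, because the parse result is passed as the first argument to assertEqual) is wrong: it contains no elements after the first &#x\h+; reference.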

Note: I know there is also an error in my test: it assumes a single Text element rather than a list of Text elements, but that is irrelevant to this problem!
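
The test error mentioned in the note (one Text element versus a list of Text elements) could be sidestepped by merging adjacent Text nodes before comparing. The following is only a minimal sketch: the `Node` type is a local, hypothetical stand-in that mirrors the Element/Text shape seen in the output above, not hexpat's own type, and `mergeText` is an illustrative helper name.

```haskell
-- Hypothetical stand-in for a parsed tree; it only mirrors the
-- Element/Text shape shown in the test output, not hexpat's real type.
data Node
    = Element String [(String, String)] [Node]
    | Text String
    deriving (Eq, Show)

-- | Merge runs of adjacent Text children into one Text node, recursively.
mergeText :: Node -> Node
mergeText (Text t) = Text t
mergeText (Element name attrs children) =
    Element name attrs (go (map mergeText children))
  where
    -- Two neighbouring Text nodes collapse into one; retry on the result.
    go (Text a : Text b : rest) = go (Text (a ++ b) : rest)
    go (x : rest)               = x : go rest
    go []                       = []
```

Applying such a helper to both sides of the comparison would make the test independent of how the parser splits text, while still exposing the truncation after the first hexadecimal reference.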

PierreVDL commented 7 years ago

See also https://en.wikipedia.org/wiki/Valid_characters_in_XML. These characters are valid in XML 1.1.

hartwork commented 3 years ago

Please note that the underlying parser implements XML 1.0 fourth edition, not XML 1.1 or XML 1.0 fifth edition. Expat ticket https://github.com/libexpat/libexpat/issues/171 may be of interest.
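
To illustrate the difference between the two specifications, here is a small sketch of the `Char` productions of XML 1.0 and XML 1.1 as plain predicates. The function names are made up for this example and are not part of hexpat or Expat; the ranges are taken from the W3C recommendations. Under these ranges, the three code points used in the test above (#x12, #x07, #x1B) fall outside XML 1.0 but inside XML 1.1.

```haskell
import Data.Char (ord)

-- | Char production of XML 1.0:
--   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
isXml10Char :: Char -> Bool
isXml10Char c =
    n == 0x9 || n == 0xA || n == 0xD
        || (n >= 0x20 && n <= 0xD7FF)
        || (n >= 0xE000 && n <= 0xFFFD)
        || (n >= 0x10000 && n <= 0x10FFFF)
  where
    n = ord c

-- | Char production of XML 1.1:
--   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
isXml11Char :: Char -> Bool
isXml11Char c =
    (n >= 0x1 && n <= 0xD7FF)
        || (n >= 0xE000 && n <= 0xFFFD)
        || (n >= 0x10000 && n <= 0x10FFFF)
  where
    n = ord c

-- The three code points referenced in the failing test.
testCodePoints :: [Char]
testCodePoints = ['\x12', '\x07', '\x1B']
```

For instance, `map isXml10Char testCodePoints` and `map isXml11Char testCodePoints` can be evaluated in GHCi to compare the two productions.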