Parsing xml fragment should be supported

GoogleCodeExporter commented 9 years ago

What feature do you require in pugixml?

I need pugixml to be able to read an XML fragment. For example, I want to be
able to parse the following XML:

"I'm text from a bigger xml<source>MyBigFile.xml</source>. I am well formed xml 
<reallyIam /> but not well formed document."

In C# I would accomplish this using XmlReader with 
System.Xml.XmlReaderSettings.ConformanceLevel set to ConformanceLevel.Fragment.

What do you need this feature for?
I am parsing xml fragments from an already indexed xml document and want to 
parse only a small part of it.

Original issue reported on code.google.com by philly.d...@gmail.com on 8 Feb 2013 at 8:45

GoogleCodeExporter commented 9 years ago

Can you confirm that replacing the condition in this block (pugixml.cpp, line 
2612 in trunk):

    if (cursor->parent)
    {
        PUGI__PUSHNODE(node_pcdata); // Append a new node on the tree.
        cursor->value = s; // Save the offset.

        s = strconv_pcdata(s);

        PUGI__POPNODE(); // Pop since this is a standalone.

        if (!*s) break;
    }

With "if (true)" gives you the desired behavior? I.e. that the only thing 
required for fragment parsing that pugixml does not do right now is preserving 
PCDATA nodes that don't have a parent node.

Original comment by arseny.k...@gmail.com on 10 Feb 2013 at 7:54

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Yep, looks like that's it.

Original comment by philly.d...@gmail.com on 11 Feb 2013 at 9:50

GoogleCodeExporter commented 9 years ago

One last thing. Since this is likely to be called frequently and on small 
fragments, it would be better for performance if fragment parsing had its own 
method that bypasses encoding detection and conversion (assuming the native 
pugixml encoding) but still skips the BOM.
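
To make the "skip BOM but do no other encoding work" idea concrete, here is a 
minimal sketch. It assumes the native encoding is UTF-8 (pugixml built without 
wchar_t mode), and utf8_bom_length is a hypothetical helper written for this 
illustration, not a pugixml function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of pugixml): returns how many leading bytes
// of the buffer form a UTF-8 byte order mark, i.e. 3 or 0.
inline std::size_t utf8_bom_length(const char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    // Compare as unsigned bytes; plain char may be signed.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);

    // The UTF-8 BOM is the byte sequence EF BB BF.
    return (size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) ? 3 : 0;
}
```

A caller would advance the pointer by the returned length and shrink the size 
by the same amount before handing the data to the parser.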

Original comment by philly.d...@gmail.com on 14 Feb 2013 at 1:59

GoogleCodeExporter commented 9 years ago

I looked into making this change in pugixml with a special parsing option flag, 
and unfortunately there are some issues.

pugixml has a major assumption that the last character of the XML buffer is 
never required by the user. This is needed to allow in-place parsing of a non 
null-terminated buffer (see load_buffer) but to return data using 
null-terminated strings. Obviously, if you have a 4-byte "text" buffer, it's 
impossible to create a document that will return null-terminated "text" string 
without extra string copies.

This assumption holds for all well-formed XML documents, but does not hold for 
document fragments.
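
The constraint can be illustrated with a small sketch. This is not pugixml's 
actual code; terminate_in_place is a hypothetical helper that only shows why a 
string can be returned from the original buffer without a copy when some byte 
follows it, and why a fragment that ends in text has no such byte.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of pugixml): turns the byte range [begin, end)
// of a mutable buffer into a C string without copying. It needs one byte after
// the range to overwrite with '\0'; it returns nullptr when there is none.
inline const char* terminate_in_place(char* buffer, std::size_t size,
                                      std::size_t begin, std::size_t end)
{
    assert(buffer != nullptr);

    if (begin > end || end >= size) return nullptr; // no spare byte for '\0'

    buffer[end] = '\0'; // overwrites whatever delimiter followed the text
    return buffer + begin;
}
```

In a well-formed document, text is always followed by at least a '<', so the 
spare byte exists. For the 4-byte buffer "text" the range ends exactly at the 
end of the buffer and the helper has nowhere to write the terminator.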

The only way to implement this, then, is to essentially make sure that if 
parsing works in fragment mode, the buffer being parsed is zero-terminated. 
This means that all APIs will have to do a string copy for fragment parsing 
(currently load_buffer_inplace avoids the copy when possible), making fragment 
parsing more expensive than full document parsing.

Original comment by arseny.k...@gmail.com on 21 Aug 2013 at 6:38

GoogleCodeExporter commented 9 years ago

In the scenario I had in mind, the input is a null terminated string. Basically 
what I need this for is to parse a pre-indexed fragment of a very large xml 
document. The fragment is guaranteed not to end inside a tag. Ex input: "My 
sentence.<bold> My bold sentence.</bold> My other sentence."

Original comment by philly.d...@gmail.com on 22 Aug 2013 at 1:26

GoogleCodeExporter commented 9 years ago

This is now fixed as of r980 (phew). The parse_fragment flag enables 
fragment-mode parsing.

The only caveat is that if you use load_buffer_inplace or 
load_buffer_inplace_own, you have to provide a null-terminated buffer (i.e. for 
"test" invoke it on a buffer that has a null terminator and pass the size 5) 
for all contents to be preserved.

If you don't use load_buffer_inplace_* then pugixml figures everything out 
internally.

Original comment by arseny.k...@gmail.com on 11 Feb 2014 at 6:51

Changed state: Fixed
