Incorrect handling of multi-byte UTF-8 sequences split across iks_parse() calls

GoogleCodeExporter commented 9 years ago

When a multi-byte UTF-8 sequence in an attribute name or value is split
across two or more iks_parse() invocations, the resulting DOM tree does not
retain the complete sequence; it contains invalid UTF-8 instead.

To observe the problem:
1. Declare an XML string constant that contains UTF-8 text in an attribute
value, say:  <a b="田"/>
2. Pass the XML string to iks_parse() one byte at a time.
3. Examine the DOM tree.  The value of attribute "b" contains an invalid
UTF-8 sequence.
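
Step 2 can be sketched roughly as below. This is only an illustration of the feeding pattern, not code from iksemel: feed_in_chunks, chunk_fn, chunk_stats and count_chunk are hypothetical names. In an actual reproduction the callback would forward each chunk to iks_parse() instead of just counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: receives one chunk, returns 0 on success. */
typedef int (*chunk_fn)(void *ctx, const char *data, size_t len);

/* Feed xml[0..len) to fn in pieces of at most `chunk` bytes.
 * Returns 0 on success, or the first non-zero value returned by fn. */
int feed_in_chunks(const char *xml, size_t len, size_t chunk,
                   chunk_fn fn, void *ctx)
{
    size_t off = 0;

    assert(xml != NULL && fn != NULL && chunk > 0);
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        int rc = fn(ctx, xml + off, n);
        if (rc != 0)
            return rc;
        off += n;
    }
    return 0;
}

/* Example callback (stand-in for the parser): counts calls and bytes. */
struct chunk_stats {
    size_t calls;
    size_t bytes;
};

int count_chunk(void *ctx, const char *data, size_t len)
{
    struct chunk_stats *s = ctx;

    (void)data; /* a real callback would hand data/len to the parser */
    s->calls++;
    s->bytes += len;
    return 0;
}
```

With a chunk size of 1, the three bytes that encode "田" (0xE7 0x94 0xB0) arrive in three separate calls, which is the situation described above.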

The expected value of attribute "b" is "田".

This bug is present in trunk, and happens regardless of the OS.

The attached patch fixes the problem by detecting orphaned UTF-8 sequences
and pushing them onto the stack.
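
For readers without the attachment, the general idea of detecting a sequence that is cut off at the end of a chunk can be sketched as follows. This is not the code from the patch; utf8_incomplete_tail is a hypothetical helper, and it only inspects the tail of the chunk (it does not validate the whole buffer).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes at the end of buf[0..len) that belong to a
 * UTF-8 sequence whose remaining bytes have not arrived yet. Returns 0 if
 * the chunk ends on a character boundary, or if the tail cannot be
 * classified (for example, it holds only continuation bytes). */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back, need;

    assert(buf != NULL || len == 0);

    /* Walk backwards over continuation bytes (10xxxxxx) to the lead byte. */
    for (back = 1; back <= len && back <= 4; back++) {
        unsigned char c = buf[len - back];

        if ((c & 0xC0) == 0x80)
            continue;                  /* continuation byte, keep looking */

        if ((c & 0x80) == 0x00)
            need = 1;                  /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;                  /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0)
            need = 3;                  /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0)
            need = 4;                  /* 11110xxx */
        else
            return 0;                  /* invalid lead byte */

        /* Incomplete if fewer bytes are present than the lead byte needs. */
        return (back < need) ? back : 0;
    }
    return 0;
}
```

A caller could hold back the reported number of bytes and prepend them to the next chunk before handing the data on, so that no multi-byte character is ever split.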

Original issue reported on code.google.com by timothy....@gmail.com on 31 May 2010 at 6:07

Attachments:

iksemel-parse-utf8-chunks.patch

GoogleCodeExporter commented 9 years ago

Replaced the patch with a more generic version that handles chunks of
different lengths.

Original comment by timothy....@gmail.com on 1 Jun 2010 at 1:43

Attachments:

iksemel-parse-utf8-chunks.patch

GoogleCodeExporter commented 9 years ago

The patch does not fix my problem.
I applied it to sax.c, in the function sax_core (iksparser *prs, char *buf,
int len).
I compiled with MinGW on Windows Server 2008 and called the library from
VS2008 like this:

int errNo = -1;
const char *buff = "<iq type='田'></iq>"; /* or <iq type=\"田\"></iq> */
iks *tagParse = iks_tree(buff, strlen(buff), &errNo);
if (errNo == IKS_OK)
{
    /* the value returned for "type" is the one that comes back wrong */
    char *strTest = iks_find_attrib(tagParse, "type");
}

Original comment by weijiayi...@gmail.com on 5 Nov 2011 at 7:30

plotters / iksemel #24