leethomason / tinyxml2

TinyXML2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
zlib License
5.12k stars 1.84k forks source link

TinyXML2 uses seek() - which causes read source restrictions #514

Open antofthy opened 7 years ago

antofthy commented 7 years ago

Tinyxml uses of seek() when reading the XML code. This places restrictions on the source of the XML data.

You can not used named pipes, or network streams, as a source for the XML data.

I use application (for more than 7 years now) that recently switched to using "tinyxml2" (encfs). I use a named pipe to feed the XML data from a database into that application, and that has worked great. Now it no longer working. (https://github.com/vgough/encfs/issues/253)

The reasons I used a named pipe was many fold... It is more secure (data is only available once, and NOT written to a disk in the clear), faster (data is in a memory buffer not disk), and more parallel (data can be processsed and written to the named pipe buffer while application is initialling).

However "tinyxml2" uses seeks, and relies on the XML being saved on disk. From the code https://github.com/leethomason/tinyxml2/blob/3.0.0/tinyxml2.cpp#L1832 it just reads the XML into its own memory, in a single pass! It is only using seek() to pre-allocate the memory buffer for the XML data, and for additional security checks. Both of these can be done without needing the use of seek().

The only valid reason I can see for the use of seek() is to either: allow multi-pass reading of a file; or to update the file (rewind or make a specific change). Neither is required by "tinyxml2".

I strongly recommend you re-think your use of seek(), and its 'disk based files' restriction, in your library.

Dmitry-Me commented 7 years ago

What would be a better option then?

antofthy commented 7 years ago

The normal practice is to read the file with a read limit based on the initial buffer size (commonly 4kb, something like or getpagesize(), to start with). If EOF is not reached, expand that buffer size, and continue reading to the new limit. I am sure there are thousands of examples you can follow.

I found quite a few just with a goggle "read unknown size", and gives a lot of pointers about it. The first StackOverflow, Reading text file of unknown size is a great start. It even mentions the 'fseek()' problem in the example present that used the same method as you. But it is worth looking at as many examples and discussions to get the feel for the pitfalls.

antofthy commented 7 years ago

This code (after adding commented changes) seems to do what you want, and does it in large blocks. It uses a structure of a single buffer and file length, but you can change that easilly.

Stack Exchange Code Review, Reading a file into memory

Also watch the problem of input data being binary (accidentally or maliciously on purpose) and thus may have null characters, or no newline/return characters in the input at all!

antofthy commented 7 years ago

Alternative -- auto-allocate and read the while file stream, using getdelim() with a null char as terminator. It is basically the same as getline() but can be used to slurp whole text file streams (not binary). From... StackOverflow, Easiest way to get file's contents in C

   FILE   *file = stdin;
   char   *data = NULL;
   size_t  size = 0;
   int     nchars;

   nchars = getdelim(&data, &size, 0, file);
   assert( nchars >= 0 );    /* read error */
leethomason commented 7 years ago

To have a general streaming system, TinyXML-2 would only want to read the XML file. That way it could be part of a data stream where non-XML data proceeded and followed the XML. To do that single pass in a DOM parser that holds onto pointers in memory (all of which TinyXML-2 is/does) is pretty tricky.

TinyXML-1 could do it because it copied everything and just took a performance hit.

In your particular case, you should probably just copy the stream to a memory buffer and call TinyXML-2 with that. A general solution is probably beyond what TinyXML-2 will do. An intermediate solution (look for a deliminator and copy that to a memory buffer) seems like logic that should be in the host app, not TinyXML-2, since it will work for some applications and not others.

antofthy commented 7 years ago

You miss understand. I don't use TinyXML2 directly. I use an application that switched to using it.

For years I passed a named pipe to the application (but it could have been a '/dev/fd pipeline too), for it to read the XML data from. The application read data from a given filename (which is normally a file, but didn't have to be a actual file). The Application switched to using TinyXML2 and thanks to its use of fseek() the application can no longer read from named pipes (or fd device pipes). Now it needs a actual file stored on a actual disk, which is not where the XML is stored, and thus requiting saving the data to a tempory file (and less secure) file for TinyXML2 to read from.

I am also talking with the application developer, who has referred me to you.

Yes it would be nice for it to be completely 'stream' compatible, in that it stops reading when the End-of-Data is reached in the XML stream. However, that requires a rewrite of the parser to parse and understand the data as it is read in, so it can identify EOD as apposed to EOF. Just getting rid of the use of 'fseek()' when reading from a file (plain, device, or pipeline) would open up more possibilities.

leethomason commented 7 years ago

The temporary file solution seems like a good one, although I don't understand your security landscape.

There really isn't a good workaround in TinyXML-2. It's chicken and egg: TinyXML-2 can't allocate enough buffer without parsing the XML which it can't do without a buffer for the XML. And, as noted, the stream solution (while good for your use case) comes with a lot of performance and memory compromises in a DOM parser.

One can imagine some fancy chunk memory reading to get around that, but...tricky. A simplified parser just for the purpose of finding the end of the XML is an option, but that's a lot of duplicated code. Or a special mode that doesn't store any information about the XMLNodes (so has no pointer into the memory buffer) is also an option. But none of those strike me as a straightforward tweak to the current architecture.

antofthy commented 7 years ago

Look man page for getdelim() with a NULL delimiter (see above) it will do all the allocation work for you

leethomason commented 7 years ago

The allocation isn't the issue, or the delimiter. It's knowing how much to read. How long is the XML? There's no way to know that without doing some kind of parsing of the XML. I guess in your particular case it addresses it by reading to the end without seek(), but that's not really the desired general behavior.

ajtruckle commented 7 years ago

I don't fully understand all this discussion but tiny xml 2 has two methods for opening a file. The first takes an actual file. The second takes a stream. Can't that be used?

Also, if you understand what is required can't you consider making the changes and proposing them to the author? A lot of work has freely been put in to this voluntarily for us and not many of us say thank you. If we had to ask for our requests to be programmed it would not come cheap.

I am grateful for what has been provided but this is as open source project. So if you have a good knowledge to rectify an issue please consider an implementation and put it forward.

Sometimes it just a matter of time and life gets in the way. Family, health, work. And to do big changes just might not be the option for some at the moment.

I don't mind if this comment is removed. It is not meant as negative. But I felt opportunity to express thanks as a user and remind all of etiquette in our approach about issues for what has been made freely available. ?

Have a great day. I hope your issue gets resolved amicably.

Sent from my iPhone

On 6 Jan 2017, at 01:01, Anthony Thyssen notifications@github.com<mailto:notifications@github.com> wrote:

Look man page for getdelim() with a NULL delimiter (see above) it will do all the allocation work for you

- You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/leethomason/tinyxml2/issues/514#issuecomment-270803281, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHnYsyiDhOljHZUrV1WkWu_up-Cm7a6kks5rPZJzgaJpZM4LaL4m.

leethomason commented 7 years ago

@ajtruckle It's not the API signature that's the trouble, it's the way the stream (FILE) in interpreted. The general interpretation of the feature request from @antofthy makes a lot of sense: the FILE version should treat it as a stream, not specifically a FILE*. That's just hard to implement practically. @antofthy proposed a narrower solution which doesn't fit well. (Although a possibility would be to pass in a file length - then the calling program could specify how much stream should be consumed. I'd have to think about that, but @antofthy would it fix the issue?)

But the 2 main points: 1) I don't think anyone is upset (or certainly hope not), and 2) I am the main developer / author.

antofthy commented 7 years ago

@leethomason wrote:The allocation isn't the issue, or the delimiter. It's knowing how much to read. @ajtruckle wrote: pass in a file length

You miss understand... You don't need to know how big the file is before hand. The getdelim() will allocate enough memory to hold the file up to the delimiter (NULL) or EOF for you. No pre-allocation needed!

RTFM...

getdelim() works like getline(), except that a line delimiter other than newline can be specified.

AND BEFORE THAT...

if lineptr is set to NULL and n is set 0 before the call, then getline() will allocate a buffer for storing the line. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating lineptr and n as necessary.

In otherwords getdelim() works like the other methods I pointed out at the beginning of this thread, re-allocating more memory as needed. The only difference is it not only stops on EOF but also has a 'delimiter'. For text files (like XML) using a delimited of NULL has no bearing.

No one is upset. Just trying to improve things, by removing a limitation that I came across.

leethomason commented 7 years ago

Let me see if I can summarize:

Assuming that's correct, then getdelim() is theoretically a replacement, with a couple of practical issues:

antofthy commented 7 years ago

I took a scan of the machines I have access to...

It is on fedora, and Redhat 5 to 7 machines. As pretty well all linux based machines use GNU they should all have access to it. Actually it is part of glibc so all GCC compiled programs should have access to it.

Oracle website lists the function for Solaris 11 machines (confirmed on a actual box) https://docs.oracle.com/cd/E53394_01/html/E54766/getdelim-3c.html

The man page for it is present on MacOSX, though it is written slightly differently (describing getdelim() before getline()), it describes the exact same functionality.

Found in NetBSD 6 and 7 libc manpages

IBM AIX 6.1 has it too. http://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.basetrf1/getline.htm

Example implementations of the function is available on https://gist.github.com/ingramj/1105106 https://github.com/bwhite/hadoopy/blob/master/hadoopy/getdelim.c

It looks like part of the POSIX 2008 standard, which is a good indication that it is widely available.