jmdavis / dxml

An XML parsing library written in D.
Boost Software License 1.0

Parsing XML without loading the whole file into RAM #15

Closed: FreeSlave closed this issue 1 year ago

FreeSlave commented 6 years ago

Is it possible to do this with this library? Maybe you can provide examples that show memory-efficient usage of the DOM and StAX parsers.

jmdavis commented 6 years ago

Memory management of what's being parsed is left up to the range being parsed and is not the concern of dxml. The parser will operate on any forward range of char, wchar, or dchar. I thought that the documentation was clear about that. As such, if you have a forward range of characters over a file which does not read in the entire file at once, you can parse the file without reading it all into memory (though obviously, any parts you keep around will then stay in memory).
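
For illustration, here's a minimal sketch of driving the StAX parser. It uses a string, but any forward range of char/wchar/dchar (e.g. a lazily-read range over a file) is used the same way:

```d
import dxml.parser;
import std.stdio : writeln;

void main()
{
    // A string is the simplest forward range of char; a lazily-read
    // forward range over a file could be substituted here.
    auto xml = "<root><item>hello</item></root>";

    // parseXML returns a lazy range of entities (StAX-style).
    foreach (entity; parseXML(xml))
    {
        if (entity.type == EntityType.elementStart)
            writeln("start: ", entity.name);
        else if (entity.type == EntityType.text)
            writeln("text: ", entity.text);
    }
}
```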

However, if you're going to use parseDOM, then any portion of the document that it parses is going to result in memory allocations in order to build the DOM regardless of the underlying range being parsed. That's going to be true of any DOM parser since the whole point of a DOM parser is to build the document tree in memory.
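
As a minimal sketch of what that means: even for a tiny document, parseDOM allocates the node structs for the whole tree it parses, though names and text can still be slices of the input.

```d
import dxml.dom;

void main()
{
    // parseDOM builds the entire tree in memory; names and text are
    // slices of the original input where possible, but the nodes
    // themselves are always allocated.
    auto dom = parseDOM("<root><item>hello</item></root>");
    auto root = dom.children[0];
    assert(root.name == "root");
    assert(root.children[0].children[0].text == "hello");
}
```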

FreeSlave commented 6 years ago

I see, but do you have any preferred solution? Phobos does not seem to provide any means to represent a file as a forward range without loading the whole file. The DOM of course needs to allocate some structs that represent the tree, but it still uses slices of the original range to hold the stored data. The point is to minimize allocations, not eliminate them.

Update: I've found that MmFile can remap portions of the file on demand when the window argument is given. It still needs a little trickery to make it a forward range, but it might work.
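
Something along these lines, perhaps (an untested sketch assuming UTF-8 content; the window size must be a multiple of the memory allocation granularity, and copies made by save share the same MmFile, so the window just gets remapped on demand):

```d
import std.mmfile : MmFile;

// An untested sketch of a forward range of chars over a memory-mapped
// file. MmFile is a class, so ranges returned by save() share the
// mapping; opIndex remaps the window as needed.
struct MmFileRange
{
    private MmFile file;
    private ulong pos;
    private ulong len;

    this(string path, size_t window = 64 * 1024)
    {
        // window must be a multiple of the memory allocation granularity.
        file = new MmFile(path, MmFile.Mode.read, 0, null, window);
        len = file.length;
    }

    @property bool empty() const { return pos >= len; }
    @property char front() { return cast(char) file[pos]; }
    void popFront() { ++pos; }
    @property MmFileRange save() { return this; } // copies share `file`
}
```

Handing such a range to the parser should keep only one window of the file mapped at a time, at the cost of remapping when saved copies read from positions far apart.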

jmdavis commented 6 years ago

As dxml uses the standard range mechanism, it bypasses the issue in the sense that it doesn't provide the range that reads in the file efficiently; it assumes that such a range already exists.

Unfortunately, reading from a file efficiently is something of the Achilles heel of ranges: every time you call save, the saved range then needs to remain valid for as long as it exists, so in order to buffer file access, you could need an arbitrarily large number of buffers, and it gets complicated. It's something that needs to be solved, but Phobos has largely ignored the issue (probably because it's complicated and no one really wants to take the time to write it). Phobos does have things like std.stdio's byLine and byChunk to read lines or chunks efficiently, but that translates only awkwardly into a range of bytes or characters, because what you're really getting is a range of lines or chunks (and since those functions reuse their buffers, it gets even more complicated). Properly buffering chunks of a file and reference-counting them so that the range API's save works correctly gets complicated fast. In a lot of cases, simply reading the file in pieces rather than as a forward range, or reading it all in at once, avoids the whole issue (though that obviously isn't always an option).
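
To make that concrete, here's a sketch of why byChunk doesn't translate directly into something dxml can consume (the filename is just for illustration):

```d
import std.stdio : File;
import std.algorithm.iteration : joiner;
import std.range.primitives : isForwardRange;

void main()
{
    // byChunk reuses its buffer and is only an input range; joiner
    // flattens it into a range of ubyte, but the result is still just
    // an input range (and of ubyte rather than char), so dxml's parser,
    // which needs a forward range of characters, can't consume it.
    auto bytes = File("doc.xml").byChunk(4096).joiner;
    static assert(!isForwardRange!(typeof(bytes)));
}
```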

Personally, if I had to read a file in as a forward range and couldn't read it all in at once, I'd probably just use std.mmfile rather than trying to deal with buffering everything, since that gets really complicated. There's always Steven's https://github.com/schveiguy/iopipe, but it's still a work in progress, and as I understand it, he's had to work around the range API on some level precisely because it's so poorly suited to reading in a file efficiently, so I don't know exactly how it will end up interacting with the range API. I'm aware of iopipe but still need to spend time studying it.

I wrote dxml the way I did so that it could work with a range that reads over a file without pulling it all into memory, but without trying to actually solve that problem itself. By operating purely on ranges, it pushes that entire problem off to the range implementation. That doesn't make the problem go away, but it does mean that as long as the problem is solved for ranges in general, it's solved for dxml.

JesseKPhillips commented 5 years ago

Yeah, and byLine/splitLines aren't good for parsers because they lose vital line-ending information. For XML, CDATA sections are most likely where that becomes a problem.

I did a range over mmapped files a while back: https://github.com/JesseKPhillips/libosm/blob/master/source/util/filerange.d

ghost91- commented 1 year ago

I've had good experiences using std.mmfile together with dxml. I wrote a tool that processes a full Wikipedia export (70 GB) this way, and it worked quite well.

https://github.com/ghost91-/wikipedia-indexer/tree/master