ocramz / xeno

Fast Haskell XML parser

Use on files too large to fit in memory #18

Open unhammer opened 6 years ago

unhammer commented 6 years ago

From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and sending in the parts, there's a possibility of getting errors like

Left (XenoParseError "Couldn't find the matching quote character.")
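
For example, a minimal repro sketch (using Xeno.DOM.parse so the error is printable as an Either; the document and split point are made up, any split landing inside a quoted attribute value should do):

{-# LANGUAGE OverloadedStrings #-}
-- Minimal repro sketch: splitting the input inside a quoted attribute
-- value makes the parse fail.
import qualified Data.ByteString as BS
import Xeno.DOM (parse)

main :: IO ()
main = do
  let whole = "<r><e key=\"value\">text</e></r>"
      (firstHalf, _rest) = BS.splitAt 14 whole  -- cut in the middle of "value"
  print (parse firstHalf)
  -- expected: Left (XenoParseError ...), e.g. the unmatched-quote error above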

I'd rather not use lazy ByteStrings (for reasons), though that would probably be the simplest solution.

An alternative would be to add an eofF :: (Int -> m ()) callback to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow simply sending in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index – making that return a Maybe seems like it would hurt performance.
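
To sketch the stitching idea (processChunk below is hypothetical: it stands in for a process variant whose eofF reports the last fully processed index instead of throwing):

-- Hypothetical sketch of the chunk-stitching idea; processChunk does not
-- exist in xeno and stands in for a process with an eofF callback.
import qualified Data.ByteString as BS
import Data.IORef (newIORef, readIORef, writeIORef)

processChunk :: (Int -> IO ()) -> BS.ByteString -> IO ()
processChunk eofF bs = eofF (BS.length bs)  -- stub: pretend everything parsed

feed :: IO BS.ByteString -> IO ()
feed getChunk = go BS.empty
  where
    go leftover = do
      chunk <- getChunk
      if BS.null chunk
        then pure ()  -- end of input
        else do
          let buf = leftover <> chunk
          doneRef <- newIORef 0
          processChunk (writeIORef doneRef) buf
          done <- readIORef doneRef
          -- carry the unparsed tail over into the next chunk
          go (BS.drop done buf)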

Or is it possible to make indexEndOfLastMatch an argument of XenoException, and catch the throws without losing the monad's return value?

chrisdone commented 6 years ago

Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?

https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString

chrisdone commented 6 years ago

I'd be interested to know the speed of it.

chrisdone commented 6 years ago

By the way, the sax package is a streaming parser implemented using xeno, so that might be an option if you want to stream. (Edit: wrong, see below.)

unhammer commented 6 years ago

Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?

https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString

Nice, I didn't know about that.

In our case, the huge XML files are zipped, so we can't mmap without first wasting some disk space, but I suppose that's a possibility. We've been using https://hackage.haskell.org/package/zip-1.0.0/docs/Codec-Archive-Zip.html#v:sourceEntry with xml-conduit so far, but it's slow.

unhammer commented 6 years ago

A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file, where mawk '/</{print}' takes 1.5s and a stripped-down version of our xml-conduit parser (modified to only print elements) takes about 4 minutes. So this looks promising :-)
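
For reference, this is roughly the harness (a sketch assuming the six-callback Xeno.SAX.process signature of the xeno version at the time):

-- Rough sketch of the benchmark: print every opening tag, no-op for
-- attributes and all other events.
import qualified Data.ByteString as BS
import System.Environment (getArgs)
import Xeno.SAX (process)

main :: IO ()
main = do
  [path] <- getArgs
  bs <- BS.readFile path
  process print (\_ _ -> return ()) c c c c bs
  where
    c = \_ -> return ()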

unhammer commented 6 years ago

SAX seems to only stream on the output side – but maybe it would help with avoiding space leaks? I notice the "count elements" example from the README (the one using fold) gobbled memory, at least.

chrisdone commented 6 years ago

A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file,

Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

a stripped down version of our xml-conduit parser (to only prints elements) takes about 4 minutes.

Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious how much of the time the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.

SAX seems to only stream on the output side

That's a good point, it does indeed only stream on the output side.

unhammer commented 6 years ago

Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

So S8.putStrLn takes it down to 6s, and (\_ -> return ()) down to 4s. I didn't even try a no-op at first, since I'm used to GHC completely optimising such things away, giving useless benchmarks =P but 4s seems like it's doing what we mean here.

a stripped down version of our xml-conduit parser (to only prints elements) takes about 4 minutes.

Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious how much of the time the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.

Actually, it's faster than zcat – under 2s! Even if I insert CL.mapM (liftIO . S8.putStr) .| in there. I'm going to start using that instead of zcat on the CLI for these single-file zip files …
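
For the record, the unzip-only test was essentially this (a sketch against zip-1.0.0's Codec.Archive.Zip; archive path and entry name taken from the command line):

-- Stream a zip entry through the decompressor and discard the bytes,
-- measuring pure unzip cost without any XML parsing.
import Codec.Archive.Zip (mkEntrySelector, sourceEntry, withArchive)
import Conduit (sinkNull)
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive, entryName] <- getArgs
  entry <- mkEntrySelector entryName
  withArchive archive (sourceEntry entry sinkNull)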

chrisdone commented 6 years ago

This might be related: https://github.com/snoyberg/conduit/issues/370

unhammer commented 6 years ago

Hm, I haven't noticed high memory usage with xml-conduit here, just very high CPU usage.

(EDIT: If I upgrade conduit to 1.3.1, I do see https://github.com/snoyberg/conduit/issues/370 – but we've been on 1.2, where there is reasonable memory usage.)

zudov commented 6 years ago

Hm, I haven't noticed high memory usage with xml-conduit here, just very high CPU usage.

One thing to check would be the GC productivity and allocation figures. The CPU usage might well be high because the program spends too much time cleaning up after itself.
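
For example, assuming the benchmark binary was linked with -rtsopts, GHC's RTS can print allocation totals and productivity directly:

./xeno-bench big.xml +RTS -s

(xeno-bench is a stand-in name for whatever the test executable is called; the "Productivity" line at the bottom of the -s output is the one to look at.)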

ocramz commented 6 years ago

@chrisdone should we add memory-mapped file access?

chrisdone commented 6 years ago

@ocramz I think that's already available from https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString

unhammer commented 5 years ago

Even with mmap-ing, I can't get this to have constant memory usage.

-rw-rw-r-- 1 unhammer unhammer 15M okt.  18 11:01 big.xml
-rw-rw-r-- 1 unhammer unhammer 337 okt.  18 11:03 small.xml

If I do

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ Data.ByteString.length bs

I get constant memory usage; /usr/bin/time gives ~51980maxresidentk for both big.xml and small.xml.

If I do

-- fold over every SAX event, returning the accumulator unchanged
noop :: ByteString -> Either XenoException Integer
noop = Xeno.fold const (\m _ _ -> m) const const const const 0

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ noop bs

then small.xml gives 52080maxresidentk while big.xml gives 4286264maxresidentk.

[three attached screenshots]

I've tried building xeno both with and without -f-no-full-laziness. Tested on GHC 8.6.5, xeno commit 82be21b3273f09e5e31e19418eb15adf3f6d3c09.

mgajda commented 3 years ago

@unhammer How big are your files that do not fit in memory? Can we have a look at them? We parsed multigigabyte files fine, but maybe that is not what you need.

unhammer commented 3 years ago

see gmail @mgajda

mgajda commented 3 years ago

For the time being I recommend using bytestring-mmap to map the entire file as a ByteString: since that memory is foreign, it can be swapped out without increasing garbage collection time. If you have a compressed file, I would recommend uncompressing it on disk (in /tmp) and consuming it from there with mmap.

If there are any performance issues after that, they would be due to garbage collection (GC) time; you can decrease it by processing the nodes inside the XML in a streaming fashion.
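
A minimal sketch of that recommendation (assuming bytestring-mmap's System.IO.Posix.MMap; the /tmp path and the tag-counting fold are illustrative, and the file is assumed to be uncompressed already):

-- mmap the file, then fold over SAX events; the page cache backs the
-- ByteString, so the input itself never has to fit in the Haskell heap.
import System.IO.Posix.MMap (unsafeMMapFile)
import Xeno.SAX (fold)

main :: IO ()
main = do
  bs <- unsafeMMapFile "/tmp/big.xml"
  -- count opening tags as a stand-in for real streaming processing
  print (fold (\n _ -> n + 1 :: Int) (\n _ _ -> n) const const const const 0 bs)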