Open unhammer opened 6 years ago
Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?
https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
I'd be interested to know the speed of it.
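A minimal sketch of that approach (assuming the mmap package's mmapFileByteString and xeno's Xeno.SAX.validate; the file name comes from argv):

import qualified System.IO.MMap as MM
import qualified Xeno.SAX as Xeno
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing  -- map the whole file read-only
  print (Xeno.validate bs)                  -- walk the whole document without building a DOM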
By the way, the sax package is a streaming parser implemented using xeno (EDIT: wrong, see below). So that might be an option if you want to stream.
Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?
https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
Nice, I didn't know about that.
In our case, the huge xml files are zipped, so we can't mmap without first wasting some disk space, but I suppose that's a possibility. We've been using https://hackage.haskell.org/package/zip-1.0.0/docs/Codec-Archive-Zip.html#v:sourceEntry with xml-conduit so far, but it's slow.
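Roughly the kind of pipeline described above (a rough reconstruction, not the actual code; file and entry names are invented, and it assumes zip's sourceEntry together with xml-conduit's parseBytes):

import Codec.Archive.Zip (withArchive, mkEntrySelector, sourceEntry)
import Data.Conduit ((.|))
import qualified Data.Conduit.List as CL
import Text.XML.Stream.Parse (def, parseBytes)
import System.Environment (getArgs)

main :: IO ()
main = do
  [zipPath, entryName] <- getArgs
  sel <- mkEntrySelector entryName
  -- stream the decompressed entry straight into xml-conduit and count the events
  n <- withArchive zipPath $
         sourceEntry sel (parseBytes def .| CL.fold (\acc _ -> acc + 1 :: Int) 0)
  print n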
A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file, where mawk '/</{print}' takes 1.5s and a stripped-down version of our xml-conduit parser (that only prints elements) takes about 4 minutes. So this looks promising :-)
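For reference, a self-contained version of that comparison might look roughly like this (a sketch, not the thread's actual code, assuming the six-callback Xeno.SAX.process used elsewhere in this thread and reading the file strictly):

import qualified Data.ByteString as BS
import qualified Xeno.SAX as Xeno
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  bs <- BS.readFile path
  let c _ = return ()
  -- print every opening tag name; ignore attributes, text, close tags and CDATA
  Xeno.process print (\_ _ -> return ()) c c c c bs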
SAX seems to only stream on the output side – but maybe it would help with avoiding space leaks? I notice the "count elements" example from the README using fold at least gobbled memory.
A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file,
Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.
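Concretely, the swap could look like this (same sketch as above, just printing the ByteString directly; the surrounding file handling is an assumption):

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as S8
import qualified Xeno.SAX as Xeno
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  bs <- BS.readFile path
  let c _ = return ()
  -- S8.putStrLn writes the bytes directly, with no ByteString -> String round-trip
  Xeno.process S8.putStrLn (\_ _ -> return ()) c c c c bs

Replacing S8.putStrLn with c as well gives the all-no-op baseline mentioned above.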
a stripped-down version of our xml-conduit parser (that only prints elements) takes about 4 minutes.
Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much of the time the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.
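A sketch of that measurement (reconstructed for illustration; the argument layout is made up, using zip's sourceEntry with Data.Conduit.List.sinkNull):

import Codec.Archive.Zip (withArchive, mkEntrySelector, sourceEntry)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)

main :: IO ()
main = do
  [zipPath, entryName] <- getArgs
  sel <- mkEntrySelector entryName
  -- decompress the entry and throw the bytes away
  withArchive zipPath (sourceEntry sel CL.sinkNull)

Timing that (e.g. with /usr/bin/time) would give the pure decompression cost to set against the full xml-conduit pipeline.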
SAX seems to only stream on the output side
That's a good point, it does indeed only stream on the output side.
Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.
So S8.putStrLn takes it down to 6s, and (\_ -> return ()) down to 4s. I didn't even try the no-op at first, since I'm used to GHC completely optimising such things away, giving useless benchmarks =P but 4s seems like it's doing what we mean here.
a stripped-down version of our xml-conduit parser (that only prints elements) takes about 4 minutes.
Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much of the time the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.
Actually, it's faster than zcat, under 2s! Even if I insert CL.mapM (liftIO . S8.putStr) .| in there. I'm going to start using that instead of zcat on the CLI for these single-file zip files …
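For reference, a sketch of what that zcat-style pipeline could look like (reconstructed; file and entry names are invented):

import Codec.Archive.Zip (withArchive, mkEntrySelector, sourceEntry)
import Control.Monad.IO.Class (liftIO)
import Data.Conduit ((.|))
import qualified Data.ByteString.Char8 as S8
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)

main :: IO ()
main = do
  [zipPath, entryName] <- getArgs
  sel <- mkEntrySelector entryName
  -- print each decompressed chunk to stdout; CL.sinkNull just consumes the mapM output
  withArchive zipPath $
    sourceEntry sel (CL.mapM (liftIO . S8.putStr) .| CL.sinkNull)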
This might be related: https://github.com/snoyberg/conduit/issues/370
Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.
(EDIT: If I upgrade conduit to 1.3.1, I do see https://github.com/snoyberg/conduit/issues/370 – but we've been on 1.2, where there is reasonable memory usage.)
Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.
One thing to check would be the GC productivity and allocation figures. The CPU usage might well be high because the program spends too much time cleaning up after itself.
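For example (standard GHC RTS machinery, nothing specific to this thread): build the parser with -rtsopts and run it as ./parse big.xml +RTS -s; the RTS summary then shows total bytes allocated, time spent in GC, and the productivity percentage, which separates parsing work from garbage-collection work.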
@chrisdone should we add memory-mapped file access?
@ocramz I think that's already available from https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
Even with mmap-ing, I can't get this to have constant memory usage.
-rw-rw-r-- 1 unhammer unhammer 15M okt. 18 11:01 big.xml
-rw-rw-r-- 1 unhammer unhammer 337 okt. 18 11:03 small.xml
If I do

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ Data.ByteString.length bs

I get constant memory usage; /usr/bin/time gives ~51980maxresident)k for both big.xml and small.xml.
If I do

noop :: ByteString -> Either XenoException Integer
noop = Xeno.fold const (\m _ _ -> m) const const const const 0

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ noop bs

then small.xml gives 52080maxresident)k while big.xml gives 4286264maxresident)k.
I've tried building xeno both with and without -f-no-full-laziness.
Tested on ghc 8.6.5, xeno 82be21b3273f09e5e31e19418eb15adf3f6d3c09
@unhammer How big are your files that do not fit in memory? Can we have a look at them? We parsed multigigabyte files fine, but maybe that is not what you need.
see gmail @mgajda
For the time being I recommend using bytestring-mmap to map the entire ByteString, since the memory is foreign and can be swapped out without compromising garbage collection time.
If you have a compressed file, I would recommend uncompressing it on disk (in /tmp) and consuming it from there with mmap.
If there are any performance issues, they would be due to garbage collection (GC) time; you may decrease it by processing the nodes inside the XML in a streaming fashion.
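A minimal sketch of that suggestion (assuming bytestring-mmap's System.IO.Posix.MMap.unsafeMMapFile and a file that has already been uncompressed, e.g. into /tmp):

import qualified System.IO.Posix.MMap as MMap  -- from bytestring-mmap
import qualified Xeno.SAX as Xeno
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs               -- e.g. an XML file unzipped into /tmp beforehand
  bs <- MMap.unsafeMMapFile path  -- the mapped pages live outside the GHC heap
  print (Xeno.validate bs)        -- walk the whole document without building a DOM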
From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and sending in the parts, there's a possibility of getting errors like
Left (XenoParseError "Couldn't find the matching quote character.")
I'd rather not use lazy ByteStrings because reasons, though that would probably be the simplest solution.
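To make that failure mode concrete, a toy illustration (made-up input; same fold as in the noop snippet earlier):

import qualified Data.ByteString.Char8 as S8
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  let whole       = S8.pack "<a href=\"http://example.com\">x</a>"
      (chunk1, _) = S8.splitAt 12 whole  -- the split lands inside the quoted attribute
  print (Xeno.fold const (\m _ _ -> m) const const const const (0 :: Int) whole)
  -- parsing only the first chunk should come back as a Left parse error
  -- along the lines of the one quoted above
  print (Xeno.fold const (\m _ _ -> m) const const const const (0 :: Int) chunk1)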
An alternative would be to add an eofF :: (Int -> m ()) to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow simply sending in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index – making that return a Maybe seems like it would hurt performance. Or is it possible to make indexEndOfLastMatch an argument of XenoException's constructors, and catch the throws without losing the monad return value?
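To make the proposal concrete, a purely hypothetical sketch of what such a signature might look like (processChunk and eofF are names invented for this comment, nothing like this exists in xeno today; it follows the six-callback process used earlier in the thread):

import Data.ByteString (ByteString)

processChunk
  :: Monad m
  => (ByteString -> m ())                -- open tag
  -> (ByteString -> ByteString -> m ())  -- attribute name/value
  -> (ByteString -> m ())                -- end of open tag
  -> (ByteString -> m ())                -- text
  -> (ByteString -> m ())                -- close tag
  -> (ByteString -> m ())                -- CDATA
  -> (Int -> m ())                       -- eofF: index just past the last fully
                                         --   processed element, called instead of throwing
  -> ByteString                          -- current chunk
  -> m ()
processChunk = undefined  -- sketch only; the caller would keep the unconsumed suffix
                          -- and prepend it to the next chunk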