monitorjbl / excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI
Apache License 2.0

really slow when reading large xlsx file(50000 row) #104

Open janlely opened 7 years ago

janlely commented 7 years ago

It takes several minutes to read a >50,000-row xlsx file using this code:

    workbook = StreamingReader.builder()
            .rowCacheSize(100)
            .bufferSize(4096)
            .open(is);

This is even slower than workbook = WorkbookFactory.create(is).

monitorjbl commented 7 years ago

So, this library isn't guaranteed to be faster in all cases. If your server can handle the memory load of using vanilla POI, it might make sense to use it instead. What this library is meant to do is reduce memory usage, which can also increase speed depending on your use case. If your heap is on the smaller side or if you have a lot of clients, then your speed will improve because GC pauses won't be slowing you down.

That being said, this library is usually about as fast as POI. If you're seeing extreme slowdowns, I'd check your disk speeds to see if you've got a bottleneck there. This library substitutes disk for memory, so if your disk is slow you're going to have a problem reading.
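To check for a disk bottleneck, a quick sanity check of sequential write throughput can look like this (an illustrative command, not part of this library; the path and sizes are just examples):

```shell
# Write 64 MB to the temp directory and report throughput. If this is slow,
# the file-backed buffering this library does will be slow too.
dd if=/dev/zero of=/tmp/disktest bs=1M count=64 conv=fdatasync
rm /tmp/disktest
```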

edgarvonk commented 6 years ago

Hi,

We also struggle with reading in very large files. In our case they can be up to 1GB and can contain up to 1 million rows in the sheet we want to stream. By far the most time is spent in the .open(File file) method as noted above.

Looking at what is happening in a profiler, if I am not mistaken, most of the time seems to be spent in FileBackedList#writeToFile. There is also a huge amount of garbage collector activity going on.

We currently use the following settings:

StreamingReader.builder()
            .rowCacheSize(1000)
            .bufferSize(102400)
            .sstCacheSize(1485760)
            .open(file);

Any thoughts on how we could improve the performance of opening such huge files? We have tried various other settings too, including much smaller buffer and cache sizes, but have seen no major differences so far.

PS: we have to set the SST cache size because otherwise we run out of heap for these files (with -Xmx1G).

Example of a thread dump while opening such a file:


"SimpleAsyncTaskExecutor-1" prio=5 tid=161 RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
       Local Variable: java.nio.DirectByteBuffer#10
       Local Variable: java.io.FileDescriptor#25
       Local Variable: sun.nio.ch.FileDispatcherImpl#1
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
       Local Variable: java.lang.Object#28289
    at com.monitorjbl.xlsx.sst.FileBackedList.writeToFile(FileBackedList.java:89)
       Local Variable: java.nio.HeapByteBuffer#25
       Local Variable: java.nio.HeapByteBuffer#26
       Local Variable: sun.nio.ch.FileChannelImpl#1
    at com.monitorjbl.xlsx.sst.FileBackedList.add(FileBackedList.java:61)
       Local Variable: com.monitorjbl.xlsx.sst.CTRstImpl#2
       Local Variable: com.monitorjbl.xlsx.sst.FileBackedList#1
    at com.monitorjbl.xlsx.sst.BufferedStringsTable.readFrom(BufferedStringsTable.java:42)
       Local Variable: com.sun.xml.internal.stream.events.StartElementEvent#1
       Local Variable: com.sun.xml.internal.stream.XMLEventReaderImpl#1
       Local Variable: org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream#2
    at com.monitorjbl.xlsx.sst.BufferedStringsTable.<init>(BufferedStringsTable.java:30)
       Local Variable: com.monitorjbl.xlsx.sst.BufferedStringsTable#1
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackagePart#1
    at com.monitorjbl.xlsx.sst.BufferedStringsTable.getSharedStringsTable(BufferedStringsTable.java:25)
       Local Variable: java.util.ArrayList#12024
       Local Variable: java.io.File#83
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackage#1
    at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:118)
       Local Variable: org.apache.poi.xssf.eventusermodel.XSSFReader#1
    at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:278)
       Local Variable: com.monitorjbl.xlsx.impl.StreamingWorkbookReader#1
       Local Variable: com.monitorjbl.xlsx.StreamingReader$Builder#1

monitorjbl commented 6 years ago

Vanilla POI is going to be faster because it buffers almost everything into memory, so if performance is the main goal you should not be using this library. The intent of this library is to reduce the memory footprint of reading spreadsheets, and it does so by trading memory-backed storage for file-backed storage.

The SST cache option avoids loading the Shared Strings Table entirely into memory (the default behavior) and instead buffers it to a file with a variable in-memory cache size to keep frequently-read data "hot" in memory. With a cache size that big, you're just putting the entire thing in memory.

edgarvonk commented 6 years ago

Hi @monitorjbl, thanks for the quick reply!

We started out with vanilla POI, but the files that we need to process are just way too big to fit into memory, so that's why we switched to your library. Yes, indeed, we set the SST cache size very large at the moment, but it's definitely not holding the whole thing yet. I have to admit that we are mostly testing with generated test files full of random data, and hence an extremely high cardinality that will probably never happen in real life. I will switch to some more realistic test files soon.

The good part is that with the streaming Excel reader we no longer run out of heap space, but yes, the performance is not optimal. Maybe we should think about allocating more memory to our process to increase speed. It's a trade-off in the end.

edgarvonk commented 6 years ago

Forgot to mention: our use case is a customer-facing (Spring Boot) application where users can upload (potentially very large) Excel files which we have to store, read, and parse in real time, storing certain cell values (but not all by a long shot) that we find in the files. Because we have to be able to deal with concurrent users, we cannot keep the Excel files in memory even when they are not huge, and therefore vanilla Apache POI is not an option.

As stated, for large files with lots of high-cardinality string fields, creating the (file-system) SST seems to be by far our biggest performance bottleneck at the moment. In our use case I think the SST is not going to help us much, if at all (since we only need to parse the document once, row by row, and then throw it away). Is there any way to disable the SST altogether, perhaps by extending some classes of the streaming Excel reader lib, @monitorjbl?

monitorjbl commented 6 years ago

Unfortunately, no. The SST is where non-inline strings actually reside, so they have to be loaded in some form or another. High-cardinality sheets are going to be problematic there because the SST only stores each unique string once. You could simply not read from the SST, but then you'd have no way of knowing what data you were missing.

The SST implementation is basically an array, and the XML of your sheet specifies indexes into that array. The SST cache implementation in this library is a file-backed list I wrote that allows the library to avoid buffering the entire SST into memory. It's definitely not perfect, so if you can pinpoint specific problems with it, perhaps they can be addressed.
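For illustration, the core idea of a file-backed list with an in-memory "hot" cache can be sketched like this (a simplified stand-in of my own, not the library's actual FileBackedList; the class name, file layout, and LRU policy are all assumptions for the sketch):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only. Strings are appended to a scratch file, only
// their offsets stay on the heap, and a small LRU map keeps recently-read
// entries "hot" in memory.
public class FileBackedStringList {
    private final RandomAccessFile file;
    private final List<long[]> offsets = new ArrayList<>(); // {position, byteLength}
    private final Map<Integer, String> cache;

    public FileBackedStringList(File backing, final int cacheSize) {
        try {
            this.file = new RandomAccessFile(backing, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Access-ordered LinkedHashMap evicts the least-recently-used
        // entry once the cache grows past cacheSize.
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > cacheSize;
            }
        };
    }

    public void add(String value) {
        try {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long pos = file.length();
            file.seek(pos);
            file.write(bytes);
            offsets.add(new long[]{pos, bytes.length});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String get(int index) {
        String cached = cache.get(index);
        if (cached != null) {
            return cached; // hot entry, no disk read
        }
        try {
            long[] entry = offsets.get(index);
            byte[] bytes = new byte[(int) entry[1]];
            file.seek(entry[0]);
            file.readFully(bytes);
            String value = new String(bytes, StandardCharsets.UTF_8);
            cache.put(index, value);
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reads that hit the cache skip the disk entirely; everything else costs a seek and a read, which is the trade-off made to keep the heap small.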

boboetuk commented 6 years ago

Hi @edgarvonk, have you tried using an in-memory drive (or an SSD) to store the temporary files?
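On Linux, one way to try that is to point the JVM's temp directory at a tmpfs (RAM-backed) mount. These commands are only a sketch: the mount point, size, and jar name are placeholders, and mounting requires root.

```shell
# Create a RAM-backed filesystem and direct the JVM's temp files
# (where the file-backed SST ends up) to it.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk

# Point java.io.tmpdir at the tmpfs mount when starting the app.
java -Djava.io.tmpdir=/mnt/ramdisk -jar your-app.jar
```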

monitorjbl commented 6 years ago

Couple of potential areas that might be causing issues:

edgarvonk commented 6 years ago

Thanks so much, guys! When we find the time we will look into possible improvements here. From what I have seen so far in our tests, the bottleneck lies mostly in writing the SST to the file system in the first place, not reading from it. But that is probably because we are not yet using string cells in our use case. This will change soon, however...

edgarvonk commented 6 years ago

Another area where we could improve performance is the fact that we need to parse each (potentially very large) Excel file three times. This is because of the required flow in our application, where we cannot extract the data that we need from the Excel file until we have done some other things with the file.

Every time we (re)open the file, the SST is of course created again. So if we could find a way to cache the SST for a certain amount of time, that would improve the performance of our flow quite a bit too.

monitorjbl commented 6 years ago

The SST is tied to a specific workbook, and externalizing the cache would make it possible to leak memory and files. The feature you're asking for would need to be an API change that allows sheet reading to be reset, much like InputStream does. Then you could simply keep the Workbook object around in your application in some way between reads.

I think this could probably be done, but it would need to be a separate feature request and would likely take some time to verify that everything works properly.

GregTVan commented 6 years ago

I am using your utility with great success, thank you. A general question about vanilla POI (without your utility): does it make sense to you that when executing 'new XSSFWorkbook(x)', where x is about 1 MB of data, Java returns 'Native memory allocation (malloc) failed to allocate 136863744 bytes'? I don't understand why such a small file is causing problems, and even more than that, why POI is apparently trying to allocate over 100x more memory. Cheers and thanks again.