monitorjbl / excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI
Apache License 2.0
936 stars 343 forks source link

Unable to read large excel files #266

Open fivetran-satvikpatil opened 8 months ago

fivetran-satvikpatil commented 8 months ago

Hi guys, I am unable to read large Excel files.

I am using the below code to fetch the workbook:

StreamingReader.builder()
                    .rowCacheSize(100) 
                    .bufferSize(4096) 
                    .open(inputStreamSupplier.get());

It is failing with below error :

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
    at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
    at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:130)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:185)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:210)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:194)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:168)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:149)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:113)
    at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91)
    at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:251)

The memory assigned is 8GB, but it is still failing with this error. By observing the stack trace and also the heap dump, I noticed that we are creating a byte array with the total size of around 2.1GB. How can we fix this issue?

pjfanning commented 8 months ago

This project is unmaintained as far as I can see. I have a fork and have some docs about some POI settings that can further reduce memory usage. https://github.com/pjfanning/excel-streaming-reader#reading-very-large-excel-files

There is also https://github.com/dhatim/fastexcel

mcv commented 6 months ago

Is your version sufficiently maintained that I should be using yours instead of forking this project myself? (I've got my own fork where I fixed a bunch of issues.)

And if not, is it possible to take over the maintenance of this project? I need this for a project, and I might be in a position to take over if necessary.

pjfanning commented 6 months ago

@mcv have you looked at my fork? https://github.com/pjfanning/excel-streaming-reader

mcv commented 6 months ago

Looking at it now. You've got a lot more work in it than I have. Does this mean your fork is meant to be the official branch now? I'll switch over. That does make my life a lot easier.

pjfanning commented 6 months ago

Looking at it now. You've got a lot more work in it than I have. Does this mean your fork is meant to be the official branch now? I'll switch over. That does make my life a lot easier.

It's not the official fork but I am not aware of any forks that have as many changes as mine. PRs on my fork are welcome.

mcv commented 6 months ago

Unfortunately I can't use your version because of dependency conflicts. I'll stick with my own for now. Hopefully I can switch to yours later.