pjfanning / excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI
Apache License 2.0
113 stars 26 forks source link

How to read XLSX from file and not wait it's entire download to process it #103

Open oviniciuslara opened 2 years ago

oviniciuslara commented 2 years ago

Hey, I have a question. How can we process one XLSX file stored remotely, in this example as URL, without waiting the code to entire download it first? For large files it takes a long time.

Example:

InputStream is = new URL("https://filebin.net/qe5ynsl7ikmzap8a/LINEITEM_6M.xlsx").openStream();
Workbook workbook = StreamingReader
        .builder()
        .rowCacheSize(100)
        .bufferSize(4096) 
        .open(is); 
for (Sheet sheet : workbook) {
    System.out.println(sheet.getSheetName());
    for (Row r : sheet) {
        for (Cell c : r) {
            System.out.println(c.getStringCellValue());
        }
    }
}
pjfanning commented 2 years ago

isn't this the same as https://github.com/pjfanning/excel-streaming-reader/issues/38 ?

pjfanning commented 2 years ago

One suggestion would be to try https://github.com/dhatim/fastexcel. excel-streaming-reader uses Apache POI code to read the initial zip file (xlsx files are basically zip files) and this POI code reads the full file before making the inner data available to parse. fastexcel may use a different approach that might make the sheet data available earlier.

The way that excel file is split across multiple files inside the zip probably makes it pretty unlikely that most excel files can be read without buffering the whole file. Sheet XML references shared string data in a separate sharedStrings.xml file and the style data is in another file.

One scenario that might be fully streamable would be an xlsx that was not password protected and where sharedStrings.xml is not used. POI's SXSSFWorkbook can produce a file in this format but that is just one way to produce xlsx and other approaches would not be fully streamable. A variant of the existing excel-streaming-reader code could probably be produced to handle this case but I'm not volunteering to do it.

If you truly want to parse the data in a fully streaming way, I would suggest using a different data format - CSV, XML, JSON - these formats are much more streamable than xlsx.

oviniciuslara commented 2 years ago

Thanks for the quick response. Unfortunately the file format is not possible to be changed. Do you have any idea how can I change the library to work accordingly?

pjfanning commented 2 years ago

If you read my comment above - in particular, the paragraph beginning with 'One scenario that might be fully streamable'. Only in this scenario, this algorithm will work.

I would like to reiterate that Excel is a really bad format for streaming large data over the internet - the format was not designed for this use case.

oviniciuslara commented 2 years ago

In my use case I only need to read the first 1000 rows of the sheet and I don't need the exact values of the cells (just the non formulated content), this would ease things, right?

pjfanning commented 2 years ago

In my use case I only need to read the first 1000 rows of the sheet and I don't need the exact values of the cells (just the non formulated content), this would ease things, right?

no - excel usually puts all the strings in a different file from the sheet data - this is not streaming friendly - if the xlsx only has numbers, then you might be ok

oviniciuslara commented 2 years ago

I see, well I will have to look a little deeper. I will try to post the solution here.

In the meantime, do you have any clue how to do this same stream fashion with XLS files? Or know if it’s even possible?

pjfanning commented 2 years ago

xls is not streamable at all - Microsoft did not write their formats with your use case in mind - there are much much better data formats out there