monitorjbl / excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI
Apache License 2.0
943 stars 342 forks source link

Alternate for sheet.getRow(0) #230

Closed saranshbansal closed 3 years ago

saranshbansal commented 3 years ago

sheet.getRow(0) is unsupported. Can you please provide an alternative to read header row?

It's absolutely necessary in order to fix the issues with EMPTY cells/rows not being read by default.

monitorjbl commented 3 years ago

A sheet is effectively a row iterator, so reading arbitrary rows by index cannot be supported. This library reads rows in-order as they are defined in the XML, it cannot jump around to different rows. If you need that ability, you should use vanilla POI.

The same is true of "empty" cells and rows not being read; if they are empty, they are not actually in the XML and so there is nothing to be read. The row and cell numbers should be available on all entries though, so it's relatively trivial to detect when there are gaps between entries. Those gaps are the empty rows/cells you're describing.

saranshbansal commented 3 years ago

By that logic, isn't a Row just a Cell iterator? Then why does row.getCell(0) works but former doesn't?

saranshbansal commented 3 years ago

Anyway, I worked out something:

try (Workbook wb = StreamingReader.builder()
                                   .rowCacheSize(DEFAULT_ROW_CACHE_SIZE)
                                   .bufferSize(DEFAULT_BUFFER_SIZE)
                                   .open(inputStream)) {
            Sheet firstSheet = wb.getSheetAt(0);
            int noOfColumns = 0;
            int rowIndex = 0;

            for (Row nextRow : firstSheet) {
                if (rowIndex == 0) {
                    noOfColumns = nextRow.getLastCellNum();
                }

                for (int i = 0; i < noOfColumns; i++) {
                    Cell cell = nextRow.getCell(i);
                    sb.append(cell != null ? cell.getStringCellValue() : EMPTY).append(delimiter);
                }
                sb.append(DEFAULT_LINE_END);

                ++rowIndex;
            }
        }
monitorjbl commented 3 years ago

By that logic, isn't a Row just a Cell iterator? Then why does row.getCell(0) works but former doesn't?

Excellent question. A row is read in its entirety, a sheet is not. The XML document is structured like so:

sheet
  └ row
     └ cell
  └ row
     └ cell
  └ row
     └ cell

For performance reasons, this library will read ahead a configurable number of rows and cache them for your code to iterate over. Not reading the entire row would require that only one row is processed at a time because we can't skip ahead anymore.

saranshbansal commented 3 years ago

Okay fair enough. The solution is working fine without any performance drop so closing the ticket.