monitorjbl / excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI
Apache License 2.0
946 stars 343 forks source link

StreamingReader cannot read all data from one sheet #217

Open yuzhangyi opened 4 years ago

yuzhangyi commented 4 years ago

Hi, @monitorjbl , when I try to use StreamingReader to read data from one worksheet, it only return the data in cached rows (for example, 100 rows of data, but my worksheet has 100,000,000 rows, do I miss any configuration? could you help give some guidance? thanks!

override def openWorkbook(): Workbook = { fileExtension match { case "xlsx" => { StreamingReader.builder() .rowCacheSize(100) .bufferSize(4096)
.open(inputStreamProvider) } case "xls" => { WorkbookFactory.create(inputStreamProvider) } case _ => failure("EXCEL-EXE-09", s"Invalid file is provided with format $fileExtension while only .xlsx and .xls are supported.").get } }

The other case is that, it throw exception directly: 20/07/21 13:22:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) com.monitorjbl.xlsx.exceptions.ParseException: Error reading XML stream at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:76) at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32) at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:220) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:1002) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)

brunotot commented 4 years ago

Same thing happened to me too. Had about 200k rows and StreamingReader wasn't able to load all the data. Instead, it just loaded the amount of rows which was declared in rowCacheSize method. My workaround was to just set rowCacheSize parameter higher than the number of rows the excel file had.