mbleron / ExcelTable

ExcelTable is an Oracle SQL interface to read Microsoft Excel files (.xlsx, .xlsm, .xlsb, .xls, .xml), ODF spreadsheet files (.ods) and flat files as external tables
MIT License
45 stars 12 forks source link

Added getSheets function #21

Closed gpaulissen closed 4 years ago

gpaulissen commented 4 years ago

Hi,

First of all thank you for this functionality that I have used to create a generic Apex load file application. What I missed was a getSheets function so that the user can choose the sheet to load. I have added that function (as well as some documention enhancements). You can see it in the CHANGELOG section (5.x.y). There is quite some duplication of code in getSheets (a copy from getRawCells) but I did not want to change too much.

Kind regards,

Gert-Jan Paulissen

mbleron commented 4 years ago

Hi Gert-Jan,

Thanks for this contribution. However I don't understand why you've based getSheets on getRawCells. There's no point in fetching cells in this case, just read the sheet list directly from the cache. Or Am I missing something in your requirement?

gpaulissen commented 4 years ago

Hi,

Well, actually I did not take a very deep look into the code and found getRwaCells a good candidate. But I will try to improve upon it.

gpaulissen commented 4 years ago

Hi,

I have simplified the getSheets() code and using the samples spreadsheets I get the results I want.

Except for the big files (bigfile.xls and sample_2.xlsx) since there I get: ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT

Here an excerpt:

select from table(exceltable.getsheets(ExcelTable.getFile('SAMPLES_DIR','bigfile.xlsx'))) ERROR at line 1: ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT ORA-06512: at "TOOLS.EXCELTABLE", line 5150 ORA-06512: at "XDB.DBMS_XMLDOM", line 5114 ORA-06512: at "XDB.DBMS_XMLDOM", line 5132 ORA-06512: at "TOOLS.EXCELTABLE", line 2609 ORA-06512: at "TOOLS.EXCELTABLE", line 2655 ORA-06512: at "TOOLS.EXCELTABLE", line 2749 ORA-06512: at "TOOLS.EXCELTABLE", line 3157 ORA-06512: at "TOOLS.EXCELTABLE", line 5137 ORA-06512: at line 1

mbleron commented 4 years ago

I have simplified the getSheets() code and using the samples spreadsheets I get the results I want.

I've just checked the new version, indeed that's more along the lines I was thinking about.

After posting my first comment, I though that maybe you wanted to get only sheets actually having data, and then getRawCells would make sense as a starting point. It would require a way to limit the amount of cells returned in the process though, for example by introducing a sampling parameter. Maybe I'll implement that later...

Regarding the PGA memory error, the usual solution is to switch to STREAM_READ method, instead of DOM_READ, but that would require the Java components, which are still optional. Anyway that would just be a workaround as far as getSheets function is concerned. The long term solution is to defer building DOM objects until they're really needed, i.e. just before requesting rows, or even better, only access the relevant part(s) of the spreadsheet document if we're in a "getSheets" context. Let me explore that last option.

gpaulissen commented 4 years ago

Yeah, I am only interested in the sheets having data and the getRawCells function seemed most adequate but I confess I did not have the time to wade through the code which is not so easy to follow at first sight. Not an offense but I know these files are complex so I took the easy way out.

Those big files, we'll see. I am happy for the time being. I know now that switching to STREAM_READ will work. I will add that parameter to the getSheets function.

gpaulissen commented 4 years ago

I have changed the default method to read to STREAM_READ and a fallback to DOM_READ in getSheets(). That works perfectly if Java is installed and fails on the big (XML) files as before. For me, I am satisfied.