Chokes on large (30-35mb) XLS files, any way to fix?

nicholas-ochoa commented 6 years ago

Using 2.3, I'm able to load large XLSX files without issue (albeit they do take some time), and also able to load smaller (10-15mb) XLS files also without issue. Is there any way to optimize for larger XLS files to enable loading? Time is not an issue (I'm trying to automate a monthly data load process), but I haven't seen the load complete even overnight. It doesn't error out, so not sure what is wrong. I don't have a sample XLS file that I can attach (it contains client data), unfortunately.

mbleron commented 6 years ago

Hi Nicholas, What's the database version please? Are you using the DML API?

I'll try to reproduce the performance issue but I need additional info about the general "shape" of the file :

how many rows are there?
how many columns?
how many numeric vs. string columns?
is there any large string column (> 4000 bytes)?

Thanks.

mbleron commented 6 years ago

I made a small enhancement to XUTL_XLS package (v2.3.1). I suggest compiling XUTL_XLS in native mode too : alter package xutl_xls compile plsql_code_type=native plsql_optimize_level=3;

On my laptop running Oracle 12.1.0.2, I can load this file (78.4 MB, 65535 rows) in about a minute, using :

declare
  ctx    ExcelTable.DMLContext;
  nrows  pls_integer;
begin
  ctx := ExcelTable.createDMLContext('TEST_XLS');
  ExcelTable.mapColumn(ctx, 'ID', 'A');
  ExcelTable.mapColumn(ctx, 'NUM1', 'B');
  ExcelTable.mapColumn(ctx, 'NUM2', 'C');
  ExcelTable.mapColumn(ctx, 'NUM3', 'D');
  ExcelTable.mapColumn(ctx, 'DT1', 'E');
  ExcelTable.mapColumn(ctx, 'DT2', 'F');
  ExcelTable.mapColumn(ctx, 'TS1', 'G', 'DD-MON-RR HH.MI.SS.FF9 AM');
  ExcelTable.mapColumn(ctx, 'TEXT1', 'H');
  ExcelTable.mapColumn(ctx, 'TEXT2', 'I');
  ExcelTable.mapColumn(ctx, 'TEXT3', 'J');
  nrows := ExcelTable.loadData(ctx, ExcelTable.getFile('XL_DATA_DIR','tmp002.xls'), 'data', p_dml_type => ExcelTable.DML_INSERT);
  dbms_output.put_line(nrows);
end;
/

Let me know how it works in your case. (I'm still interested in the details asked earlier)

nicholas-ochoa commented 6 years ago

Sorry for the delay in response. I'm running on 12.1 DB and using 2.3 (not using DML API - that seems to be much slower and less flexible). I'm using the getRows API.

There are around 63,000 rows in the file, 70 columns. The columns are all character (defined in the p_cols param as VARCHAR2(4000)). The actual data consists of a few data types (char, numeric, dates) but I load everything as varchar2 and use a view on the data to do the actual conversion to the proper data types.

The largest string value is around 128 characters, but there is only a handful of columns like that. The overwhelming majority of columns have a value of either "Enabled" or "Disabled" (effectively boolean columns).

I've got the packages compiled using native/optimzation level 3 as well.

I will give the DML API a shot for XLS (I've only tried it on a 16mb XLSX file before), but fortunately the data set I'm getting is now being generated in XLSX format so I'm going to try loading it once more as I've had a lot more success with XLSX and the Java stream reader.

mbleron commented 6 years ago

I've published a new enhancement (v2.3.2). The XLS parser now uses a buffered LOB reader. The test case in my previous reply now runs in 30 secs, so a 50% gain.

The columns are all character (defined in the p_cols param as VARCHAR2(4000)).

It's not a good idea to project everything as VARCHAR2. In particular, setting a VARCHAR2 via the ODCI interface is generally more costly than a NUMBER. So whenever possible, always use the native/required data type directly, it'll save some conversion steps.

mbleron / ExcelTable

Chokes on large (30-35mb) XLS files, any way to fix? #5