polydbms / sheetreader-duckdb

MIT License
44 stars 4 forks source link

SheetReader DuckDB extension

sheetreader is a DuckDB extension that allows reading XLSX files into DuckDB tables with SheetReader, our blazingly fast XLSX parser (https://github.com/polydbms/sheetreader-core).


This repository is based on https://github.com/duckdb/extension-template.

Table of Contents

Usage

Before using SheetReader, you need to install it from the community extensions and load it into your DuckDB-environment:

INSTALL sheetreader FROM community;
LOAD sheetreader;

Now, you can run your first query:

D SELECT *
  FROM sheetreader('test.xlsx');

The sheetreader() function offers further parameters to load the XLSX-file as required:

D CREATE TABLE test AS FROM sheetreader(
    'test.xlsx',
    sheet_index=1,
    threads=16,
    skip_rows=0,
    has_header=TRUE,
    types=[BOOLEAN,VARCHAR],
    coerce_to_string=TRUE,
    force_types=TRUE
  );

Parameters

Name Description Type Default
sheet_index Index of the sheet to read. Starts at 1. INTEGER 1
sheet_name Name of the sheet to read.
Only either sheet_index or sheet_name can be set.
VARCHAR ""
threads Number of threads to use, while parsing INTEGER Half of available cores; minimum 1
skip_rows Number of rows to skip INTEGER 0
types List of types for all columns
  • Types currently available:
    VARCHAR,BOOLEAN,DOUBLE, DATE.
  • Useful in combination with coerce_to_string and force_types.
LIST(VARCHAR) Uses types determined by first & second row (after skipped rows)
coerce_to_string Coerce all cells in column of type VARCHAR to string (i.e. VARCHAR). BOOLEAN false
force_types Use types even if they are not compatible with types determined by first/second row.
Cells, that are not of the column type, are set to NULL or coerced to string, if option is set.
BOOLEAN false
has_header If set to true:
  • Force to treat first row as header row (only works if all cells are of type VARCHAR).
  • If successful, the cell contents are used for column names.
  • Will overwrite the default behavior, which doesn't use the first row as headers, if all columns have type VARCHAR.

If set to false:
  • The extension will still try to treat the first row as header row.
  • The difference is that it will not fail, if the first row is not usable (i.e. not all cells are of type VARCHAR).
  • The first row won't be used as headers, if all columns have type VARCHAR.
BOOLEAN false

More information on SheetReader

SheetReader was published in the Information Systems Journal.

@article{DBLP:journals/is/GavriilidisHZM23,
  author       = {Haralampos Gavriilidis and
                  Felix Henze and
                  Eleni Tzirita Zacharatou and
                  Volker Markl},
  title        = {SheetReader: Efficient Specialized Spreadsheet Parsing},
  journal      = {Inf. Syst.},
  volume       = {115},
  pages        = {102183},
  year         = {2023},
  url          = {https://doi.org/10.1016/j.is.2023.102183},
  doi          = {10.1016/J.IS.2023.102183},
  timestamp    = {Mon, 26 Jun 2023 20:54:32 +0200},
  biburl       = {https://dblp.org/rec/journals/is/GavriilidisHZM23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Benchmarks

You can find benchmarks in the above-mentioned paper, comparing SheetReader to other XLSX parsers.

Here is a plot of preliminary benchmarks comparing the sheetreader DuckDB extension to the spatial extension's st_read function:

Benchmark

(System info: 2x Intel(R) Xeon(R) E5530 @ 2.40GHz, 47GiB RAM)

Building yourself

First, clone this repository with the --recurse-submodules flag --- so you get all the needed source files.

To build the extension, run:

GEN=ninja make

The main binaries that will be built are:

./build/release/duckdb
./build/release/extension/sheetreader/sheetreader.duckdb_extension

Running the extension

To run the self-built extension code, simply start the shell with ./build/release/duckdb.