kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

API for bulk loading of data #2739

Open ragnard opened 5 months ago

ragnard commented 5 months ago

Currently, the only efficient way to create large databases in Kùzu appears to be the COPY FROM clause in Cypher. This is convenient but somewhat limited, because only the sources and formats that COPY FROM understands can be used; adding a new source or format requires changes to Kùzu itself.

Given today's explosion of data-storage solutions and data/table formats, it would be great if users of Kùzu could integrate their data easily, without requiring support in the database itself or having to go through a conversion to CSV or some other supported format.

A possible solution is to provide some form of bulk-insert API.

Such an API would have to provide an efficient way to stream rows into node and relationship tables. DuckDB has the Appender interface for bulk inserts, which could serve as inspiration.
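For reference, a minimal sketch of DuckDB's Appender via its Java (JDBC) client; the table name and row contents are made up for illustration:

```java
import java.sql.DriverManager;
import org.duckdb.DuckDBAppender;
import org.duckdb.DuckDBConnection;

public class AppenderDemo {
    public static void main(String[] args) throws Exception {
        DuckDBConnection conn =
            (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:");
        conn.createStatement().execute(
            "CREATE TABLE person (id INTEGER, name VARCHAR)");
        // The Appender streams rows into the table directly, bypassing
        // per-row SQL parsing and binding.
        try (DuckDBAppender appender =
                 conn.createAppender(DuckDBConnection.DEFAULT_SCHEMA, "person")) {
            for (int i = 0; i < 1_000_000; i++) {
                appender.beginRow();
                appender.append(i);
                appender.append("name-" + i);
                appender.endRow();
            }
        }
    }
}
```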

One real-world use case would be creating a Kùzu database from Iceberg tables stored on S3. Iceberg is a non-trivial table format to implement, and there is no native C/C++ library, but there is an excellent official implementation in Java. If Kùzu had a bulk-insert API, it should be trivial to consume data from Iceberg using its official API and bulk-insert the results into Kùzu.
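To make the shape of this concrete, a rough sketch: the read path uses the real Iceberg Java API, but NodeAppender, the warehouse location, and the table/column names are all hypothetical; no such appender exists in Kùzu today.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

public class IcebergToKuzu {
    // Hypothetical stand-in for the bulk-insert API proposed in this issue.
    interface NodeAppender extends AutoCloseable {
        void appendRow(Object... values);
    }

    static void load(NodeAppender appender) throws Exception {
        // Read an Iceberg table with the official Java API.
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "s3a://my-bucket/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("analytics", "users"));
        try (CloseableIterable<Record> rows = IcebergGenerics.read(table).build()) {
            for (Record row : rows) {
                // Stream each Iceberg record straight into a Kùzu node table.
                appender.appendRow(row.getField("id"), row.getField("name"));
            }
        }
    }
}
```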

nichtich commented 4 months ago

Please try not to invent your own bulk import format, but first support the OpenCypher/Neo4J CSV import format. Amazon Neptune and Neo4J support the same format, under the names "CSV header format" and "OpenCypher CSV format" respectively (although the OpenCypher community has nothing to do with this format). The only difference I have found so far is that Neptune does not allow configuring the array delimiter (set to ; by default), so importing values containing a semicolon is limited in Neptune. Neither Neo4J nor Neptune allows escaping the array delimiter, so there must be one character that cannot be imported via this format; that is the only limitation.

ray6080 commented 4 months ago

Hi @nichtich, what do you mean by "support the OpenCypher/Neo4J CSV import format"? We support the standard CSV format, and headers are optional, so no CSV header format is needed as in Neo4j. Instead, we require users to define table schemas before importing data (see the docs: https://kuzudb.com/docusaurus/cypher/data-definition/create-table).
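A minimal sketch of this flow through the Java client (table, column, and file names invented for illustration; client class names may differ between Kùzu versions):

```java
import com.kuzudb.KuzuConnection;
import com.kuzudb.KuzuDatabase;

public class SchemaThenCopy {
    public static void main(String[] args) throws Exception {
        KuzuDatabase db = new KuzuDatabase("demo-db");
        KuzuConnection conn = new KuzuConnection(db);
        // Schemas are declared up front, for node and relationship tables alike...
        conn.query("CREATE NODE TABLE User(name STRING, age INT64, PRIMARY KEY (name))");
        conn.query("CREATE REL TABLE Follows(FROM User TO User, since DATE)");
        // ...and COPY then bulk-loads plain CSV files into those tables.
        conn.query("COPY User FROM \"user.csv\"");
        conn.query("COPY Follows FROM \"follows.csv\"");
    }
}
```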

semihsalihoglu-uw commented 4 months ago

Let me clarify several things. Our data model, which I call the structured property graph model, differs from the standard property graph model in two ways:

  1. Kuzu does not allow nodes to have multiple labels.
  2. Kuzu does not allow nodes and relationships to have arbitrary key-value properties. Instead, we require users to structure their properties and predefine them in a schema.

A better way to think of our data model is that it is closer to the relational model, with the distinction that users need to specify which tables are node/entity tables and which ones are relationships. For these two reasons, we cannot directly adopt the formats you mention, which may contain multiple labels per node and non-overlapping properties on nodes with the same label.

One thing we can do is to add a separate feature to load data from this CSV format, with the restriction that for nodes with multiple labels we only parse the first label, and that we take the union of all the properties we see per node label. Alternatively, we could simply omit nodes with multiple labels, or raise an error when we detect them (or expose a parameter to configure this behavior). This could be implemented with an initial pass over the CSV files that infers the schemas of the node and relationship tables, then creates those tables, followed by a second pass that ingests the data into them. This requires quite a bit of work but is doable; a sketch of the first pass follows. If loading performance is very important, a much faster way is of course to ingest one CSV file per node or relationship table directly.
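A rough sketch of the schema-inference pass, assuming a single Neo4j-style node file with a :LABEL column; the file name, the naive comma splitting (no quoting support), and the all-STRING typing are simplifications:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class InferNodeSchemas {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Path.of("nodes.csv"));
        String[] header = lines.get(0).split(","); // e.g. "id:ID,name,age,:LABEL"
        int labelCol = Arrays.asList(header).indexOf(":LABEL");

        // Pass 1: collect the union of non-empty properties per node label,
        // keeping only the first label of multi-labeled nodes.
        Map<String, Set<String>> propsByLabel = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] row = line.split(",", -1);
            String label = row[labelCol].split(";")[0];
            Set<String> props =
                propsByLabel.computeIfAbsent(label, k -> new LinkedHashSet<>());
            for (int i = 0; i < header.length; i++) {
                if (i != labelCol && !row[i].isEmpty()) {
                    props.add(header[i].split(":")[0]);
                }
            }
        }

        // Emit one CREATE NODE TABLE per label (typing everything as STRING
        // for brevity); a second pass would then ingest the rows per table.
        for (Map.Entry<String, Set<String>> e : propsByLabel.entrySet()) {
            StringJoiner cols = new StringJoiner(", ");
            for (String p : e.getValue()) cols.add(p + " STRING");
            System.out.printf("CREATE NODE TABLE %s(%s, PRIMARY KEY (id))%n",
                e.getKey(), cols);
        }
    }
}
```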

nichtich commented 4 months ago

Kuzu supports its own form of CSV where each node type/table is one file, and a table schema has to be defined in advance. In contrast, the Neo4J and Neptune CSV import formats use two CSV files, one for nodes and one for edges, with a common schema given in the first header line for all nodes and edges, respectively.

Thanks @semihsalihoglu-uw for the clarification of the Kuzu data model. Neo4J and Neptune have their own restrictions compared to the general abstract property graph model, so different formats make sense. (By the way, I'm working on a tool to convert between different property graph formats and databases, and would like to support Kuzu as well.)

If this issue is only about bulk loading in general, I'd recommend documenting a form of bundling CSV files and schemas, e.g. a directory with files such as:

data/user.schema
data/user.csv
data/follows.schema
data/follows.csv

Just give this practice a name, such as "Kuzu database import directory format", and implement a command to load the whole graph from the directory data; a sketch of such a loader follows.
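One possible realization, assuming each .schema file contains a single Cypher CREATE ... TABLE statement and the table is named after the file (directory layout and client class names are assumptions, as above):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import com.kuzudb.KuzuConnection;
import com.kuzudb.KuzuDatabase;

public class ImportDirectory {
    public static void main(String[] args) throws Exception {
        KuzuDatabase db = new KuzuDatabase("demo-db");
        KuzuConnection conn = new KuzuConnection(db);
        // NB: node tables must exist before the relationship tables that
        // reference them; a real loader would order the schema files accordingly.
        try (DirectoryStream<Path> schemas =
                 Files.newDirectoryStream(Path.of("data"), "*.schema")) {
            for (Path schema : schemas) {
                conn.query(Files.readString(schema)); // CREATE ... TABLE statement
                String table =
                    schema.getFileName().toString().replace(".schema", "");
                conn.query("COPY " + table + " FROM \"data/" + table + ".csv\"");
            }
        }
    }
}
```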

If this issue is about improving interoperability, then support for any other property graph format would help. As far as I know, only YARS-PG supports embedded schemas. The OpenCypher/Neo4J CSV import format has the benefit of already being documented and established, but as you wrote, it (like all the other formats) requires a pre-processing step to infer a schema from the given data. On the other hand, this step is required anyway for most real-world data.

semihsalihoglu-uw commented 4 months ago

OK, let's think a bit more about your comments before taking any actions. I think the original issue by @ragnard was about the performance of bulk loading.