kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.31k stars 94 forks source link

Feature: Implicitly generate null value column when using `COPY` #4305

Open prrao87 opened 4 days ago

prrao87 commented 4 days ago

API

Other

Description

I have this scenario where I have a CSV/Parquet file with just two columns:

product   price
Laptop 1100.0
Mouse 150.0
Headphones 250.0

In my DDL, I want to add a nullable column historical_price as follows:

CREATE NODE TABLE IF NOT EXISTS Product(name STRING, price DOUBLE, historical_sales INT32, PRIMARY KEY (name));

I want to initialize the node table with all values in the historical_sales columns as nulls, and add them via MERGE at a later time when they become available.

Issue

Because my input file has just 2 columns, and my DDL specifies 3 columns, I cannot use COPY Product FROM 'product.parquet' directly.

I instead have to do this:

COPY Product(name, price) FROM 'data/product.parquet'

Feature request

In the COPY pipeline, because we already know the number of columns in the input prior to importing the data, can we implicitly infer the column names so that we can use the much simpler DDL command below? It would reduce the mental burden on the user, as it's expected that the columns that are absent in the input file would have to be filled with nulls.

COPY Product FROM 'data/product.parquet'

@ray6080 I think this seems like a reasonable feature, but if you think it's infeasible, feel free to close.

ray6080 commented 3 days ago

In the COPY pipeline, because we already know the number of columns in the input prior to importing the data, can we implicitly infer the column names so that we can use the much simpler DDL command below?

I think it can introduce some confusing behaviours if we implicitly infer column names, such as what if column names don't exactly match? or when there are no header information in the source? In that sense, COPY Product(name, price) FROM 'source' is much less prone to confusions I think. But let's do more survey on this and see how other systems handle such cases before we jump to a conclusion.