kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.36k stars 96 forks source link

Bug: data type problem with SERIAL #4330

Open ubmarco opened 2 weeks ago

ubmarco commented 2 weeks ago

Kùzu version

v0.6.0

What operating system are you using?

Arch Linux x64

What happened?

I'm running this list of queries against an empty DB:

CREATE NODE TABLE Person(id SERIAL, name STRING, PRIMARY KEY(id));
CREATE NODE TABLE Company(id STRING, name STRING, PRIMARY KEY(id));
MATCH (a) RETURN a;

and get this error: Binder exception: Expected the same data type for property id but found STRING and SERIAL. The error appears when running the MATCH query.

What's the problem? Thanks for helping out!

Are there known steps to reproduce?

docker run -p 8001:8000 --rm -e KUZU_IN_MEMORY=true kuzudb/explorer:latest

CREATE NODE TABLE Person(id SERIAL, name STRING, PRIMARY KEY(id));
CREATE NODE TABLE Company(id STRING, name STRING, PRIMARY KEY(id));
MATCH (a) RETURN a;
ubmarco commented 2 weeks ago

Both queries

MATCH (a:Person) RETURN a;
MATCH (a:Company) RETURN a;

return successfully with an empty response, but when I combine the PRIMARY KEY types SERIAL and STRING, e.g. with

MATCH (a:Person:Company) RETURN a;

I get Binder exception: Expected the same data type for property id but found STRING and SERIAL..

andyfengHKU commented 2 weeks ago

Hi @ubmarco,

This is expected behaviour that we cannot combine property with same name but different data types into the same column. Because that will break many optimizations we do internally.

I'll suggest a workaround to either use the same data type for id. Or use different primary key name, e.g.

CREATE NODE TABLE Person(p_id SERIAL, name STRING, PRIMARY KEY(p_id));
CREATE NODE TABLE Company(c_id STRING, name STRING, PRIMARY KEY(c_id));
MATCH (a) RETURN a;

The above will work.

In the meantime, I'll discuss some possible solutions with the team. Two things that come into my mind are

  1. Auto-rename internally. So we implicitly use p_id & c_id in the final result
    MATCH (a) RETURN a;
    {p_id:0, c_id:'something', name:'Alice'}
  2. Cast to the same data type. In the above case, we will case SERIAL id to STRING id.

Let me know if there are other solutions that make sense to you. And feel free to comment the above solutions.

ubmarco commented 2 weeks ago

Hi @andyfengHKU, thanks a lot for the quick answer. The explanation makes total sense and thanks also for providing a workaround. It's better than mine which is to have both id as STRING and then generating unique IDs on Python side with uuid.uuid4().hex[:10] which gives me values like 27738b1013. However this consumes memory on DB side and when transferring data. I'll certainly try your solution.

The other solution that came to mind was to emulate SERIAL on client side by defining id as INT64 and query the biggest existing number before inserting new items. This should work as there is only one writing client connection to the DB at a time.

I think documenting the behavior would be beneficial to newcomers.

Btw, thanks a lot for this awesome project and the active contributions, it certainly fills a gap in the Graph DB ecosystem.