kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Tutorial on PyG integration broken with kuzu 0.0.11. #2291

Open irustandi opened 12 months ago

irustandi commented 12 months ago

I’m trying to run the tutorial on the integration with PyG. After fixing some problems in the example code (PR), I found that while it still works with kuzu 0.0.3, the example doesn’t work with kuzu 0.0.11 (in each run, the data were loaded into the DB with the same kuzu version).

Narrowing it down, I’ve isolated the problem to this call:

import numpy as np
import torch
import torch_geometric

# feature_store is returned by get_torch_geometric_remote_backend()
attr = torch_geometric.data.TensorAttr(group_name="paper", attr_name="id", index=torch.tensor(np.array([index])))
result = feature_store.get_tensor(attr)

The expected behavior is that result contains the same value as index. But running this locally with kuzu 0.0.11, I found that for index >= 2048, result can differ from index. For example, with index 2048 I got result 12288. This breaks the training code, which expects every element in the batch to have a label; with this bug, many elements in a batch may end up without labels.

mewim commented 11 months ago

Hi @irustandi,

Thanks for reporting this and also helping us fix the typos in the PyG example!

We have looked into the problem. The root cause is that the NPY files are now copied into the node tables in parallel. The parallel copy can write values to the node table in arbitrary order, so the order of values in the table may differ from their order in the original NPY file, and scanning then produces incorrect results.
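A toy sketch of the failure mode (this is not kuzu's actual copy pipeline, just an illustration with Python threads): when chunks of a file are written to the table in whatever order the workers finish, every value still arrives, but row i no longer necessarily holds value i.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

data = list(range(16))                          # stands in for rows of an NPY file
chunks = [data[i:i + 4] for i in range(0, 16, 4)]

def load_chunk(chunk):
    time.sleep(random.uniform(0, 0.01))         # workers finish at arbitrary times
    return chunk

table = []                                      # stands in for the node table
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(load_chunk, c) for c in chunks]
    # Appending chunks in completion order writes them in arbitrary order.
    for f in as_completed(futures):
        table.extend(f.result())

# All values are present, but positions may be shuffled chunk-by-chunk,
# so a positional lookup (row index -> value) can return the wrong value.
assert sorted(table) == data
```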

A quick fix is to force the NPY files to be copied into the tables sequentially, so that the order is preserved. This can be done by changing this line to conn = kuzu.Connection(db, num_threads=1). Note that it is still fine to run the remote backend with multi-threading; only the copy needs to be done with a single thread.
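Concretely, the workaround might look like the sketch below. The database path, table name, and COPY statement are hypothetical placeholders for whatever the tutorial uses; the point is that one single-threaded connection does the copy, while the remote backend is obtained separately and may use all cores.

```python
import multiprocessing as mp

try:
    import kuzu  # requires the kuzu package; guarded so the sketch imports cleanly
except ImportError:
    kuzu = None

def load_and_get_backend(db_path, npy_files):
    """Copy NPY data with one thread, then return the PyG remote backend."""
    db = kuzu.Database(db_path)
    # Single-threaded connection: sequential copy preserves NPY row order.
    copy_conn = kuzu.Connection(db, num_threads=1)
    # Hypothetical COPY statement; use the tutorial's actual schema/files here.
    copy_conn.execute(f'COPY paper FROM ({npy_files}) BY COLUMN;')
    # Multi-threading remains fine for scanning via the remote backend.
    return db.get_torch_geometric_remote_backend(mp.cpu_count())
```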

However, we should be able to copy the NPY files in parallel while still preserving the order. This would require some changes to our copy pipeline. I will leave this to @ray6080 and @acquamarin.