kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.25k stars, 88 forks

Feature: PySpark integration #3774

Open BjarkeTornager opened 1 month ago

BjarkeTornager commented 1 month ago

API

Python

Description

Have you considered making an integration between Kùzu and PySpark?

Neo4j, as an example, has a Neo4j connector for Apache Spark.

Spark also has a community project called GraphFrames that can be used for basic graph algorithms.

Since Spark is widely used for analytical queries, machine learning, and streaming, it could be useful to move data between the two.

prrao87 commented 1 month ago

Hi @BjarkeTornager, this is something that could be on the roadmap but has not yet been prioritized, as we typically wait for several upvotes from the community before deciding how much to prioritize new integrations. There are numerous other integrations already underway for our 0.5.0 release and beyond, so we hope you can understand. In the meantime, we are also releasing a basic graph algorithms package soon that can provide some of the functionality that GraphFrames does, so stay tuned!

BjarkeTornager commented 1 month ago

Thanks @prrao87, looking forward to the Kùzu basic graph algorithm package!

abhiwattpad commented 1 month ago

It would be nice to have Spark integration with Kùzu, especially for large-scale data ingestion!

prrao87 commented 3 weeks ago

Just adding some scope for initial functionality here: The proposed integration would behave just like the Pandas/Polars DataFrame integration does:

Unlike Pandas/Polars, the I/O and related tasks may not be fully in-memory - we'd need to see how the persistent formats under the hood of Spark work, and also how to design the API to expose the connector to the Python client of Kùzu.
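
To make the scope above concrete, here is a rough sketch of an interim bridge that routes through the existing pandas integration. It is not the proposed connector, and it is not an official Kùzu API: the helper names `spark_to_kuzu` and `kuzu_to_spark` are hypothetical. It assumes `conn` is a `kuzu.Connection`, `spark` is a `pyspark.sql.SparkSession`, and that Kùzu resolves the local variable name `df` inside the query the way its documented pandas examples do. Because `toPandas()` collects everything onto the driver, this only suits data that fits in memory, which is exactly the limitation a real connector would need to avoid.

```python
def spark_to_kuzu(conn, spark_df, table):
    """Bulk-load a Spark DataFrame into an existing Kùzu table via pandas.

    Hypothetical helper: `conn` is assumed to be a kuzu.Connection and
    `spark_df` a pyspark.sql.DataFrame.
    """
    # Collects every row onto the Spark driver, so the data must fit in memory.
    df = spark_df.toPandas()
    # Assumption: Kùzu's pandas integration looks up the DataFrame by its
    # Python variable name (`df`). The table name is interpolated as-is,
    # so pass only trusted identifiers.
    return conn.execute(f"COPY {table} FROM df")


def kuzu_to_spark(spark, conn, query):
    """Run a Cypher query in Kùzu and return the result as a Spark DataFrame.

    Hypothetical helper: `spark` is assumed to be a pyspark.sql.SparkSession.
    """
    result = conn.execute(query)
    # get_as_df() materializes the full result as a pandas DataFrame,
    # which Spark can then distribute with createDataFrame().
    return spark.createDataFrame(result.get_as_df())
```

With a real session, usage might look like `spark_to_kuzu(conn, spark.read.parquet("people.parquet"), "Person")` followed by `kuzu_to_spark(spark, conn, "MATCH (p:Person) RETURN p.name, p.age")`, where the file name and the `Person` table are placeholders.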