neo4j / graph-data-science-client

A Python client for the Neo4j Graph Data Science (GDS) library
https://neo4j.com/product/graph-data-science/
Apache License 2.0

Export gds result to polars instead of pandas #653

Open Mintactus opened 5 months ago

Mintactus commented 5 months ago

https://pola.rs/

Polars is setting a brand-new standard for data processing; it would be awesome to have it as an output option for GDS functions. It could be a parameter you choose when you build the GDS client, e.g. exportType = [pandas, polars, Apache Arrow IPC, etc.]

Not just pandas, which is deprecated.

gminneci commented 4 months ago

Hi there, thank you for bringing this to our attention. It's great to see performance improving and community interest in new libraries; we constantly monitor requests like this one. Pandas is still used and loved by the majority of our customers, while Polars is emerging. We will evaluate whether it's worth integrating natively, but in the meantime we suggest using polars.from_pandas as an efficient workaround.
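A minimal sketch of that workaround, assuming the GDS call has already returned a pandas DataFrame. The helper name pandas_result_to_polars is hypothetical and not part of the GDS Python Client; polars.from_pandas is the only library call used, and it generally expects pandas and pyarrow to be installed alongside Polars.

```python
from typing import Any


def pandas_result_to_polars(pandas_df: Any) -> Any:
    """Convert a pandas DataFrame (e.g. a GDS client result) to a Polars DataFrame.

    Hypothetical helper for illustration; not part of the GDS Python Client.
    """
    # Imported lazily so that Polars stays an optional dependency of the caller.
    import polars as pl

    return pl.from_pandas(pandas_df)


# Usage sketch (assumes pandas and polars are installed; the data is made up):
#   import pandas as pd
#   scores = pd.DataFrame({"nodeId": [0, 1], "score": [0.15, 0.85]})
#   polars_scores = pandas_result_to_polars(scores)
```

The conversion happens once, on the client side, after the result has been fetched, so it adds a copy step rather than changing how GDS exports the data.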

MichaelSchmidt1729 commented 4 months ago

+1 for exporting to polars

Mats-SX commented 4 months ago

Moving this to the GDS Python Client repository. The GDS library itself is agnostic to Pandas/Polars. Exports are possible using Bolt or Arrow. The internals of GDS are not based on Arrow, but are our own custom implementation, with some third-party data structures (not Arrow itself).

Mats-SX commented 4 months ago

The GDS Python Client wraps the Neo4j Python Driver (https://github.com/neo4j/neo4j-python-driver), which determines how the GDS Python Client exports Cypher query results: through the Neo4j Python Driver's to_df() method (docs).

To get this Cypher driver to export to Polars as well, I suggest raising an issue on that repository. I will also mention it via Neo4j-internal channels.

The GDS Python Client can also export using Apache Arrow via the GDS Arrow Server. This does not use the Neo4j Python Driver, but makes an independent connection to the GDS Arrow Server using an Arrow client based on the pyarrow library. The pyarrow library returns results from the Arrow stream as Table (docs) objects, which have a to_pandas() (docs) method.

As @gminneci mentions, Polars supports reading from a Pandas DataFrame, so it is possible to hook up the workflow. It is not directly possible for the GDS Python Client to use a different method from the underlying pyarrow library. It is not perfectly in line with the purpose of the GDS Python Client to support conversion between two third-party data structures (pyarrow.Table and polars.DataFrame). If either pyarrow or Polars supported this, it would be more convenient. As it stands, conversion goes via polars.from_pandas(), which is still a more appropriate location compared to the GDS Python Client.
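For illustration, here is a rough sketch of what a caller-side export switch over a pyarrow.Table could look like. Everything named here (convert_arrow_result, the export_type values) is hypothetical and not an API of the GDS Python Client, and the thread does not say that the client exposes the underlying pyarrow.Table to users. The sketch relies on Table.to_pandas() and on polars.from_arrow, which Polars documents as accepting a pyarrow.Table.

```python
from typing import Any, Callable, Dict


def _as_arrow(table: Any) -> Any:
    # Hand the pyarrow.Table back unchanged.
    return table


def _as_pandas(table: Any) -> Any:
    # pyarrow.Table.to_pandas() is the conversion mentioned in the thread.
    return table.to_pandas()


def _as_polars(table: Any) -> Any:
    # Lazy import so that Polars is only needed when it is requested.
    import polars as pl

    return pl.from_arrow(table)


# Hypothetical option names, loosely following the exportType idea in the issue.
_CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "arrow": _as_arrow,
    "pandas": _as_pandas,
    "polars": _as_polars,
}


def convert_arrow_result(table: Any, export_type: str = "pandas") -> Any:
    """Convert an Arrow table to the requested frame type (illustrative only)."""
    try:
        converter = _CONVERTERS[export_type]
    except KeyError:
        valid = ", ".join(sorted(_CONVERTERS))
        raise ValueError(
            f"Unknown export_type {export_type!r}; expected one of: {valid}"
        ) from None
    return converter(table)
```

Keeping the switch outside the client matches the position taken above: the client keeps returning what pyarrow gives it, and the choice of frame library stays with the caller.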

We are naturally very happy to see the interest in GDS and its software parts (library, client, database) so we are not rejecting this feature request. However, in the presence of workarounds and no very low-hanging possibilities for uniform integration (other than bundling Polars and calling from_pandas() within this library, which doesn't seem so attractive), we're keeping this tracked with no immediate plan to address it.

Thank you for raising this issue! All the best, Mats