neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Construction from dataframe, wrong IDs datatype #309

Closed Mintactus closed 5 months ago

Mintactus commented 5 months ago

The ID data type, or its behavior, in GDS for dataframe construction (with or without Arrow) is wrong. Here is why:

Creating IDs from hash values is really useful, sometimes even critical, and widely adopted.

The native Python hash function uses Int64 (on 64-bit builds), and it natively produces negative numbers, as the Int64 type allows. The Polars hash function uses UInt64 and never produces negative integers. The pandas hash function also uses UInt64 and never produces negative integers.

Your implementation of Int64 in GDS cannot handle negative numbers, and it cannot handle huge unsigned positive integers. So we are stuck, and this type is useless for most IDs created by hash functions.
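
To make the range problem above concrete, here is a minimal sketch in plain Python. It assumes a stand-in unsigned 64-bit hash built from `hashlib.blake2b` (this is not the Polars or pandas hash, and the helper names `uint64_hash` and `fits_signed_int64` are hypothetical). It counts how many hash values are larger than the biggest value a signed Int64 can hold.

```python
import hashlib

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def uint64_hash(key: str) -> int:
    """Stand-in unsigned 64-bit hash (an assumption, not a dataframe library's hash)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=False)


def fits_signed_int64(value: int) -> bool:
    """True if the value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


keys = [f"node-{i}" for i in range(1000)]
hashes = [uint64_hash(k) for k in keys]

# Any hash with its top bit set is above INT64_MAX and cannot be stored as Int64.
too_large = [h for h in hashes if not fits_signed_int64(h)]
print(f"{len(too_large)} of {len(hashes)} hashes do not fit into Int64")
```

With a well-distributed 64-bit hash, roughly half of the values should have the top bit set, so about half of the IDs would be out of range without a conversion step.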

knutwalker commented 5 months ago

Hi @Mintactus,

since GDS is implemented in Java, and Java has neither unsigned integer types nor primitives to properly emulate an unsigned 64-bit integer, we will not be able to natively support this within GDS.

We recommend either using a hash function that outputs an Int64 or transforming the result into an Int64 (e.g. by setting the sign bit to 0 or extending from a UInt32).
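
The two transformations suggested above could look roughly like this in plain Python. This is only a sketch: the helper names are hypothetical, and `zlib.crc32` is used purely as an example of a hash that yields an unsigned 32-bit value, not as a recommendation from the GDS team.

```python
import zlib

SIGN_BIT_MASK = (1 << 63) - 1  # 0x7FFF_FFFF_FFFF_FFFF: all bits set except the top one


def clear_sign_bit(unsigned_hash: int) -> int:
    """Map an unsigned 64-bit value into [0, 2**63 - 1] by zeroing the top bit."""
    return unsigned_hash & SIGN_BIT_MASK


def id_from_uint32_hash(key: str) -> int:
    """Use a 32-bit unsigned hash; any such value fits into a non-negative Int64."""
    return zlib.crc32(key.encode("utf-8"))


print(clear_sign_bit(2**64 - 1) == 2**63 - 1)  # the largest UInt64 maps to the largest Int64
print(0 <= id_from_uint32_hash("node-1") < 2**32)
```

Both options shrink the ID space (63 bits and 32 bits respectively), so two distinct hashes can end up as the same ID. With only 32 bits, collisions become likely well before the number of nodes approaches 2**32.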

Mintactus commented 5 months ago

Thanks for the reply. I found a workaround: a floor divide by 2, then a cast. The UInt32 option isn't a bad solution either.
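
For reference, the floor-divide workaround can be sketched in plain Python as follows. The function name is hypothetical, and the dataframe version (an integer floor division on the UInt64 column followed by a cast to Int64) is assumed rather than shown in this thread.

```python
INT64_MAX = 2**63 - 1


def halve_to_int64_range(unsigned_hash: int) -> int:
    """Floor-divide an unsigned 64-bit value by 2 so it lands in [0, 2**63 - 1]."""
    return unsigned_hash // 2


largest_uint64 = 2**64 - 1
converted = halve_to_int64_range(largest_uint64)
print(converted <= INT64_MAX)

# Dividing by 2 drops the lowest bit, so neighbouring hashes collapse to one ID.
print(halve_to_int64_range(10) == halve_to_int64_range(11))
```

This discards the lowest bit, whereas clearing the sign bit discards the highest one. Either way the result has 63 usable bits.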
