neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
596 stars 157 forks source link

Cannot project a list of integers as a node property #247

Closed tomasonjo closed 1 year ago

tomasonjo commented 1 year ago

GDS version: 2.3.0 Neo4j version: 5.4.0 Operating system: Sandbox

Tried to reproduce on a smaller dataset, but it worked there, so I am attaching the flow that reproduces the error:

// Import
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/survey/responses.csv" AS row
CREATE (p:Person)
SET p += row;

// Convert to integers
:auto MATCH (p:Person)
CALL {
    WITH p
    UNWIND keys(p) AS key
    WITH p, key WHERE toInteger(p[key]) IS NOT NULL
    CALL apoc.create.setProperty(p, key, toInteger(p[key])) YIELD node
    RETURN distinct 'done' AS result
} IN TRANSACTIONS OF 100 rows
RETURN distinct result;

// Fill in missing values

MATCH (p:Person)
UNWIND keys(p) AS key
WITH p, key
WHERE toInteger(p[key]) IS NOT NULL
WITH key, avg(p[key]) AS averageValue
MATCH (p1:Person) WHERE p1[key] IS NULL
CALL apoc.create.setProperty(p1, key, round(averageValue)) YIELD node
RETURN distinct 'done';

// construct a vector
WITH ["Personality", "Music", "Dreams", "Movies", "Fun with friends", "Comedy", "Medicine", "Chemisty", "Shopping centres", "Physics", "Opera", "Animated", "Height", "Weight", "Age", "Number of siblings"] AS excludedProperties
MATCH (sp:Person)
WITH [x in keys(sp) WHERE toInteger(sp[x]) IS NOT NULL AND NOT x IN excludedProperties | x] AS allKeys
LIMIT 1
MATCH (p:Person)
UNWIND allKeys as key
WITH p, key
ORDER BY key
WITH p, collect(p[key]) AS vector
SET p.vector = vector;

// project the graph

CALL gds.graph.project('survey', 'Person', '*', {nodeProperties:['vector']});

The error is:

Failed to invoke procedure gds.graph.project: Caused by: java.lang.UnsupportedOperationException: Cannot safely convert DoubleArray[3.0, 2.0, 1.0, 1.0, 4.0, 5.0, 3.0, 5.0, 5.0, 3.0, 5.0, 3.0, 1.0, 2.0, 5.0, 3.0, 5.0, 3.0, 2.0, 1.0, 2.0, 1.0, 1.0, 5.0, 4.0, 2.0, 1.0, 5.0, 1.0, 5.0, 5.0, 1.0, 2.0, 5.0, 3.0, 2.0, 3.0, 1.0, 1.0, 5.0, 1.0, 2.0, 1.0, 2.0, 1.0, 4.0, 2.0, 1.0, 1.0, 4.0, 5.0, 1.0, 1.0, 4.0, 2.0, 1.0, 3.0, 3.0, 2.0, 4.0, 4.0, 3.0, 4.0, 3.0, 4.0, 4.0, 4.0, 1.0, 5.0, 3.0, 5.0, 5.0, 4.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 5.0, 2.0, 4.0, 4.0, 5.0, 4.0, 4.0, 5.0, 5.0, 2.0, 3.0, 4.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0, 5.0, 4.0, 3.0, 1.0, 5.0, 1.0, 4.0, 1.0, 4.0, 5.0, 1.0, 1.0, 2.0, 1.0, 3.0, 4.0, 3.0, 1.0, 3.0, 1.0, 5.0, 3.0, 4.0] into a Long Array

The issue can safely be avoided by casting elements of the vector as a float:

// construct a vector
WITH ["Personality", "Music", "Dreams", "Movies", "Fun with friends", "Comedy", "Medicine", "Chemisty", "Shopping centres", "Physics", "Opera", "Animated", "Height", "Weight", "Age", "Number of siblings", "vector"] AS excludedProperties
MATCH (sp:Person)
WITH [x in keys(sp) WHERE toInteger(sp[x]) IS NOT NULL AND NOT x IN excludedProperties | x] AS allKeys
LIMIT 1
MATCH (p:Person)
UNWIND allKeys as key
WITH p, key
ORDER BY key
WITH p, collect(toFloat(p[key])) AS vector
SET p.vector = vector;
FlorentinD commented 1 year ago

Hey @tomasonjo , thanks for reporting the issue! checking in our code, it is indeed not supported at the moment to convert arrays. You can see at https://github.com/neo4j/graph-data-science/blob/2af46276d44c6b6daf88101ba78e559dbf123fcf/core/src/main/java/org/neo4j/gds/utils/Neo4jValueConversion.java#L56 what auto conversions we support. We even have a test case where we expect this.

I like the idea of supporting this conversion though, if every float is safe to cast to a long.

tomasonjo commented 1 year ago

It seems it's got something to do with Neo4j internals? Most of the vectors should be LongArray to begin with. If I rerun the vector generating statement I get:

Invalid input for function 'toInteger()': Expected a String, Number or Boolean, got: LongArray[4, 2, 5, 4, 1, 1, 4, 1, 1, 3, 4, 5, 1, 1, 1, 2, 2, 3, 5, 2, 5, 2, 5, 1, 2, 2, 3, 3, 1, 3, 3, 1, 5, 4, 3, 5, 3, 1, 5, 2, 5, 3, 3, 1, 1, 5, 3, 5, 5, 3, 1, 2, 4, 1, 4, 1, 4, 1, 1, 1, 4, 1, 3, 5, 3, 4, 3, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3, 4, 3, 3, 4, 1, 4, 1, 5, 2, 5, 5, 1, 3, 3, 3, 1, 4, 1, 3, 5, 3, 4, 4, 4, 1, 4, 3, 1, 5, 3, 1, 3, 3, 1, 1, 1, 1, 2, 2, 2, 5, 3, 1, 1, 4, 2, 5]

It seems there are some rogue DoubleArray properties due to:

CALL apoc.create.setProperty(p1, key, round(averageValue)) YIELD node

explicitly casting the output to integer solves the issue:

CALL apoc.create.setProperty(p1, key, toInteger(round(averageValue))) YIELD node

Do you think it is worthwhile to report this to Neo4j repo as you would imagine that the round() would output a consistent data format.

tomasonjo commented 1 year ago
MATCH (p:Person)
UNWIND keys(p) AS key
WITH p, key
WHERE toInteger(p[key]) IS NOT NULL
RETURN apoc.meta.cypher.type(p[key]), count(*)
type count
"INTEGER" 139819
"FLOAT" 571
FlorentinD commented 1 year ago

I was not aware that round can produce double values in Neo4j. Would be interesting to know what the underlying logic is. Let me get someone to give us some background first :)

FlorentinD commented 1 year ago

According to https://neo4j.com/docs/cypher-manual/current/functions/mathematical-numeric/#functions-round, round does return a float. But this does not explain why its not consistent

tomasonjo commented 1 year ago

Ok I get it... the round() always produces doubles as there are only 571 missing values. So that's it, we have some rogue double arrays in between, but mostly long arrays, and therefore there is a problem. The joys of schema-less database :)

You can close this issue if you wish

FlorentinD commented 1 year ago

Thanks for finding out

Will do once we merged our PR around adding more automatic conversion support :)

FlorentinD commented 1 year ago

Just merged an improvement for the automatic conversion of array properties, which will be part of 2.4

xpilasneo4j commented 1 year ago

Hi. As discussed, please my use case which generates the same error. Sorry I didn't made it simpler but it's quite quick to setup. Just load the 3 csv files using "medical_cypher.txt" and then run "gds_issue.txt" to have the last command showing the issue. Seems it's coming from the nodes labeled Drug as it works once removed. I checked the type of the property and it say LIST OF FLOAT. Let me know if I did something wrong... EE V5.5, GDS v2.3.1 medical.zip

FlorentinD commented 1 year ago

(as discussed) The gds.fastRP.write will write back float[] but the Drug and Diag nodes will have only double[] values as the Cypher toFloatList creates doubles (as Cypher only uses doubles).

I would suggest to only load the fastRP_Embed for the Patient nodes.

Mats-SX commented 1 year ago

This issue is already solved, but for any potential future reader, I want to offer this reply:

I've reproduced the error. I notice that you do not write any FastRP embeddings for the :Drug nodes. That's explicitly the effect of the nodeLabels filter that you configure in the fastrp.write query:

CALL gds.fastRP.write($graphname,{
    embeddingDimension: 6,
    nodeLabels: ['Patient'],   // <--- here we only use Patient nodes
    randomSeed: 42,
        propertyRatio: 1.0,
    nodeSelfInfluence: 1.0,
        featureProperties: ['Diab', 'Dyslipidaemia', 'Osteo', 'Hypertension'],
    writeProperty: 'fastRP__Embed'
});

Then, in your projection query, you use the syntactic sugar short-form to specify node properties to project. Properties specified in this form apply to all projected node labels. Since you want to project the fastRP__Embed property that you wrote to Patient nodes you specify that property. But it does not exist on any of the Drug nodes. As a result, the node projection is ill-formed, I would say. You will get a fastRP__Embed property for Drug nodes, but it will be all zero-vectors. Not very useful.

The quickest fix, in my opinion, is to separate the node property configuration and make these label-specific. Here is an example:

WITH {
  Patient: { properties: ['targetProperty', 'isTrain', 'fastRP__Embed'] }, 
  Drug: { properties: { targetProperty: {}, isTrain: {} } }
} AS nodeProjection
CALL gds.graph.project($graphname2, nodeProjection, 'PRESCRIBED')
YIELD graphName, nodeProjection, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodeProjection AS patientProjection, nodes, rels;