neo4j / apoc

Apache License 2.0
84 stars 27 forks source link

apoc.create.virtual.fromNode should wrap the existing node while keeping it's node-id and elementId but it doesn't #570

Open jexp opened 8 months ago

jexp commented 8 months ago

e.g. apoc.create.virtual.fromNode(movie, ["title","released","url"])

returns virtual nodes with negative ids but it shouldn't in this case.

Seems like an implementation bug that somehow crept in.

Need it for excluding node embeddings and large text properties for returning in neo4j tools

Lojjs commented 8 months ago

@jexp Do you want us in Cypher surface to look into this? In that case could you provide us with a setup so we can reproduce?

jexp commented 8 months ago

you can test it on https://demo.neo4jlabs.com:7473/browser/ username/password/database is recommendations

The original idea was to have an easy way to select which properties to return from a node to a client UI while keeping the functionality that you have when returning regular nodes. This is meant to exclude heavy or irrelevant properties like large texts, embeddings, large number of properties and focus on the relevant properties for captions and querying so that graphs can be visualized without overloading the UI.

https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/148

So wrapping a node .i.e. keeping it's identity and then limiting which properties are available.

As I recently discovered the "keeping the identity" part was not implemented, so that the wrapped nodes are not tied to the original node or its relationships, which would have to be awkwardly reconstructed.

Performance comparison for the query below, which returns roughly 34 nodes and 37 rels is 10 seconds vs 10ms for the approach excluding the huge properties. It fetches two embedding and one large text property per node.

// 10s+
call db.index.fulltext.queryNodes("movieFulltext","Forrest Gump", {limit:1}) yield node as n, score as s1
call db.index.vector.queryNodes("moviePlotsEmbedding",5, n.plotEmbedding) yield node as movie, score
match (person:Person)-[r1]->(movie)-[r2:IN_GENRE]->(genre)
return *

Here is the example of how it should have worked:

// doesn't work, nodes are disconnected from relationships
call db.index.fulltext.queryNodes("movieFulltext","Forrest Gump", {limit:1}) yield node as n, score as s1
call db.index.vector.queryNodes("moviePlotsEmbedding",5, n.plotEmbedding) yield node as movie, score
match (person:Person)-[r1]->(movie)-[r2:IN_GENRE]->(genre)

return genre, r1, r2, apoc.create.virtual.fromNode(movie, ["title","released","url"]) as movie,
 apoc.create.virtual.fromNode(person, ["name","born","url"]) as person

and what you actually have to do now to make it work

// 10ms
call db.index.fulltext.queryNodes("movieFulltext","Forrest Gump", {limit:1}) yield node as n, score as s1
call db.index.vector.queryNodes("moviePlotsEmbedding",5, n.plotEmbedding) yield node as movie, score
match path = (person:Person)-[rp]->(movie)-[rg:IN_GENRE]->(genre)

with collect(path) as paths
call apoc.graph.fromPaths(paths,"results",{}) yield graph
with graph.nodes as nodes, graph.relationships as rels
with rels, apoc.map.fromPairs([n in nodes | [coalesce(n.tmdbId, n.name), apoc.create.vNode(labels(n),n {.*, plotEmbedding:null, posterEmbedding:null, plot:null, bio:null })]]) as nodes
return nodes, [r in rels | apoc.create.vRelationship(nodes[coalesce(startNode(r).tmdbId,startNode(r).name)], type(r), properties(r), nodes[coalesce(endNode(r).tmdbId,endNode(r).name)])] as rels
Lojjs commented 8 months ago

@gem-neo4j did some digging for me and it seems like this was implemented with positive node ids earlier but changed because of a bug where changes to the virtual node were updating the real one: https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/2252

jexp commented 8 months ago

Yeah but that was intentional that it is a wrapper of the real node so all operations on it should pass through. Perhaps the naming with "virtual" was not that good it should have been "wrapped".