neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Any way to accelerate exporting the embeddings from Node2Vec? #148

Closed kinpoon-sn closed 2 years ago

kinpoon-sn commented 2 years ago

Currently, I'm using node2vec to generate node embeddings through the Python client. The graph has about 1 billion edges and 0.1 billion nodes. From the log, I can see that the node2vec computation takes about 1 hour, but printing the results takes more than 10 hours. Is there any way to accelerate this?

query = '''
        CALL gds.beta.node2vec.stream(
          {
            nodeProjection: "*",
            relationshipProjection: "*",
            embeddingDimension: 128,
            randomSeed: 42,
            concurrency: 4,
            walkLength: 10,
            iterations: 3,
            relationshipProperties: ["weight"],
            relationshipWeightProperty: "weight"
          }
        )
        YIELD nodeId, embedding
        RETURN gds.util.asNode(nodeId).nodeId AS name, embedding, labels(gds.util.asNode(nodeId)) AS labels
    '''
with driver.session(database="neo4j") as session:
    for record in session.run(query):
        print(f'''{record['name']}\t{record['embedding']}''')

AliciaFrame commented 2 years ago

Hi @kinpoon-sn!

To speed up export, you could try:

1) Using a named graph plus mutate mode, and then exporting to CSV with gds.beta.graph.export.csv.
The process would look like:
    CALL gds.graph.create('my-graph', '*', '*')
    CALL gds.beta.node2vec.mutate('my-graph', {...})
    CALL gds.beta.graph.export.csv('my-graph', {exportName: 'neo4j-embeddings'})

or

2) Using the experimental Neo4j Arrow plugin for data transfer: https://github.com/neo4j-field/neo4j-arrow
neo4j-arrow is much faster than the Python driver for large data set transfers, but it's currently an experimental plugin under active development.
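
For anyone driving option 1 from the Python client, here is a minimal sketch of the project, mutate, export sequence. It assumes the GDS 1.x procedure names used in this thread, that gds.beta.node2vec.mutate is available in your GDS version, and that gds.export.location is set in neo4j.conf so the CSV export has somewhere to write. The helper names (build_export_steps, run_steps) and the 'embedding' property name are made up for illustration. The export files are written on the server side, so the embeddings should never have to pass through the driver record by record.

```python
from typing import Any, Dict, List, Tuple

# One step = (Cypher statement, parameters for that statement).
Step = Tuple[str, Dict[str, Any]]


def build_export_steps(graph_name: str, export_name: str,
                       node2vec_config: Dict[str, Any]) -> List[Step]:
    """Build the statements for: project graph -> node2vec mutate -> CSV export."""
    # Mutate mode stores the result in the in-memory graph under a property
    # name; 'embedding' is an illustrative default, not a required value.
    config = {"mutateProperty": "embedding", **node2vec_config}
    return [
        # Assumption: relationships carry a 'weight' property, as in the question.
        ("CALL gds.graph.create($graph, '*', '*', "
         "{relationshipProperties: 'weight'})",
         {"graph": graph_name}),
        ("CALL gds.beta.node2vec.mutate($graph, $config)",
         {"graph": graph_name, "config": config}),
        # Assumption: gds.export.location is configured on the server.
        ("CALL gds.beta.graph.export.csv($graph, {exportName: $export_name})",
         {"graph": graph_name, "export_name": export_name}),
    ]


def run_steps(session: Any, steps: List[Step]) -> None:
    """Run each statement in order, discarding its (small) result."""
    for query, params in steps:
        session.run(query, params).consume()
```

Usage would be along the lines of run_steps(session, build_export_steps("my-graph", "neo4j-embeddings", {"embeddingDimension": 128, "walkLength": 10, "iterations": 3, "relationshipWeightProperty": "weight"})) inside the same "with driver.session(...)" block as in the question.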

kinpoon-sn commented 2 years ago

@AliciaFrame BIG THANKS!!