Combine two batch queries for node relationship creation

robobenklein commented 3 years ago

Currently two queries are used so that I can keep a mapping from the non-unique preorder value to the unique node ID during insertion. I feel like there should be some way to carry results from a prior UNWIND into another one in the same query.

import neo4j

# transaction timeout: the Python driver expects seconds (here 30 minutes)
@neo4j.unit_of_work(timeout=30 * 60)
def managed_batch_insert(tx, entries: list, order_to_id_o: dict) -> dict:
    order_to_id = order_to_id_o.copy() # stay pure until tx successful
    qi = """
    unwind $entries as data
    match (f:WSTFile)
    where id(f) = data.fileid
    create (f)<-[:IN_FILE]-(c:WSTNode {
        x1: data.x1, x2: data.x2, y1: data.y1, y2: data.y2,
        named: data.named, type: data.type, preorder: data.preorder
    })-[:CONTENT]->(t:WSTText {
        length: data.textlength,
        text: data.text
    })
    return c.preorder as preorder, id(c) as cid, data.parentorder as parentorder order by c.preorder
    """
    nresults = tx.run(qi, {"entries": entries})

    nvals = nresults.data()

    # order_to_id gives constant-time lookup from preorder to node id;
    # rp_list collects every child -> parent connection we need to make:
    rp_list = []
    for v in nvals:
        order_to_id[v['preorder']] = v['cid']
        rp_list.append({
            "cid": v['cid'],
            "pid": order_to_id[v['parentorder']],
        })

    qr = """
    unwind $connectlist as ctpi
    match (c:WSTNode), (p:WSTNode)
    where id(c) = ctpi.cid and id(p) = ctpi.pid
    create (c)-[r:PARENT]->(p)
    return id(c) as cid, id(p) as pid, id(r) as rid order by c.preorder
    """
    rresults = tx.run(qr, {"connectlist": rp_list})

    rvals = rresults.data()
    return order_to_id

Challenges that are blocking me:

- WSTNodes in this batch could refer to nodes that were not created within this query (though they will always be in the same file).
- I'm unsure how to create a mapping between order and ID within the statement (does Cypher support modifying a dict between statements?).
- I want to avoid anything but constant-time insertion of a node, since we cannot scan 700TB of data for a search on each query.
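
For reference, the Python-side mapping that currently sits between the two queries can be sketched on its own, without a database. This is only a minimal sketch: `build_parent_links` is a hypothetical helper name, and skipping rows whose `parentorder` is `None` assumes that a file's root node has no parent, which the snippet above does not state.

```python
from typing import Dict, List, Optional, Tuple


def build_parent_links(
    rows: List[dict],
    known_ids: Dict[int, int],
) -> Tuple[Dict[int, int], List[dict]]:
    """Map preorder -> node id and collect child/parent id pairs.

    `rows` mimics the first query's output: dicts with 'preorder', 'cid'
    and 'parentorder'. `known_ids` holds mappings from earlier batches of
    the same file. Rows must be sorted by preorder so that a parent is
    always registered before its children.
    """
    order_to_id = dict(known_ids)  # copy, so the caller's dict is untouched
    links = []
    for row in rows:
        order_to_id[row["preorder"]] = row["cid"]
        parent_order: Optional[int] = row["parentorder"]
        if parent_order is None:
            # Assumption: the root node of a file has no parent.
            continue
        # Constant-time dict lookup; the parent may come from an earlier batch.
        links.append({"cid": row["cid"], "pid": order_to_id[parent_order]})
    return order_to_id, links


# An earlier batch already inserted preorder 0 (node id 100).
earlier = {0: 100}
batch = [
    {"preorder": 1, "cid": 101, "parentorder": 0},
    {"preorder": 2, "cid": 102, "parentorder": 1},
]
mapping, links = build_parent_links(batch, earlier)
```

Because the mapping is carried over from batch to batch, a parent created by an earlier query is found with a single dict lookup, which is what keeps the second query limited to lookups by node ID.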

robobenklein commented 3 years ago

@jexp I saw your comment on my post (thanks!), what are your thoughts here?

aneeskA commented 3 years ago

@robobenklein I was reading up on the challenges of using Neo4j and landed on your blog post and this issue. Could you please share why this was closed?

robobenklein commented 3 years ago

Originally it was likely closed because I could not find a better solution than using two queries. The primary load on the system was on the database, while the collector (Python) had plenty of CPU time available, so doing the mapping in Python cost very little compared to the extra time the database would have spent on the unwind. We were already hitting the limits of Neo4j's performance at the time.

If I were to close this Issue today it would likely be for the following reasons:

- I've restructured the collector to write its output to intermediate files, which allows batch / bulk imports into a DB instead of writing DB-specific queries for insertion.
- We stopped using Neo4j for the larger datasets, since we have not yet been able to get past the scalability problem (we are trying to store more than half a petabyte of graph data on a single system / storage array).
  - We are currently working with ArangoDB, but we are open to other options and their pros and cons: https://github.com/utk-se/WorldSyntaxTree/discussions/37
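
As a rough illustration of the intermediate-file idea, a collector could emit one CSV of nodes and one of parent edges. The sketch below is not the actual WorldSyntaxTree output format: the function name `write_bulk_files`, the column names, and the `file:preorder` key scheme are all assumptions made for the example.

```python
import csv
import io
from typing import Iterable, TextIO


def write_bulk_files(nodes: Iterable[dict], node_out: TextIO, edge_out: TextIO) -> int:
    """Write nodes and child -> parent edges as two CSV streams.

    Each node gets a deterministic key built from its file and preorder
    (an assumed scheme), so edges can be written without asking the
    database for generated IDs.
    """
    node_writer = csv.writer(node_out)
    edge_writer = csv.writer(edge_out)
    node_writer.writerow(["key", "type", "named", "text"])
    edge_writer.writerow(["child", "parent"])
    count = 0
    for n in nodes:
        key = f"{n['file']}:{n['preorder']}"
        node_writer.writerow([key, n["type"], n["named"], n["text"]])
        if n["parentorder"] is not None:  # assumed: root nodes have no parent
            edge_writer.writerow([key, f"{n['file']}:{n['parentorder']}"])
        count += 1
    return count


# In-memory buffers stand in for real files in this example.
node_buf, edge_buf = io.StringIO(), io.StringIO()
sample = [
    {"file": "a.py", "preorder": 0, "parentorder": None,
     "type": "module", "named": True, "text": ""},
    {"file": "a.py", "preorder": 1, "parentorder": 0,
     "type": "identifier", "named": True, "text": "x"},
]
written = write_bulk_files(sample, node_buf, edge_buf)
```

With keys known ahead of time, the order-to-ID mapping from the original two-query approach is no longer needed at insert time; whether a given database's bulk importer accepts this exact layout would have to be checked against its documentation.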

aneeskA commented 3 years ago

@robobenklein thank you so much for this. much appreciated.

utk-se / WorldSyntaxTree

Combine two batch queries for node relationship creation #11