utk-se / WorldSyntaxTree

Language-agnostic parsing of World of Code repositories

Combine two batch queries for node relationship creation #11

Closed robobenklein closed 3 years ago

robobenklein commented 3 years ago

Currently the batch insert uses two queries, so that Python can keep a mapping from the non-unique preorder index to the unique node ID between them. I feel like there should be some way to carry the results of one UNWIND into another within the same query.

import neo4j

# milliseconds timeout
@neo4j.unit_of_work(timeout=30 * 60 * 1000)
def managed_batch_insert(tx, entries: list, order_to_id_o: dict) -> dict:
    order_to_id = order_to_id_o.copy() # stay pure until tx successful
    qi = """
    unwind $entries as data
    match (f:WSTFile)
    where id(f) = data.fileid
    create (f)<-[:IN_FILE]-(c:WSTNode {
        x1: data.x1, x2: data.x2, y1: data.y1, y2: data.y2,
        named: data.named, type: data.type, preorder: data.preorder
    })-[:CONTENT]->(t:WSTText {
        length: data.textlength,
        text: data.text
    })
    return c.preorder as preorder, id(c) as cid, data.parentorder as parentorder order by c.preorder
    """
    nresults = tx.run(qi, {"entries": entries})

    nvals = nresults.data()

    # use a dict for constant-time order to nodeid:
    # create a list of all the parent connections we need to make:
    rp_list = []
    for v in nvals:
        order_to_id[v['preorder']] = v['cid']
        rp_list.append({
            "cid": v['cid'],
            "pid": order_to_id[v['parentorder']],
        })

    qr = """
    unwind $connectlist as ctpi
    match (c:WSTNode), (p:WSTNode)
    where id(c) = ctpi.cid and id(p) = ctpi.pid
    create (c)-[r:PARENT]->(p)
    return id(c) as cid, id(p) as pid, id(r) as rid order by c.preorder
    """
    rresults = tx.run(qr, {"connectlist": rp_list})

    # fetch all records so the relationship query is fully consumed
    rresults.data()
    return order_to_id
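
The step between the two queries above is plain Python, so it can be looked at in isolation. The following is a minimal sketch of just that mapping step, not the project's actual code: `link_parents`, `rows`, and `known` are hypothetical names, and the sample rows and node IDs are made up. It assumes the rows arrive sorted by preorder, as the `order by c.preorder` in the first query suggests, so that every parent is already in the mapping before its children are processed.

```python
from typing import Dict, List, Tuple


def link_parents(
    rows: List[dict], known: Dict[int, int]
) -> Tuple[Dict[int, int], List[dict]]:
    """Map preorder -> node ID and build the child/parent ID pairs.

    `rows` stands in for the records returned by the first query and is
    assumed to be sorted by preorder, so parents come before children.
    """
    order_to_id = known.copy()  # leave the caller's mapping untouched
    pairs = []
    for row in rows:
        order_to_id[row["preorder"]] = row["cid"]
        pairs.append({
            "cid": row["cid"],
            "pid": order_to_id[row["parentorder"]],
        })
    return order_to_id, pairs


# Made-up data: the root (preorder 0) was inserted by an earlier batch.
known = {0: 99}
rows = [
    {"preorder": 1, "cid": 100, "parentorder": 0},
    {"preorder": 2, "cid": 101, "parentorder": 1},
    {"preorder": 3, "cid": 102, "parentorder": 1},
]
order_to_id, pairs = link_parents(rows, known)
```

Each row costs one dictionary write and one dictionary lookup, so the work on the Python side grows linearly with the batch size.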

Challenges that are blocking me:

robobenklein commented 3 years ago

@jexp I saw your comment on my post (thanks!), what are your thoughts here?

aneeskA commented 3 years ago

@robobenklein I was reading up on the challenges of using Neo4j and landed on your blog post and this issue. Could you please share why this was closed?

robobenklein commented 3 years ago

Originally it was likely because I could not find a better solution than using two queries. The primary load on the system was the database, while the collector (Python) had plenty of CPU time available, so doing the mapping in Python cost very little compared to the extra time the database would have spent on the unwind. We were already hitting the limits of Neo4j's performance at the time.

If I were to close this Issue today it would likely be for the following reasons:

aneeskA commented 3 years ago

@robobenklein Thank you so much for this. Much appreciated.