microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Dagster and referential integrity notebook generate `alldocs` collection in different ways #696

Open eecavanna opened 1 month ago

eecavanna commented 1 month ago

There is some code in the referential integrity notebook that looks similar to your first commit, but it batches the write operations to MongoDB. It is fairly performant (~30 sec with functional annotation) and uses LinkML's native `SchemaView` to get all class ancestors for each doc.

for coll_name in collection_names:
    pbar.set_description(f"processing {coll_name}...")
    requests = []
    for doc in mdb[coll_name].find():
        doc_type = doc_cls(doc, coll_name=coll_name)
        slots_to_include = ["id"] + document_reference_ranged_slots[doc_type]
        new_doc = pick(slots_to_include, doc)
        new_doc["type"] = schema_view.class_ancestors(doc_type)
        requests.append(InsertOne(new_doc))
        if len(requests) == 1000: # ensure bulk-write batches aren't too huge
            result = mdb.alldocs.bulk_write(requests, ordered=False)
            pbar.update(result.inserted_count)
            requests.clear()
    if len(requests) > 0:
        result = mdb.alldocs.bulk_write(requests, ordered=False)
        pbar.update(result.inserted_count)
pbar.close()

Could you use this logic instead? It is fairly similar to your first commit.

Originally posted by @PeopleMakeCulture in https://github.com/microbiomedata/nmdc-runtime/issues/694#issuecomment-2364123504

eecavanna commented 1 week ago

As discussed on Slack, I'm thinking of there being a helper function (a.k.a. a "util" function) that generates the `alldocs` collection (I called this function `compile_alldocs_collection` in the diagram below). The Dagster op and the bulk referential integrity validation notebook would then each import that helper function instead of having their own local implementation (which could fall out of sync with the other's over time).

image
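
To make the proposal concrete, here is a minimal sketch of what such a shared helper might look like. It is not the actual implementation: the signature, the parameter names (`docs_by_collection`, `doc_type_of`, `ancestors_of`, `reference_slots`, `write_batch`), and the choice to pass the database and schema lookups in as callables are all assumptions made for illustration. Only the function name and the batching logic come from the discussion above.

```python
from typing import Callable, Dict, Iterable, List


def compile_alldocs_collection(
    docs_by_collection: Dict[str, Iterable[dict]],
    doc_type_of: Callable[[dict, str], str],
    ancestors_of: Callable[[str], List[str]],
    reference_slots: Dict[str, List[str]],
    write_batch: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Hypothetical shared helper: build slimmed-down `alldocs` documents
    and pass them to `write_batch` in batches of at most `batch_size`.

    Returns the total number of documents handed to `write_batch`.
    """
    total = 0
    batch: List[dict] = []
    for coll_name, docs in docs_by_collection.items():
        for doc in docs:
            doc_type = doc_type_of(doc, coll_name)
            # Keep only `id` plus the slots that can reference other documents.
            slots_to_include = ["id"] + reference_slots.get(doc_type, [])
            new_doc = {k: doc[k] for k in slots_to_include if k in doc}
            # Store the class and all of its ancestors as the `type` value.
            new_doc["type"] = ancestors_of(doc_type)
            batch.append(new_doc)
            if len(batch) >= batch_size:  # keep bulk-write batches bounded
                write_batch(batch)
                total += len(batch)
                batch = []  # new list, so the caller's batch is not mutated
    if batch:  # flush whatever is left over
        write_batch(batch)
        total += len(batch)
    return total
```

Under these assumptions, both the Dagster op and the notebook would call the same function, with `ancestors_of` set to `schema_view.class_ancestors` and `write_batch` wrapping `mdb.alldocs.bulk_write([InsertOne(d) for d in batch], ordered=False)`. One difference from the notebook snippet: this sketch flushes the final partial batch once at the end instead of once per collection.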