QC Stats on merged kgx files

putmantime commented 2 years ago

We would like to report the following metrics:

[ ] CURIE prefixes that we have no nodes for
[ ] Unique namespaces (set of CURIE prefixes in our graph)
[ ] Counts: association type, triple type, node category
[ ] Per-ingest stats: counts by category, list of prefixes, biolink schema
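
As a rough illustration of the first three metrics, here is a minimal sketch using only the Python standard library. It assumes KGX-style TSV files in which nodes have `id` and `category` columns and edges have `subject`, `predicate`, `object`, and `category` columns, with a single category per row. The function names and the sample rows are hypothetical and are not taken from the monarch-ingest code.

```python
import csv
import io
from collections import Counter


def curie_prefix(curie):
    """Return the part of a CURIE before the first colon, e.g. 'HGNC' for 'HGNC:1'."""
    return curie.split(":", 1)[0]


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def graph_metrics(nodes, edges):
    """Compute prefix and count metrics from lists of node and edge rows."""
    node_prefixes = {curie_prefix(n["id"]) for n in nodes if n["id"]}

    # Prefixes that appear as an edge subject or object.
    edge_prefixes = set()
    for e in edges:
        for end in ("subject", "object"):
            if e[end]:
                edge_prefixes.add(curie_prefix(e[end]))

    # Triple type = (subject category, predicate, object category).
    # Endpoints with no matching node row are labeled "MISSING".
    category_of = {n["id"]: n["category"] for n in nodes}
    triple_counts = Counter(
        (
            category_of.get(e["subject"], "MISSING"),
            e["predicate"],
            category_of.get(e["object"], "MISSING"),
        )
        for e in edges
    )

    return {
        "prefixes_without_nodes": sorted(edge_prefixes - node_prefixes),
        "namespaces": sorted(node_prefixes | edge_prefixes),
        "node_category_counts": Counter(n["category"] for n in nodes),
        "association_type_counts": Counter(e["category"] for e in edges),
        "triple_type_counts": triple_counts,
    }


# Tiny made-up sample data, for illustration only.
NODES_TSV = (
    "id\tcategory\n"
    "HGNC:1\tbiolink:Gene\n"
    "MONDO:1\tbiolink:Disease\n"
)
EDGES_TSV = (
    "subject\tpredicate\tobject\tcategory\n"
    "HGNC:1\tbiolink:gene_associated_with_condition\tMONDO:1\tbiolink:GeneToDiseaseAssociation\n"
    "HGNC:1\tbiolink:has_phenotype\tHP:1\tbiolink:GeneToPhenotypicFeatureAssociation\n"
)

if __name__ == "__main__":
    metrics = graph_metrics(
        read_tsv(io.StringIO(NODES_TSV)), read_tsv(io.StringIO(EDGES_TSV))
    )
    for name, value in metrics.items():
        print(name, value)
```

Running the same function once before and once after the ontology merge would cover both scenarios. Multi-valued `category` fields would need to be split first, which this sketch does not attempt.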

Two scenarios:

Before we merge in ontologies
After we merge in ontologies

The deliverable is a Jupyter notebook that pulls in the nodes and edges files and reports the above metrics. It needs to run on the most recent dated directory. This data lives in the monarch-ingest Google bucket.
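
Picking the most recent dated directory could look something like the sketch below. It assumes the directory names are ISO dates (`YYYY-MM-DD`), which this thread does not confirm, and it takes the names as a plain list rather than showing how to list the bucket. The function name `latest_dated_directory` is hypothetical.

```python
from datetime import date


def latest_dated_directory(names):
    """Return the name with the latest ISO date, ignoring names that are not dates."""
    dated = []
    for name in names:
        try:
            # Tolerate a trailing slash, as bucket listings often include one.
            dated.append((date.fromisoformat(name.strip("/")), name))
        except ValueError:
            # Not a date-named directory (for example "latest/"); skip it.
            continue
    if not dated:
        raise ValueError("no date-named directories found")
    return max(dated)[1]


if __name__ == "__main__":
    # Made-up directory names, for illustration only.
    print(latest_dated_directory(["2022-01-05/", "latest/", "2022-03-10/"]))
```

Parsing the names as dates instead of sorting them as strings keeps the choice correct even if non-date directories sit next to the dated ones.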

Let's look at a Google Colab notebook for developing this.

kevinschaper commented 2 years ago

As a first pass, let's make a file with these columns

edge_file_name, count of total edges, count of edges with missing subject or object
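
A minimal sketch of how one row of that file might be computed, using only the standard library. It assumes "missing" means the subject or object ID is empty or does not appear in the set of known node IDs. The function name `edge_file_report`, the output keys, and the sample data are hypothetical.

```python
import csv
import io


def edge_file_report(edge_file_name, edges_handle, node_ids):
    """Count total edges and edges whose subject or object is not a known node."""
    total = 0
    missing = 0
    for row in csv.DictReader(edges_handle, delimiter="\t"):
        total += 1
        # An empty value is never in node_ids, so blank endpoints count as missing too.
        if row["subject"] not in node_ids or row["object"] not in node_ids:
            missing += 1
    return {
        "edge_file_name": edge_file_name,
        "total_edges": total,
        "edges_missing_subject_or_object": missing,
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample_edges = (
        "subject\tpredicate\tobject\n"
        "HGNC:1\tbiolink:has_phenotype\tHP:1\n"
        "HGNC:1\tbiolink:has_phenotype\tHP:2\n"
        "\tbiolink:has_phenotype\tHP:1\n"
    )
    known_nodes = {"HGNC:1", "HP:1"}
    print(edge_file_report("example_edges.tsv", io.StringIO(sample_edges), known_nodes))
```

Collecting one such dict per edge file and writing them with `csv.DictWriter` would produce the report file with the three columns listed above.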

kevinschaper commented 2 years ago

I added myself to this. I'm working in the PR to connect the report to the rest of the code.

victoriasoesanto commented 2 years ago

@kevinschaper I have filled out the functions that we wrote last week, but I have not updated the PR. Would you like me to update it before you connect the rest of the code?

monarch-initiative / monarch-ingest

QC Stats on merged kgx files #211