Data Sources Stats and Figure

amc-corey-cox commented 9 months ago

We'd like to recapitulate this figure, at least the stats: [image: Screenshot 2024-02-21 at 10 23 34 AM]

This will likely require some re-tooling of the reports we use to generate the site. We may also need to re-architect the site a bit to allow for different QC/Stat views.

amc-corey-cox commented 9 months ago

Here is some additional context. Kevin was able to generate this table with the query below:

select category, namespace, count(*) from denormalized_nodes where category in ('biolink:Gene','biolink:Pathway') group by 1,2 having count(*) > 1 order by 1,2;
┌─────────────────┬───────────┬──────────────┐
│    category     │ namespace │ count_star() │
│     varchar     │  varchar  │    int64     │
├─────────────────┼───────────┼──────────────┤
│ biolink:Gene    │ FB        │        30284 │
│ biolink:Gene    │ HGNC      │        43840 │
│ biolink:Gene    │ MGI       │        79680 │
│ biolink:Gene    │ NCBIGene  │       196312 │
│ biolink:Gene    │ PomBase   │         5134 │
│ biolink:Gene    │ RGD       │        57146 │
│ biolink:Gene    │ SGD       │         7153 │
│ biolink:Gene    │ WB        │        48779 │
│ biolink:Gene    │ Xenbase   │        38732 │
│ biolink:Gene    │ ZFIN      │        38000 │
│ biolink:Gene    │ dictyBase │        14222 │
│ biolink:Pathway │ GO        │          645 │
│ biolink:Pathway │ Reactome  │        21441 │
├─────────────────┴───────────┴──────────────┤
│ 13 rows                          3 columns │
└────────────────────────────────────────────┘
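
For anyone who wants to try the shape of this query without the full database, here is a minimal sketch. It assumes a toy `denormalized_nodes` table with only `id`, `category`, and `namespace` columns, and it uses Python's built-in `sqlite3` as a stand-in for whatever engine produced the table above. The rows and the `node_counts` helper are made up for illustration, and the `group by 1,2` shorthand is spelled out with column names.

```python
import sqlite3

# Toy stand-in for denormalized_nodes; these rows are invented for illustration.
ROWS = [
    ("HGNC:1", "biolink:Gene", "HGNC"),
    ("HGNC:2", "biolink:Gene", "HGNC"),
    ("MGI:1", "biolink:Gene", "MGI"),
    ("MGI:2", "biolink:Gene", "MGI"),
    ("MGI:3", "biolink:Gene", "MGI"),
    ("Reactome:R-1", "biolink:Pathway", "Reactome"),
    ("Reactome:R-2", "biolink:Pathway", "Reactome"),
    ("GO:1", "biolink:Pathway", "GO"),        # only one row, so HAVING drops it
    ("MONDO:1", "biolink:Disease", "MONDO"),  # category is outside the filter
    ("MONDO:2", "biolink:Disease", "MONDO"),
]

# Same shape as the query above, with explicit column names in GROUP BY / ORDER BY.
NODE_QUERY = """
select category, namespace, count(*)
from denormalized_nodes
where category in ('biolink:Gene', 'biolink:Pathway')
group by category, namespace
having count(*) > 1
order by category, namespace
"""


def node_counts(node_rows):
    """Load rows into an in-memory table and return (category, namespace, count) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table denormalized_nodes (id text, category text, namespace text)"
    )
    conn.executemany("insert into denormalized_nodes values (?, ?, ?)", node_rows)
    result = conn.execute(NODE_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for category, namespace, n in node_counts(ROWS):
        print(category, namespace, n)
```

Note that `having count(*) > 1` silently drops any namespace with a single node in a category, which is worth keeping in mind when comparing totals against the figure.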

Also from Kevin:

Part of why the edge counts get really weird and gross is that we sometimes name the primary source and sometimes name the aggregator
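
To make that concrete, here is a rough sketch of how mixing the two labels can skew per-source counts. Everything in it is an assumption for illustration: the `primary` and `aggregator` keys, the `infores:example-aggregator` name, and the rule used in `count_by_mixed_label` are hypothetical and are not how the ingests actually behave.

```python
from collections import Counter

# Hypothetical edges. "aggregator" is None when the edge came straight from
# the primary source; "infores:example-aggregator" is a made-up name.
EDGES = [
    {"primary": "infores:zfin", "aggregator": None},
    {"primary": "infores:zfin", "aggregator": "infores:example-aggregator"},
    {"primary": "infores:mgi", "aggregator": "infores:example-aggregator"},
    {"primary": "infores:mgi", "aggregator": None},
    {"primary": "infores:mgi", "aggregator": None},
]


def count_by_primary(edges):
    """Count edges consistently by their primary source."""
    return Counter(edge["primary"] for edge in edges)


def count_by_mixed_label(edges):
    """Mimic the inconsistency: name the aggregator when present, else the primary."""
    return Counter(edge["aggregator"] or edge["primary"] for edge in edges)


if __name__ == "__main__":
    print("by primary:", dict(count_by_primary(EDGES)))
    print("mixed     :", dict(count_by_mixed_label(EDGES)))
```

With the mixed labelling, some edges move out of their primary source's bucket and into the aggregator's, so the per-source totals no longer match what each source actually contributed.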

And here is another query and table:

select primary_knowledge_source, count(*) from denormalized_edges where category not in ('biolink:Association','biolink:MacromolecularMachineToMolecularActivityAssociation', 'biolink:MacromolecularMachineToCellularComponentAssociation','biolink:MacromolecularMachineToBiologicalProcessAssociation') group by 1 having count(*) > 1 order by
┌──────────────────────────┬──────────────┐
│ primary_knowledge_source │ count_star() │
│         varchar          │    int64     │
├──────────────────────────┼──────────────┤
│ infores:bgee             │       436170 │
│ infores:biogrid          │      1336609 │
│ infores:flybase          │       407615 │
│ infores:hpo-annotations  │       554449 │
│ infores:mgi              │      1066490 │
│ infores:omim             │         7258 │
│ infores:orphanet         │         7997 │
│ infores:panther          │       551383 │
│ infores:pombase          │       168073 │
│ infores:reactome         │       251408 │
│ infores:rgd              │         9696 │
│ infores:sgd              │        16732 │
│ infores:string           │      1422026 │
│ infores:wormbase         │       130283 │
│ infores:xenbase          │         2232 │
│ infores:zfin             │       666695 │
├──────────────────────────┴──────────────┤
│ 16 rows                       2 columns │
└─────────────────────────────────────────┘

kevinschaper commented 9 months ago

It turns out that we actually get clean tables with namespace/prefix for nodes and primary_knowledge_source for edges as long as we're filtering to these categories.
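
As a small illustration of the namespace/prefix idea, the sketch below derives the prefix from a CURIE-style node id and tallies nodes per prefix. It assumes ids look like `PREFIX:local_id`; the `namespace` and `count_by_namespace` helpers and the sample ids are hypothetical and are not taken from the actual pipeline.

```python
from collections import Counter


def namespace(curie):
    """Return the prefix of a CURIE-style id, e.g. 'HGNC:1100' -> 'HGNC'."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE-style id: {curie!r}")
    return prefix


def count_by_namespace(node_ids):
    """Tally node ids by their namespace prefix."""
    return Counter(namespace(node_id) for node_id in node_ids)


if __name__ == "__main__":
    # Sample ids are invented for illustration.
    sample_ids = ["HGNC:1100", "HGNC:1101", "MGI:88276", "Reactome:R-HSA-109581"]
    for prefix, n in sorted(count_by_namespace(sample_ids).items()):
        print(prefix, n)
```

Splitting only on the first colon keeps local ids that contain colons intact.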

monicacecilia commented 9 months ago

And this one, please!

amc-corey-cox commented 9 months ago

Here is another one.

monicacecilia commented 9 months ago

sorry about the dupe! 🙈

monarch-initiative / monarch-qc

Data Sources Stats and Figure #71