Open amc-corey-cox opened 9 months ago
Here is some additional context. Kevin was able to generate this table with the query below:
select category, namespace, count(*) from denormalized_nodes where category in ('biolink:Gene','biolink:Pathway') group by 1,2 having count(*) > 1 order by 1,2;
┌─────────────────┬───────────┬──────────────┐
│ category │ namespace │ count_star() │
│ varchar │ varchar │ int64 │
├─────────────────┼───────────┼──────────────┤
│ biolink:Gene │ FB │ 30284 │
│ biolink:Gene │ HGNC │ 43840 │
│ biolink:Gene │ MGI │ 79680 │
│ biolink:Gene │ NCBIGene │ 196312 │
│ biolink:Gene │ PomBase │ 5134 │
│ biolink:Gene │ RGD │ 57146 │
│ biolink:Gene │ SGD │ 7153 │
│ biolink:Gene │ WB │ 48779 │
│ biolink:Gene │ Xenbase │ 38732 │
│ biolink:Gene │ ZFIN │ 38000 │
│ biolink:Gene │ dictyBase │ 14222 │
│ biolink:Pathway │ GO │ 645 │
│ biolink:Pathway │ Reactome │ 21441 │
├─────────────────┴───────────┴──────────────┤
│ 13 rows 3 columns │
└────────────────────────────────────────────┘
Also from Kevin:
Part of why the edge counts get really weird and gross is that we sometimes name the primary source and sometimes name the aggregator
And here is another query and table:
select primary_knowledge_source, count(*) from denormalized_edges where category not in ('biolink:Association','biolink:MacromolecularMachineToMolecularActivityAssociation', 'biolink:MacromolecularMachineToCellularComponentAssociation','biolink:MacromolecularMachineToBiologicalProcessAssociation') group by 1 having count(*) > 1 order by
┌──────────────────────────┬──────────────┐
│ primary_knowledge_source │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ infores:bgee │ 436170 │
│ infores:biogrid │ 1336609 │
│ infores:flybase │ 407615 │
│ infores:hpo-annotations │ 554449 │
│ infores:mgi │ 1066490 │
│ infores:omim │ 7258 │
│ infores:orphanet │ 7997 │
│ infores:panther │ 551383 │
│ infores:pombase │ 168073 │
│ infores:reactome │ 251408 │
│ infores:rgd │ 9696 │
│ infores:sgd │ 16732 │
│ infores:string │ 1422026 │
│ infores:wormbase │ 130283 │
│ infores:xenbase │ 2232 │
│ infores:zfin │ 666695 │
├──────────────────────────┴──────────────┤
│ 16 rows 2 columns │
└─────────────────────────────────────────┘
It turns out that we actually get clean tables with namespace/prefix for nodes and primary_knowledge_source for edges as long as we're filtering to these categories.
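For anyone who wants to try the two groupings without the full database, here is a rough sketch of the same idea. It is not how the reports are built today: it uses Python's built-in `sqlite3` in place of DuckDB, the rows are made-up toy data, and `build_toy_db` and `summarize` are hypothetical helper names. Only the table names, column names, and category filters come from the queries above.

```python
import sqlite3

# Edge categories excluded in the second query above.
EXCLUDED_EDGE_CATEGORIES = (
    "biolink:Association",
    "biolink:MacromolecularMachineToMolecularActivityAssociation",
    "biolink:MacromolecularMachineToCellularComponentAssociation",
    "biolink:MacromolecularMachineToBiologicalProcessAssociation",
)


def build_toy_db():
    """Create an in-memory database with a few made-up rows (assumption: toy data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE denormalized_nodes (id TEXT, category TEXT, namespace TEXT)")
    conn.execute("CREATE TABLE denormalized_edges (category TEXT, primary_knowledge_source TEXT)")
    conn.executemany(
        "INSERT INTO denormalized_nodes VALUES (?, ?, ?)",
        [
            ("HGNC:1", "biolink:Gene", "HGNC"),
            ("HGNC:2", "biolink:Gene", "HGNC"),
            ("MGI:1", "biolink:Gene", "MGI"),
            ("MGI:2", "biolink:Gene", "MGI"),
            ("MGI:3", "biolink:Gene", "MGI"),
            ("Reactome:R-1", "biolink:Pathway", "Reactome"),
            ("Reactome:R-2", "biolink:Pathway", "Reactome"),
            ("GO:1", "biolink:Pathway", "GO"),      # single row, dropped by HAVING
            ("MONDO:1", "biolink:Disease", "MONDO"),  # category not in the filter
            ("MONDO:2", "biolink:Disease", "MONDO"),
        ],
    )
    conn.executemany(
        "INSERT INTO denormalized_edges VALUES (?, ?)",
        [
            ("biolink:GeneToPhenotypicFeatureAssociation", "infores:hpo-annotations"),
            ("biolink:GeneToPhenotypicFeatureAssociation", "infores:hpo-annotations"),
            ("biolink:PairwiseGeneToGeneInteraction", "infores:string"),
            ("biolink:PairwiseGeneToGeneInteraction", "infores:string"),
            ("biolink:PairwiseGeneToGeneInteraction", "infores:string"),
            ("biolink:PairwiseGeneToGeneInteraction", "infores:biogrid"),  # single row
            ("biolink:Association", "infores:example-aggregator"),  # excluded category
            ("biolink:Association", "infores:example-aggregator"),
        ],
    )
    return conn


def summarize(conn):
    """Return (node_counts, edge_counts) using the same filters as the queries above."""
    node_counts = conn.execute(
        """
        SELECT category, namespace, COUNT(*)
        FROM denormalized_nodes
        WHERE category IN ('biolink:Gene', 'biolink:Pathway')
        GROUP BY category, namespace
        HAVING COUNT(*) > 1
        ORDER BY category, namespace
        """
    ).fetchall()
    placeholders = ", ".join("?" for _ in EXCLUDED_EDGE_CATEGORIES)
    edge_counts = conn.execute(
        f"""
        SELECT primary_knowledge_source, COUNT(*)
        FROM denormalized_edges
        WHERE category NOT IN ({placeholders})
        GROUP BY primary_knowledge_source
        HAVING COUNT(*) > 1
        ORDER BY primary_knowledge_source
        """,
        EXCLUDED_EDGE_CATEGORIES,
    ).fetchall()
    return node_counts, edge_counts


if __name__ == "__main__":
    nodes, edges = summarize(build_toy_db())
    for row in nodes:
        print(*row, sep="\t")
    for row in edges:
        print(*row, sep="\t")
```

Note that `having count(*) > 1` silently drops any namespace or source with a single row, which may or may not be what we want in the final stats.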
And this one, please!
Here is another one.
sorry about the dupe! 🙈
We'd like to recapitulate this figure, at least the stats:
This will likely require some re-tooling of the reports we use to generate the site. We may also need to re-architect the site a bit to allow for different QC/Stat views.