monarch-initiative / monarch-semantic-similarity-profiles

MIT License

Phenotypic similarity based on g2d, g2p _across_ SSPOs #15

Closed souzadevinicius closed 6 months ago

souzadevinicius commented 9 months ago

Right now we only pull in gene associations from HPOA, which means we only have HP->Gene associations. However, to facilitate gene-level semantic similarity between HP and MP, we also need MP->Gene associations and gene orthology relations.

@kevinschaper can you help us here wrt.

How would you compare two phenotypic profiles, one MP, one HP, only along their p2g associations?

matentzn commented 8 months ago

@souzadevinicius I updated the text for this issue!

matentzn commented 8 months ago

cc @kevinschaper

kevinschaper commented 8 months ago

I'm working on subsetted tsv output this week, so soon I'll have an "all gene to phenotype" file or a "just mouse gene to phenotype" file.

I don't know how it would fit into your pipeline, but the phenio.db that's distributed with monarch-kg builds has all of the phenotype associations populated in the term_association table.

It also might be practical (if ugly) to grep from the tsv:

zcat monarch-kg-denormalized-edges.tsv.gz | grep -e '^id' -e 'has_phenotype' | grep 'MGI:' | grep 'MP:'

or a bit more structured by using the sqlite artifact:

sqlite3 monarch-kg.db -cmd ".mode tabs" -cmd ".headers on"  "select * from denormalized_edges where subject_taxon = 'NCBITaxon:10090' and predicate = 'biolink:has_phenotype'" > mgi_g2p.tsv
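
The same query could also be run from Python with the standard-library `sqlite3` module. This is only a sketch: it assumes the `denormalized_edges` table and the `subject_taxon` and `predicate` columns used in the command above. The `subject` and `object` columns, the helper name `fetch_g2p`, and the demo rows are made up for illustration so that the snippet runs without the real `monarch-kg.db`.

```python
import sqlite3


def fetch_g2p(conn, taxon="NCBITaxon:10090", predicate="biolink:has_phenotype"):
    """Return all denormalized edges for one taxon and predicate as sqlite3.Row objects."""
    conn.row_factory = sqlite3.Row
    query = (
        "SELECT * FROM denormalized_edges "
        "WHERE subject_taxon = ? AND predicate = ?"
    )
    return conn.execute(query, (taxon, predicate)).fetchall()


# Demo on an in-memory database with made-up rows (hypothetical columns and IDs).
# Against a real release you would use sqlite3.connect("monarch-kg.db") instead.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE denormalized_edges "
    "(subject TEXT, predicate TEXT, object TEXT, subject_taxon TEXT)"
)
demo.executemany(
    "INSERT INTO denormalized_edges VALUES (?, ?, ?, ?)",
    [
        ("MGI:0000001", "biolink:has_phenotype", "MP:0000001", "NCBITaxon:10090"),
        ("HGNC:0000001", "biolink:has_phenotype", "HP:0000001", "NCBITaxon:9606"),
        ("MGI:0000002", "biolink:interacts_with", "MGI:0000001", "NCBITaxon:10090"),
    ],
)

mouse_g2p = fetch_g2p(demo)
for row in mouse_g2p:
    print(row["subject"], row["predicate"], row["object"])
```

Using bound parameters instead of string formatting keeps the taxon and predicate easy to swap, for example to pull human associations with `taxon="NCBITaxon:9606"`.
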

matentzn commented 8 months ago

@kevinschaper how did the phenio build end up with all of those associations? is this phenio + all of monarch KG? Any documentation (or issue) about this, and a location where to get it?

I also learned something else today from @cmungall. @souzadevinicius we should typically not use Jaccard for comparing phenotypic profiles with gene associations. It seems that we should be using the "Resnik score" only, which somehow (the details are hazy) takes these scores into account by leveraging Information Content. Can you, for our next meeting:

  1. Understand how semsimian computes Resnik (formula, how it works, with an example)
  2. Check whether the Resnik scores look "sane" when using semsimian
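
For reference, the textbook definition of Resnik similarity can be sketched on a toy example. This is not taken from semsimian and may differ from its implementation in detail; the ontology, the gene annotations, and all function names below are hypothetical. The idea is that IC(t) = -log2(p(t)), where p(t) is the fraction of annotated entities associated with t or one of its descendants, and the Resnik score of two terms is the IC of their most informative common ancestor.

```python
import math

# Toy ontology (hypothetical term names): child -> list of direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}

# Toy association corpus (hypothetical): entity -> directly annotated terms.
ANNOTATIONS = {
    "g1": {"A1"},
    "g2": {"A2"},
    "g3": {"B"},
    "g4": {"B"},
}


def ancestors(term):
    """Reflexive ancestor set: the term itself plus everything above it."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log2(p(t)); p(t) is the share of entities annotated to t or a descendant."""
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        closure = set().union(*(ancestors(t) for t in terms))
        for term in closure:
            counts[term] += 1
    total = len(ANNOTATIONS)
    # log2(total / n) equals -log2(n / total); unannotated terms are skipped.
    return {t: math.log2(total / n) for t, n in counts.items() if n > 0}


def resnik(a, b, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(ic.get(t, 0.0) for t in common)


ic = information_content()
print(resnik("A1", "A2", ic))  # 1.0: the shared ancestor "A" covers 2 of 4 entities
print(resnik("A1", "B", ic))   # 0.0: only "root" is shared, and it covers everything
```

A rough sanity check along these lines: terms that only share the root should score 0, and a term compared with itself should score its own IC.
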

kevinschaper commented 8 months ago

Sorry, not great documentation. Here is the issue and where it happens in the code: https://github.com/monarch-initiative/monarch-ingest/blob/d65456eb3667a47d960e21f766b1ec65f1b4f774/scripts/load_sqlite.sh#L36

Semsimian uses phenio.db directly, and the semantic sql schema has a table for term associations, so it made sense to supply the associations that way...except that it's obviously a bit weird and circular to populate monarch-kg with phenio and then populate phenio with associations from monarch-kg. So the solution we have right now is that a kg release includes a phenio.db with these extra associations.

kevinschaper commented 8 months ago

Oh, kg artifacts available at: http://data.monarchinitiative.org/monarch-kg/latest/ (purl still to come, of course...)

matentzn commented 8 months ago

@souzadevinicius can you try to use phenio.db as located in the URL @kevinschaper provided?

cmungall commented 8 months ago

To clarify: I didn't intend to suggest not using Jaccard. I just pointed out that Jaccard between a pair of phenotype terms is computed using the ontology only; the associations make no difference.

IC can be calculated with different corpora: just the ontology, or ontology + associations.
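
A small sketch may make the distinction concrete. It assumes Jaccard is taken over the ancestor sets of the two terms, and it uses one possible reading of "just the ontology" as a corpus in which every term counts once; the toy ontology, the gene annotations, and the function names are all hypothetical and not taken from any particular tool.

```python
import math

# Toy ontology (hypothetical term names): child -> list of direct parents.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}


def ancestors(term):
    """Reflexive ancestor set: the term itself plus everything above it."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard(a, b):
    """Jaccard over ancestor sets; note that no association data is involved."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def ic_from_corpus(corpus):
    """IC per term; `corpus` maps each item to the terms it is directly annotated with."""
    counts = {}
    for terms in corpus.values():
        for term in set().union(*(ancestors(t) for t in terms)):
            counts[term] = counts.get(term, 0) + 1
    return {t: math.log2(len(corpus) / n) for t, n in counts.items()}


# Corpus 1: ontology only, each term counts as one occurrence of itself.
ontology_corpus = {term: {term} for term in PARENTS}
# Corpus 2: ontology + associations (hypothetical gene annotations).
association_corpus = {"g1": {"A1"}, "g2": {"A1"}, "g3": {"A1"}, "g4": {"B"}}

ic_ontology = ic_from_corpus(ontology_corpus)
ic_assoc = ic_from_corpus(association_corpus)

print(jaccard("A1", "A2"))  # 0.5 regardless of which corpus is used for IC
print(ic_ontology["A"], ic_assoc["A"])  # the IC of "A" depends on the corpus
```

Swapping the corpus changes the IC values (and therefore any IC-based score such as Resnik), while the Jaccard value stays fixed because it only looks at the ontology structure.
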

matentzn commented 6 months ago

This issue has been resolved - we now understand that the gene associations were meant to affect only the IC score, and not the Jaccard score. Thanks everyone for your help!