monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org
15 stars 2 forks source link

Duplicate edges in merged KG #420

Closed kevinschaper closed 1 year ago

kevinschaper commented 1 year ago

@oneilsh Noticed that we have duplicate edges (meaning, with the same generated UUID) in our merged graph.

  count                          value
1  9055    goa_go_annotation_edges.tsv
2  7692 string_protein_links_edges.tsv

I'm not sure yet if this is a merge bug or an ingest bug.

amc-corey-cox commented 1 year ago

I haven't been able to reproduce this. I'll reach out via Slack

amc-corey-cox commented 1 year ago

Okay, the transform_output is all fine but we get the duplicates in monarch_kg_edges.tsv so the issue is in the merge.

sort monarch-kg_edges.tsv | cut -f 1 | uniq -d

Kevin thinks this is in the code to apply mappings

"ENSEMBL:ENSRNOG00000004391 has mappings to both RGD:631370 and RGD:15003279 because I did it all with joins, it's just duplicating the row..."

"I wonder if this might end up being an argument for checking out diopt for orthology instead of panther"

amc-corey-cox commented 1 year ago

We're fairly confident that when we apply the mappings and there are nodes with multi-mapping, this causes a duplication of the edge id. I see a few approaches. 1) We could take the first mapping without regard to any score or judgment of the "best" mapping. 2) We could select a mapping based on any sort of ranking. 3) We could use both mappings and change the edge ID.

Regardless of our choice of these options, we should make this choice a setting and allow leaving the duplicates, unless we change the incoming data so we no longer have multi-mapping.

amc-corey-cox commented 1 year ago

I'm fixing this in cat-merge here: https://github.com/monarch-initiative/cat-merge/pull/50

amc-corey-cox commented 1 year ago

Kevin and I decided cat-merge shouldn't handle the logic for this. We'll just exclude all duplicates and report them as a QC issue. Fixing duplicates caused by mapping needs to be done upstream, either in a separate mapping tool or in the mapping inputs.

amc-corey-cox commented 1 year ago

This was closed in cat-merge. We went with the simple solution of dropping duplicates and reporting in the QC report. We should probably add a new issue for dealing with the specific duplicates we're finding in the QC.