Closed kevinschaper closed 1 year ago
I haven't been able to reproduce this. I'll reach out via Slack
Okay, the transform_output is all fine but we get the duplicates in monarch_kg_edges.tsv so the issue is in the merge.
sort monarch-kg_edges.tsv | cut -f 1 | uniq -d
Kevin thinks this is in the code to apply mappings
"ENSEMBL:ENSRNOG00000004391 has mappings to both RGD:631370 and RGD:15003279 because I did it all with joins, it's just duplicating the row..."
"I wonder if this might end up being an argument for checking out diopt for orthology instead of panther"
We're fairly confident that when we apply the mappings and there are nodes with multi-mapping, this causes a duplication of the edge id. I see a few approaches. 1) We could take the first mapping without regard to any score or judgment of the "best" mapping. 2) We could select a mapping based on any sort of ranking. 3) We could use both mappings and change the edge ID.
Regardless of our choice of these options, we should make this choice a setting and allow leaving the duplicates, unless we change the incoming data so we no longer have multi-mapping.
I'm fixing this in cat-merge here: https://github.com/monarch-initiative/cat-merge/pull/50
Kevin and I decided cat-merge shouldn't handle the logic for this. We'll just exclude all duplicates and report them as a QC issue. Fixing duplicates caused by mapping needs to be done upstream, either in a separate mapping tool or in the mapping inputs.
This was closed in cat-merge. We went with the simple solution of dropping duplicates and reporting in the QC report. We should probably add a new issue for dealing with the specific duplicates we're finding in the QC.
@oneilsh Noticed that we have duplicate edges (meaning, with the same generated UUID) in our merged graph.
I'm not sure yet if this is a merge bug or an ingest bug.