monarch-initiative / helpdesk

The Monarch Initiative Helpdesk
BSD 3-Clause "New" or "Revised" License

Discrepancies between TSV and Monarch-KG Data #131

Open nickzren opened 2 months ago

nickzren commented 2 months ago

I have downloaded data from Monarch Initiative (https://data.monarchinitiative.org/), specifically from the following sub sources:

Observations:

Example records present in TSV but not in Monarch-KG:

| subject | subject_label | subject_taxon | subject_taxon_label | object | object_label | relation | relation_label | evidence | evidence_label | source | is_defined_by | qualifier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HGNC:10004 | RGS9 | NCBITaxon:9606 | Homo sapiens | MONDO:0012033 | bradyopsia | RO:0003303 | causes condition | ECO:0000322 ECO:0000220 | imported manually asserted information used in automatic assertion; sequencing assay evidence | | https://archive.monarchinitiative.org/#omim https://archive.monarchinitiative.org/#orphanet | direct |
| HGNC:10008 | RHCE | NCBITaxon:9606 | Homo sapiens | MONDO:0019107 | Rh deficiency syndrome | RO:0004013 | is causal germline mutation in | ECO:0000322 | imported manually asserted information used in automatic assertion | | https://archive.monarchinitiative.org/#orphanet | direct |
Example records present in Monarch-KG but not in TSV:

| subject | subject_label | subject_category | subject_taxon | subject_taxon_label | negated | predicate | object | object_label | object_category | qualifiers | publications | has_evidence | primary_knowledge_source | aggregator_knowledge_source | object_taxon | object_taxon_label | object_taxon:1 | object_taxon_label:1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HGNC:10001 | RGS5 | biolink:Gene | NCBITaxon:9606 | Homo sapiens | | biolink:causes | MONDO:0007781 | essential hypertension, genetic | biolink:Disease | | | | infores:omim | infores:monarchinitiative infores:medgen | | | | |
| HGNC:10004 | RGS9 | biolink:Gene | NCBITaxon:9606 | Homo sapiens | | biolink:causes | MONDO:0958180 | prolonged electroretinal response suppression 1 | biolink:Disease | | | | infores:omim | infores:monarchinitiative infores:medgen | | | | |

Request:

Could you please clarify why these discrepancies exist between the TSV and Monarch-KG data? Is it safe to assume that the Monarch-KG data is the more accurate of the two?

kevinschaper commented 2 months ago

Hi @nickzren,

The tsv files for the new KG build are here: https://data.monarchinitiative.org/monarch-kg/latest/tsv/index.html

Out of concern for breaking anyone's pipelines, I've been a little hesitant to move the artifacts from the old build, which were put at the root of data.monarchinitiative.org, but the confusion that you ran into is exactly why I should move them.

data.monarchinitiative.org/latest and data.monarchinitiative.org/YYYYMM are the directories from the old build, which we should move down/aside to reduce confusion. data.monarchinitiative.org/monarch-kg/latest is the release from the new pipeline, and hopefully the tsv files from within a new release will match the full kg tsv! (Plus or minus columns: download files for specific categories should only have the columns relevant to that association category.)
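
For anyone who wants to check two TSV exports against each other, here is a minimal sketch that compares them on a set of shared key columns and ignores any extra columns. The helper names (`read_rows`, `diff_associations`), the choice of `subject` and `object` as the key, and the tiny in-memory samples (built from the example records earlier in this thread) are assumptions for illustration; this is not part of any Monarch tooling.

```python
import csv
import io


def read_rows(handle):
    """Read a tab-separated file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def diff_associations(rows_a, rows_b, key_columns=("subject", "object")):
    """Compare two row lists on key_columns only; extra columns are ignored.

    Returns (keys only in A, keys only in B), each as a sorted list of tuples.
    """
    def keys(rows):
        return {tuple(row.get(col, "") for col in key_columns) for row in rows}

    keys_a, keys_b = keys(rows_a), keys(rows_b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)


# Hypothetical in-memory stand-ins for the two downloads (columns trimmed).
OLD_TSV = (
    "subject\tobject\trelation\n"
    "HGNC:10004\tMONDO:0012033\tRO:0003303\n"
    "HGNC:10008\tMONDO:0019107\tRO:0004013\n"
)
KG_TSV = (
    "subject\tpredicate\tobject\n"
    "HGNC:10001\tbiolink:causes\tMONDO:0007781\n"
    "HGNC:10004\tbiolink:causes\tMONDO:0958180\n"
)

only_old, only_kg = diff_associations(
    read_rows(io.StringIO(OLD_TSV)),
    read_rows(io.StringIO(KG_TSV)),
)

for subject, obj in only_old:
    print("only in old TSV:", subject, obj)
for subject, obj in only_kg:
    print("only in Monarch-KG:", subject, obj)
```

With real downloads, the `io.StringIO` objects would be replaced by opened files (for example via `gzip.open(path, "rt")` for `.tsv.gz` files). Keying on `subject` and `object` sidesteps the difference between the old `relation` column and the new `predicate` column, at the cost of not distinguishing multiple relations between the same pair.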

nickzren commented 2 months ago

Ok, I see. So from now on, whenever I need the data, I should always rely on data.monarchinitiative.org/monarch-kg/latest, since it comes from the new pipeline, rather than data.monarchinitiative.org/latest, which comes from the old pipeline. Correct?

kevinschaper commented 2 months ago

That’s right, and I’m going to make it less confusing by moving the old files down to a subdirectory.

nickzren commented 2 months ago

Very helpful, thanks a lot @kevinschaper !