monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Add HPOA gene to disease ingest (and remove OMIM g2d) #444

Closed kevinschaper closed 1 year ago

kevinschaper commented 1 year ago

We have HPO's g2d file sitting in our data-cache now (thanks @iimpulse!), and we can swap it in for our OMIM g2d ingest.

The file looks like this (rows taken from a few places to get examples of all 3 association types):

ncbi_gene_id    gene_symbol     association_type        disease_id      source
NCBIGene:64170  CARD9   MENDELIAN       OMIM:212050     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:51256  TBC1D7  MENDELIAN       OMIM:248000     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:28981  IFT81   MENDELIAN       OMIM:617895     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:8216   LZTR1   MENDELIAN       OMIM:616564     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:6505   SLC1A1  POLYGENIC       OMIM:615232     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:4750   NEK1    POLYGENIC       OMIM:617892     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:25913  POT1    POLYGENIC       OMIM:616568     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:200942 KLHDC8B POLYGENIC       OMIM:236000     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:55687  TRMU    POLYGENIC       OMIM:580000     ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
NCBIGene:3265   HRAS    UNKNOWN ORPHA:79414     http://www.orphadata.org/data/xml/en_product6.xml
NCBIGene:472    ATM     UNKNOWN ORPHA:370109    http://www.orphadata.org/data/xml/en_product6.xml
NCBIGene:26191  PTPN22  UNKNOWN ORPHA:397       http://www.orphadata.org/data/xml/en_product6.xml
NCBIGene:3106   HLA-B   UNKNOWN ORPHA:397       http://www.orphadata.org/data/xml/en_product6.xml
NCBIGene:3123   HLA-DRB1        UNKNOWN ORPHA:397       http://www.orphadata.org/data/xml/en_product6.xml

ncbi_gene_id

No change necessary, just pass through as-is (and it will be mapped to HGNC later)

gene_symbol

Not used

association_type

values:

6573 MENDELIAN
621 POLYGENIC
8158 UNKNOWN
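A tally like the one above can be reproduced with a few lines of Python. This is only a sketch: it assumes the file is tab-separated with the header row shown earlier, and `count_column` is a hypothetical helper name, not something from the ingest code. The sample rows are copied from the excerpt in this issue.

```python
import csv
from collections import Counter
from typing import Callable, Iterable


def count_column(
    lines: Iterable[str],
    column: str,
    transform: Callable[[str], str] = lambda value: value,
) -> Counter:
    """Count the values of one column in a tab-separated file with a header row."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(transform(row[column]) for row in reader)


# A few rows from the excerpt above (assumption: the real file is tab-separated).
SAMPLE = [
    "ncbi_gene_id\tgene_symbol\tassociation_type\tdisease_id\tsource",
    "NCBIGene:64170\tCARD9\tMENDELIAN\tOMIM:212050\tftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen",
    "NCBIGene:6505\tSLC1A1\tPOLYGENIC\tOMIM:615232\tftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen",
    "NCBIGene:3265\tHRAS\tUNKNOWN\tORPHA:79414\thttp://www.orphadata.org/data/xml/en_product6.xml",
    "NCBIGene:472\tATM\tUNKNOWN\tORPHA:370109\thttp://www.orphadata.org/data/xml/en_product6.xml",
]

type_counts = count_column(SAMPLE, "association_type")
# Passing a transform lets the same helper count CURIE prefixes of disease_id.
prefix_counts = count_column(SAMPLE, "disease_id", lambda curie: curie.split(":")[0])
```

To run it against the real file, pass an open file handle (opened with `newline=""`) instead of `SAMPLE`.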

We should check with @sabrinatoro on predicate mappings.

my partial guess:

MENDELIAN: biolink:affects_risk_for
UNKNOWN: biolink:gene_associated_with_condition
POLYGENIC: ???

disease_id

prefixes:

7194 OMIM
8158 ORPHA

Need to replace("ORPHA:", "Orphanet:") so the IDs match the CURIE prefix used in the MONDO SSSOM mappings
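A minimal sketch of that prefix rewrite, written as a standalone function rather than as part of the actual Koza transform. `normalize_disease_id` is a hypothetical name. It only touches the prefix, so an "ORPHA:" substring elsewhere in an ID would be left alone, which a bare `str.replace` would not guarantee.

```python
def normalize_disease_id(disease_id: str) -> str:
    """Rewrite an ORPHA: CURIE prefix to Orphanet:, leaving other CURIEs untouched."""
    prefix, sep, local_id = disease_id.partition(":")
    if sep and prefix == "ORPHA":
        return f"Orphanet:{local_id}"
    return disease_id


# IDs taken from the file excerpt in this issue.
orphanet_id = normalize_disease_id("ORPHA:397")
omim_id = normalize_disease_id("OMIM:212050")
```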

source

The source column values are

7194 ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen
8158 http://www.orphadata.org/data/xml/en_product6.xml

We should map these to infores identifiers for the primary_knowledge_source (maybe infores:omim for the medgen file, then medgen as an aggregator? Plus infores:orphanet for the Orphanet file)

Also add aggregating_knowledge_source that includes infores:monarchinitiative, infores:hpo-annotations, and probably infores:medgen
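One way the source mapping could look, as a sketch of the proposal above rather than the final ingest logic. The dictionary and function names are hypothetical, and attaching infores:medgen only to the mim2gene_medgen rows is an assumption; the issue leaves open whether it applies to every row.

```python
from __future__ import annotations

# Assumed mapping from the two source URLs in the file to infores identifiers.
SOURCE_TO_PRIMARY = {
    "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen": "infores:omim",
    "http://www.orphadata.org/data/xml/en_product6.xml": "infores:orphanet",
}

# Aggregators named in the issue that would apply to every row.
BASE_AGGREGATORS = ["infores:monarchinitiative", "infores:hpo-annotations"]


def knowledge_sources(source: str) -> tuple[str, list[str]]:
    """Return (primary_knowledge_source, aggregating_knowledge_source) for a source URL."""
    # A KeyError here flags a source URL that has not been mapped yet.
    primary = SOURCE_TO_PRIMARY[source]
    aggregators = list(BASE_AGGREGATORS)
    if primary == "infores:omim":
        # Assumption: medgen aggregates only the OMIM-derived rows.
        aggregators.append("infores:medgen")
    return primary, aggregators


primary, aggregators = knowledge_sources(
    "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen"
)
```

Failing loudly on an unmapped URL is deliberate in this sketch: a new source value in a future release would otherwise slip through without provenance.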

download & file path

The file is already available at data/hpoa/genes_to_disease.txt in gs://monarch-ingest-data-cache. It won't be available from the HPO site until the next release, but we should still add a commented-out entry in download.yaml and open a second issue to enable the download later.

sabrinatoro commented 1 year ago

This is very confusing because 4 out of 5 of the "POLYGENIC" examples were associated with only one gene. That being said, using the predicate "biolink:gene_associated_with_condition" is broad enough. It means there is some association between the gene and the disease. Therefore, I think we can use it in all cases.

We could probably get more specific for the "MENDELIAN" one (using RO:0003303, causes condition). But I don't see an existing biolink predicate that represents it (it doesn't mean it doesn't exist)

I hope that helps!

kevinschaper commented 1 year ago

Oof, I just tracked backwards to figure out where the risk_affected_by predicate came from. It was commented out for OMIM "{" disorder labels, along with a "contributes to" relation lookup from the translation table. When I updated the OMIM ingest to set the RO term based on the spreadsheet, I kept the predicate. Then the relation field went away in favor of only using predicates, and that predicate was all we had left. I'm very glad we're dealing with this!

kevinschaper commented 1 year ago

The predicate mapping ended up being:

kevinschaper commented 1 year ago

This ingest is included and looking good in the brand new 2023-05-03 release.

The list of dangling edges where the gene didn't connect is pretty small:

original_subject subject predicate object original_object
NCBIGene:111365204 biolink:causes MONDO:0007630 OMIM:136550
NCBIGene:111365204 biolink:causes MONDO:0010932 OMIM:600790
NCBIGene:105259599 biolink:causes MONDO:0020796 OMIM:180860
NCBIGene:109580095 biolink:causes MONDO:0013517 OMIM:613985
NCBIGene:105259599 biolink:causes MONDO:0007534 OMIM:130650
NCBIGene:10108 biolink:causes MONDO:0008300 OMIM:176270
NCBIGene:7467 biolink:causes MONDO:0008684 OMIM:194190
NCBIGene:105259599 biolink:causes MONDO:0008680 OMIM:194071

The list is a bit larger for diseases that we don't have an entity for:

original_subject subject predicate object original_object
NCBIGene:3662 HGNC:6119 biolink:causes OMIM:611724
NCBIGene:434 HGNC:745 biolink:causes OMIM:611742
NCBIGene:7471 HGNC:12774 biolink:contributes_to OMIM:615221
NCBIGene:977 HGNC:1630 biolink:causes OMIM:179620
NCBIGene:4254 HGNC:6343 biolink:causes OMIM:611664
NCBIGene:5358 HGNC:9091 biolink:causes OMIM:300910
NCBIGene:124872 HGNC:24136 biolink:causes OMIM:615018
NCBIGene:3162 HGNC:5013 biolink:contributes_to OMIM:606963
NCBIGene:4312 HGNC:7155 biolink:causes OMIM:606963
NCBIGene:8706 HGNC:918 biolink:causes OMIM:615021
NCBIGene:6906 HGNC:11583 biolink:causes OMIM:300932
NCBIGene:3990 HGNC:6619 biolink:causes OMIM:612797
NCBIGene:960 HGNC:1681 biolink:causes OMIM:609027
NCBIGene:4018 HGNC:6667 biolink:contributes_to OMIM:618807
NCBIGene:4157 HGNC:6929 biolink:causes OMIM:613098
NCBIGene:9780 HGNC:28993 biolink:causes OMIM:620207
NCBIGene:10551 HGNC:328 biolink:causes MONDO:0859370 OMIM:620233
NCBIGene:84466 HGNC:29634 biolink:causes MONDO:0859515 OMIM:620249
NCBIGene:779 HGNC:1397 biolink:causes MONDO:0859514 OMIM:620246
NCBIGene:3848 HGNC:6412 biolink:causes MONDO:0859574 OMIM:620148
NCBIGene:153 HGNC:285 biolink:causes OMIM:607276
NCBIGene:7125 HGNC:11944 biolink:causes MONDO:0859335 OMIM:620161
NCBIGene:9570 HGNC:4431 biolink:causes MONDO:0859336 OMIM:620166
NCBIGene:55719 HGNC:17814 biolink:causes MONDO:0859575 OMIM:620184
NCBIGene:23137 HGNC:20465 biolink:causes MONDO:0859576 OMIM:620185
NCBIGene:2255 HGNC:3666 biolink:causes MONDO:0859578 OMIM:620193
NCBIGene:2261 HGNC:3690 biolink:causes MONDO:0859577 OMIM:620192
NCBIGene:23286 HGNC:29435 biolink:causes OMIM:615602
NCBIGene:84946 HGNC:21173 biolink:causes MONDO:0859355 OMIM:620199
NCBIGene:55969 HGNC:15870 biolink:causes MONDO:0859567 OMIM:616994
NCBIGene:720 HGNC:1323 biolink:causes OMIM:614374
NCBIGene:949 HGNC:1664 biolink:causes OMIM:610762
NCBIGene:345275 HGNC:18685 biolink:contributes_to OMIM:620116
NCBIGene:90523 HGNC:21355 biolink:causes MONDO:0859322 OMIM:620138
NCBIGene:8854 HGNC:15472 biolink:causes MONDO:0859571 OMIM:620025
NCBIGene:118987 HGNC:26974 biolink:causes MONDO:0859281 OMIM:620021
NCBIGene:55107 HGNC:21625 biolink:causes MONDO:0859289 OMIM:620045
NCBIGene:171019 HGNC:17111 biolink:causes MONDO:0859572 OMIM:620067
NCBIGene:3911 HGNC:6485 biolink:causes MONDO:0859573 OMIM:620076
NCBIGene:506 HGNC:830 biolink:causes MONDO:0859302 OMIM:620085
NCBIGene:865 HGNC:1539 biolink:causes MONDO:0859307 OMIM:620099
NCBIGene:3077 HGNC:4886 biolink:causes OMIM:614193
NCBIGene:2556 HGNC:4077 biolink:causes MONDO:0859564 OMIM:301091
NCBIGene:2157 HGNC:3546 biolink:causes MONDO:0859082 OMIM:301071
NCBIGene:2532 HGNC:4035 biolink:causes OMIM:611862
NCBIGene:3047 HGNC:4831 biolink:causes OMIM:141749
NCBIGene:3048 HGNC:4832 biolink:causes OMIM:141749
NCBIGene:3043 HGNC:4827 biolink:causes OMIM:141749
NCBIGene:50833 HGNC:14921 biolink:causes OMIM:617956
NCBIGene:29881 HGNC:7898 biolink:causes OMIM:617966
NCBIGene:6006 HGNC:10008 biolink:causes OMIM:617970
NCBIGene:55366 HGNC:13299 biolink:contributes_to OMIM:615311
NCBIGene:3615 HGNC:6053 biolink:causes OMIM:617995
NCBIGene:55366 HGNC:13299 biolink:causes MONDO:0859205 OMIM:619613
NCBIGene:4087 HGNC:6768 biolink:causes MONDO:0859213 OMIM:619657
NCBIGene:8482 HGNC:10741 biolink:causes OMIM:614745
NCBIGene:3570 HGNC:6019 biolink:causes OMIM:614752
NCBIGene:2646 HGNC:4196 biolink:causes OMIM:613463
NCBIGene:57498 HGNC:29508 biolink:causes MONDO:0859184 OMIM:619501
NCBIGene:5290 HGNC:8975 biolink:causes MONDO:0859192 OMIM:619538
NCBIGene:3570 HGNC:6019 biolink:causes OMIM:614689
NCBIGene:285498 HGNC:27729 biolink:causes OMIM:612042
NCBIGene:338557 HGNC:19061 biolink:contributes_to OMIM:607514
NCBIGene:1136 HGNC:1957 biolink:contributes_to OMIM:612052
NCBIGene:1138 HGNC:1959 biolink:contributes_to OMIM:612052
NCBIGene:3773 HGNC:6262 biolink:causes MONDO:0859167 OMIM:619406
NCBIGene:6809 HGNC:11438 biolink:causes MONDO:0859170 OMIM:619446
NCBIGene:6007 HGNC:10009 biolink:contributes_to MONDO:0859172 OMIM:619462
NCBIGene:51474 HGNC:24636 biolink:causes OMIM:618079
NCBIGene:1378 HGNC:2334 biolink:causes OMIM:607486
NCBIGene:360 HGNC:636 biolink:causes OMIM:607457
NCBIGene:7351 HGNC:12518 biolink:contributes_to OMIM:607447
NCBIGene:51129 HGNC:16039 biolink:causes OMIM:615881
NCBIGene:1317 HGNC:11016 biolink:causes OMIM:620306
NCBIGene:3032 HGNC:4803 biolink:causes OMIM:620300
NCBIGene:84699 HGNC:18855 biolink:causes MONDO:0859149 OMIM:619324
NCBIGene:1289 HGNC:2209 biolink:causes MONDO:0859151 OMIM:619329
NCBIGene:26175 HGNC:20218 biolink:causes MONDO:0859156 OMIM:619345
NCBIGene:1558 HGNC:2622 biolink:contributes_to OMIM:618018
NCBIGene:1066 HGNC:1863 biolink:causes OMIM:618057
NCBIGene:347734 HGNC:16872 biolink:causes MONDO:0859518 OMIM:620269
NCBIGene:58 HGNC:129 biolink:causes MONDO:0859517 OMIM:620265
NCBIGene:58 HGNC:129 biolink:causes MONDO:0859523 OMIM:620278
NCBIGene:3604 HGNC:11924 biolink:causes MONDO:0859526 OMIM:620282
NCBIGene:3055 HGNC:4840 biolink:causes OMIM:620296
NCBIGene:23129 HGNC:9107 biolink:causes MONDO:0859532 OMIM:620294
NCBIGene:2805 HGNC:4432 biolink:causes OMIM:614419
NCBIGene:9429 HGNC:74 biolink:causes OMIM:614490
NCBIGene:10913 HGNC:2895 biolink:causes OMIM:612630
NCBIGene:2524 HGNC:4013 biolink:contributes_to OMIM:612542
NCBIGene:9370 HGNC:13633 biolink:causes OMIM:612556
NCBIGene:7367 HGNC:12547 biolink:contributes_to OMIM:612560
NCBIGene:2492 HGNC:3969 biolink:causes OMIM:276400
NCBIGene:9200 HGNC:9639 biolink:causes MONDO:0859264 OMIM:619967
NCBIGene:7123 HGNC:11891 biolink:causes MONDO:0859568 OMIM:619977
NCBIGene:497661 HGNC:31690 biolink:causes MONDO:0859271 OMIM:619985
NCBIGene:56992 HGNC:17273 biolink:causes MONDO:0859570 OMIM:619981
NCBIGene:54914 HGNC:23377 biolink:causes MONDO:0859273 OMIM:619991
NCBIGene:420 HGNC:726 biolink:causes OMIM:616060
NCBIGene:28 HGNC:79 biolink:causes OMIM:616093
NCBIGene:7299 HGNC:12442 biolink:contributes_to OMIM:601800
NCBIGene:79068 HGNC:24678 biolink:contributes_to OMIM:612460
NCBIGene:1604 HGNC:2665 biolink:causes OMIM:613793
NCBIGene:7289 HGNC:12425 biolink:causes MONDO:0859254 OMIM:619902
NCBIGene:335 HGNC:600 biolink:causes MONDO:0859238 OMIM:619836
NCBIGene:8789 HGNC:3607 biolink:causes MONDO:0859246 OMIM:619864
NCBIGene:10935 HGNC:9354 biolink:causes MONDO:0859248 OMIM:619871
NCBIGene:5122 HGNC:8743 biolink:contributes_to OMIM:612362
NCBIGene:3655 HGNC:6142 biolink:causes MONDO:0859233 OMIM:619817
NCBIGene:54872 HGNC:25985 biolink:causes OMIM:619812
NCBIGene:79639 HGNC:26186 biolink:causes MONDO:0859226 OMIM:619727
NCBIGene:4160 HGNC:6932 biolink:causes OMIM:618406
NCBIGene:51341 HGNC:18078 biolink:causes MONDO:0859231 OMIM:619769
NCBIGene:6521 HGNC:11027 biolink:causes OMIM:601550
NCBIGene:9429 HGNC:74 biolink:causes OMIM:138900
NCBIGene:6521 HGNC:11027 biolink:causes OMIM:601551
NCBIGene:59341 HGNC:18083 biolink:causes OMIM:613508
NCBIGene:6774 HGNC:11364 biolink:causes OMIM:147060
NCBIGene:10661 HGNC:6345 biolink:causes OMIM:613566
NCBIGene:6272 HGNC:11186 biolink:causes OMIM:613589
NCBIGene:219931 HGNC:20820 biolink:causes OMIM:612267
NCBIGene:100128908 HGNC:53647 biolink:causes MONDO:0859222 OMIM:619702
NCBIGene:51780 HGNC:1337 biolink:gene_associated_with_condition MONDO:0858999 Orphanet:633004
NCBIGene:2316 HGNC:3754 biolink:gene_associated_with_condition Orphanet:323
NCBIGene:65109 HGNC:20439 biolink:gene_associated_with_condition Orphanet:323
NCBIGene:8573 HGNC:1497 biolink:gene_associated_with_condition Orphanet:323
NCBIGene:254065 HGNC:17342 biolink:gene_associated_with_condition Orphanet:323
NCBIGene:6558 HGNC:10911 biolink:gene_associated_with_condition Orphanet:633024
NCBIGene:6558 HGNC:10911 biolink:gene_associated_with_condition Orphanet:633021
NCBIGene:1363 HGNC:2303 biolink:gene_associated_with_condition MONDO:0859001 Orphanet:633028
NCBIGene:1315 HGNC:2231 biolink:gene_associated_with_condition MONDO:0859002 Orphanet:633035
NCBIGene:3075 HGNC:4883 biolink:gene_associated_with_condition Orphanet:244275
NCBIGene:3426 HGNC:5394 biolink:gene_associated_with_condition Orphanet:244275
NCBIGene:476 HGNC:799 biolink:gene_associated_with_condition Orphanet:564178
NCBIGene:6906 HGNC:11583 biolink:gene_associated_with_condition Orphanet:209893
NCBIGene:1589 HGNC:2600 biolink:gene_associated_with_condition Orphanet:95698
NCBIGene:410 HGNC:713 biolink:gene_associated_with_condition Orphanet:751
NCBIGene:136371 HGNC:17185 biolink:gene_associated_with_condition Orphanet:353225
NCBIGene:1545 HGNC:2597 biolink:gene_associated_with_condition Orphanet:353225
NCBIGene:134430 HGNC:30696 biolink:gene_associated_with_condition Orphanet:353225
NCBIGene:4909 HGNC:8024 biolink:gene_associated_with_condition Orphanet:353225
NCBIGene:10133 HGNC:17142 biolink:gene_associated_with_condition Orphanet:353225
NCBIGene:51271 HGNC:12461 biolink:gene_associated_with_condition MONDO:0858986 Orphanet:631068
NCBIGene:4159 HGNC:6931 biolink:gene_associated_with_condition Orphanet:217031
NCBIGene:57156 HGNC:23787 biolink:gene_associated_with_condition MONDO:0858992 Orphanet:631088
NCBIGene:7920 HGNC:13921 biolink:gene_associated_with_condition MONDO:0858991 Orphanet:631085
NCBIGene:81790 HGNC:25358 biolink:gene_associated_with_condition MONDO:0858990 Orphanet:631082
NCBIGene:84842 HGNC:28242 biolink:gene_associated_with_condition MONDO:0858988 Orphanet:631076
NCBIGene:5833 HGNC:8756 biolink:gene_associated_with_condition MONDO:0858987 Orphanet:631073
NCBIGene:5297 HGNC:8983 biolink:gene_associated_with_condition MONDO:0858989 Orphanet:631079
NCBIGene:27022 HGNC:3804 biolink:gene_associated_with_condition Orphanet:3435
NCBIGene:22861 HGNC:14374 biolink:gene_associated_with_condition Orphanet:3435
NCBIGene:3762 HGNC:6266 biolink:gene_associated_with_condition Orphanet:85142
NCBIGene:1499 HGNC:2514 biolink:gene_associated_with_condition Orphanet:85142
NCBIGene:776 HGNC:1391 biolink:gene_associated_with_condition Orphanet:85142
NCBIGene:492 HGNC:816 biolink:gene_associated_with_condition Orphanet:85142
NCBIGene:476 HGNC:799 biolink:gene_associated_with_condition Orphanet:85142
NCBIGene:4547 HGNC:7467 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:255738 HGNC:20001 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:29881 HGNC:7898 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:338 HGNC:603 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:27329 HGNC:491 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:345 HGNC:610 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:29116 HGNC:21155 biolink:gene_associated_with_condition Orphanet:426
NCBIGene:3949 HGNC:6547 biolink:gene_associated_with_condition Orphanet:406
NCBIGene:255738 HGNC:20001 biolink:gene_associated_with_condition Orphanet:406
NCBIGene:338 HGNC:603 biolink:gene_associated_with_condition Orphanet:406
NCBIGene:26228 HGNC:24133 biolink:gene_associated_with_condition Orphanet:406
NCBIGene:348 HGNC:613 biolink:gene_associated_with_condition Orphanet:406
NCBIGene:3988 HGNC:6617 biolink:gene_associated_with_condition Orphanet:406

It looks like the MONDO terms listed here, which I have mappings for but which aren't actually in our graph yet, are going to show up in the next release. It's exciting that this is tight enough that the only remaining problems come down to how we synchronize within a month.

kevinschaper commented 1 year ago

This is done