monarch-initiative / omim

Data ingest pipeline for OMIM.

Create a complete HGNC-OMIM SSSOM mapping file #44

Closed matentzn closed 2 years ago

matentzn commented 2 years ago

We need a complete mapping (covering all genes, not just the genes in omim.ttl in case these are restricted to the disease relevant ones) of all HGNC-OMIM genes. We should output an sssom file and attach it to the release, the same way as omim.ttl is attached.

joeflack4 commented 2 years ago

Alright, so I looked into this, and I can obtain these mappings from one or both of two places:

  i. OMIM's mim2gene.txt (we've been using this in the ingest so far; it has a column for HGNC mappings, but we have not yet used these in the generated omim.ttl)
  ii. OMIM's genemap2.txt (not used in the ingest yet, but it also has a column for HGNC mappings)

I plan to use both of these and, if I notice any inconsistencies in these mappings between the two files, I will make an error report.
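A minimal sketch of the consistency check described here, not the ingest's actual code. The column positions (MIM number at index 0 and approved symbol at index 3 for mim2gene.txt; indices 5 and 8 for genemap2.txt) are assumptions about the file layouts and should be verified against the real headers. The sample rows and the names `read_mapping` and `find_inconsistencies` are made up for illustration.

```python
import csv

def read_mapping(text, mim_col, symbol_col):
    """Return {MIM number: HGNC symbol} from tab-separated text.

    Lines starting with '#' are treated as comments; rows that are too
    short or lack either value are skipped.
    """
    mapping = {}
    lines = (ln for ln in text.splitlines() if ln and not ln.startswith("#"))
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(mim_col, symbol_col):
            continue
        mim, symbol = row[mim_col].strip(), row[symbol_col].strip()
        if mim and symbol:
            mapping[mim] = symbol
    return mapping

def find_inconsistencies(a, b):
    """Return {MIM: (symbol_a, symbol_b)} where both sources disagree."""
    return {m: (a[m], b[m]) for m in sorted(a.keys() & b.keys()) if a[m] != b[m]}

# Hypothetical sample data (placeholder MIM numbers and gene symbols).
SAMPLE_MIM2GENE = "\n".join([
    "# MIM Number\tMIM Entry Type\tEntrez Gene ID\tApproved Gene Symbol\tEnsembl Gene ID",
    "\t".join(["000001", "gene", "1", "GENE_A", "ENSG_A"]),
    "\t".join(["000002", "gene", "2", "GENE_B", "ENSG_B"]),
    "\t".join(["000003", "phenotype", "", "", ""]),
])
SAMPLE_GENEMAP2 = "\n".join([
    "# placeholder header line",
    "\t".join(["chr1", "0", "0", "", "", "000001", "GENE_A", "name a", "GENE_A"]),
    "\t".join(["chr1", "0", "0", "", "", "000002", "GENE_X", "name b", "GENE_X"]),
])

mim2gene = read_mapping(SAMPLE_MIM2GENE, mim_col=0, symbol_col=3)
genemap2 = read_mapping(SAMPLE_GENEMAP2, mim_col=5, symbol_col=8)
conflicts = find_inconsistencies(mim2gene, genemap2)
print(conflicts)
```

Only MIM numbers present in both files are compared; entries that appear in just one source would need a separate report.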

Questions/Issues

  1. Which prefix URL to use? I found (i) https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC: (example: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:404), and (ii) https://www.alliancegenome.org/gene/HGNC: (example: https://www.alliancegenome.org/gene/HGNC:404)
  2. Do I need to map OMIM terms to HGNC (a) symbols, or OMIM terms to HGNC (b) IDs? I didn't actually know there was a difference until just now. If you look at this example, you can see that the HGNC symbol ALDH2 has ID HGNC:404. If I need to go with (b), I'll need to figure out how I'm going to effectively map between these symbols and IDs, using web scraping as a last resort. If you have any recommendations, let me know.
matentzn commented 2 years ago

Sounds great.

  1. Always check bioregistry: https://bioregistry.io/registry/hgnc. Mondo used https://identifiers.org/hgnc:16793 so feel free to use that for now. I may change my mind. :P
  2. I don't know the difference, didn't know there was one! Can you ask in the kg-hub-n-data group in the Monarch slack space?
  3. No web scraping.
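
One scraping-free way to go from symbols to IDs is a lookup table built from a downloaded HGNC tab-separated export. The sketch below assumes such a file has `hgnc_id` and `symbol` columns; those column names and the helper `build_symbol_index` are assumptions for illustration, and the only data row is the ALDH2 / HGNC:404 pair mentioned earlier in this thread.

```python
import csv
import io

def build_symbol_index(tsv_text, id_col="hgnc_id", symbol_col="symbol"):
    """Return {symbol: HGNC ID} from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row[symbol_col]: row[id_col]
        for row in reader
        if row.get(symbol_col) and row.get(id_col)
    }

# Tiny inline sample standing in for the downloaded file.
SAMPLE_HGNC = "hgnc_id\tsymbol\nHGNC:404\tALDH2\n"

symbol_index = build_symbol_index(SAMPLE_HGNC)
print(symbol_index.get("ALDH2"))    # HGNC:404
print(symbol_index.get("UNKNOWN"))  # None: symbol not in the table
```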
joeflack4 commented 2 years ago

Thanks!

  1. Ok, I'll use https://identifiers.org/hgnc: as my CURIE prefix for HGNC. I can also do a quick PR to add it to https://github.com/monarch-initiative/mondo/blob/master/src/ontology/metadata/mondo.sssom.config.yml
  2. Sure thing.
  3. Fine w/ me. But if it is recommended that I use the IDs instead of the symbols, I'll need to find some way to map them.
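
For reference, a rough sketch of what an SSSOM TSV using the https://identifiers.org/hgnc: prefix could look like. Everything beyond that prefix is an assumption: the OMIM prefix URL, the choice of `skos:exactMatch` as predicate, the `semapv:UnspecifiedMatching` justification, the minimal column set, and the function name `to_sssom_tsv`. The MIM number in the sample is a placeholder, not a real entry.

```python
import csv
import io

# Prefix map written into the commented YAML header of the SSSOM file.
# Only the HGNC entry comes from this thread; the others are assumptions.
CURIE_MAP = {
    "HGNC": "https://identifiers.org/hgnc:",
    "OMIM": "https://omim.org/entry/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "semapv": "https://w3id.org/semapv/vocab/",
}

COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]

def to_sssom_tsv(pairs):
    """Render (MIM number, HGNC CURIE) pairs as SSSOM-style TSV text."""
    out = io.StringIO()
    out.write("#curie_map:\n")
    for prefix, url in CURIE_MAP.items():
        # Quote the URL so a trailing ':' does not confuse YAML parsers.
        out.write(f'#  {prefix}: "{url}"\n')
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for mim, hgnc_id in pairs:
        writer.writerow(
            [f"OMIM:{mim}", "skos:exactMatch", hgnc_id, "semapv:UnspecifiedMatching"]
        )
    return out.getvalue()

# Placeholder MIM number paired with the HGNC ID mentioned in the thread.
sssom_text = to_sssom_tsv([("000000", "HGNC:404")])
print(sssom_text)
```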

matentzn commented 2 years ago

Just in case I forgot to mention, it would be great to see the omim-hgnc.monarch.sssom.tsv file attached to the release: https://github.com/monarch-initiative/omim/releases/tag/latest

joeflack4 commented 2 years ago

You did mention that, and it's on my task list. But I did just think of a question, now that you mention this (I'll ask in today's meeting):

UMLS-OMIM && HGNC-OMIM Mapping files: which mappings?

joeflack4 commented 2 years ago

Results in latest release. I think I still need to split the file, though: https://github.com/monarch-initiative/omim/releases/tag/latest