monarch-initiative / omim

Data ingest pipeline for OMIM.

HGNC symbol::id mappings: use better mapping file #55

Open joeflack4 opened 2 years ago

joeflack4 commented 2 years ago

I've been using data/hgnc/hgnc_complete_set.txt, which is provided by EBI, but that download is not reliable.

I should use data/hgnc/Homo_sapiens.gene_info instead, which can be obtained at https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz. Ideally, I'll want to download a fresh copy of this and unzip it each time the ingest runs. If the download fails, I should throw a warning and fall back to the cached version.
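A minimal sketch of that download-then-fall-back behaviour, using only the Python standard library. The function name `refresh_gene_info`, the `CACHE_PATH` constant, and the timeout value are hypothetical and not taken from the existing ingest code; only the URL and the cached file path come from this issue.

```python
import gzip
import shutil
import urllib.request
import warnings
from pathlib import Path

GENE_INFO_URL = (
    "https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/"
    "Homo_sapiens.gene_info.gz"
)
# Hypothetical constant: location of the cached, unzipped copy.
CACHE_PATH = Path("data/hgnc/Homo_sapiens.gene_info")


def refresh_gene_info(url=GENE_INFO_URL, cache_path=CACHE_PATH, timeout=60):
    """Download and unzip a fresh copy; fall back to the cache on failure."""
    cache_path = Path(cache_path)
    # Write to a temporary file first so a failed download never
    # overwrites a good cached copy.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    try:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with gzip.GzipFile(fileobj=response) as unzipped:
                with open(tmp_path, "wb") as out:
                    shutil.copyfileobj(unzipped, out)
        tmp_path.replace(cache_path)
    except (OSError, EOFError) as err:
        # URLError, timeouts, and bad gzip data are all OSError subclasses;
        # EOFError covers a truncated gzip stream.
        tmp_path.unlink(missing_ok=True)
        if not cache_path.exists():
            raise FileNotFoundError(
                f"Download failed and no cached copy at {cache_path}"
            ) from err
        warnings.warn(
            f"Could not download {url} ({err}); using cached {cache_path}"
        )
    return cache_path
```

Calling `refresh_gene_info()` at the start of the ingest would return the path to read from in either case. If there is no cached copy at all, the sketch raises rather than continuing silently.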

Reasons to make this change:

  1. This seems more reliable than the EBI download: EBI's FTP server sometimes gives "Bad Gateway" when you try to access it.
  2. It's exactly what OMIM uses.

matentzn commented 2 years ago

For the sake of documentation, can you summarise exactly what makes one more reliable than the other? Also, please list any reasons against making this change that come to mind.

joeflack4 commented 2 years ago

Added the reasoning about the EBI option's unreliability to the OP. Also, to expand on reason (2): since OMIM uses the HGNC mappings from NCBI, I'm assuming they do this because they find it the best option, rather than as an arbitrary decision. I can't be absolutely sure that the HGNC mappings in this NCBI file are more recent than EBI's, but I think the fact that OMIM uses it makes it more justifiable.

This file from NCBI also has HGNC::OMIM mappings, but I think it best to get those from OMIM. So to recap, the HGNC_ID::OMIM mappings will come from OMIM, and the HGNC_ID::HGNC_Symbol mappings will come from NCBI.
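For reference, a rough sketch of pulling the HGNC_ID::HGNC_Symbol pairs out of the NCBI file. It assumes the file is tab-separated with a header row starting with `#tax_id`, that HGNC IDs appear in the `dbXrefs` column as `HGNC:HGNC:<number>` entries separated by `|`, and that the approved symbol is in `Symbol_from_nomenclature_authority` (with `-` meaning no value). The function name `read_hgnc_symbols` is hypothetical, and the column layout should be checked against a current download.

```python
import csv


def read_hgnc_symbols(lines):
    """Return {HGNC ID: HGNC symbol} from gene_info-style lines."""
    # QUOTE_NONE: treat quote characters in descriptions as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("#")  # header row starts with "#tax_id"
    symbol_col = header.index("Symbol_from_nomenclature_authority")
    xref_col = header.index("dbXrefs")

    mapping = {}
    for row in reader:
        symbol = row[symbol_col]
        if symbol == "-":  # assumed placeholder for "no value"
            continue
        for xref in row[xref_col].split("|"):
            # Assumed form: "HGNC:HGNC:5" -> keep "HGNC:5" as the ID.
            if xref.startswith("HGNC:"):
                mapping[xref[len("HGNC:"):]] = symbol
    return mapping


# Usage with the cached file from this issue:
# with open("data/hgnc/Homo_sapiens.gene_info", newline="") as f:
#     hgnc_id_to_symbol = read_hgnc_symbols(f)
```

The `MIM:` entries in `dbXrefs` are deliberately ignored here, since the HGNC_ID::OMIM mappings are meant to come from OMIM itself.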

matentzn commented 2 years ago

Ok, I trust your judgement! Thanks for the explanation :)