monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Parse forward slashes in gwas catalog #1006

Open kshefchek opened 3 years ago

kshefchek commented 3 years ago

For example the row:

2020-06-25  25555482    Gelernter J 2014-09-16  Biol Psychiatry www.ncbi.nlm.nih.gov/pubmed/25555482    Genome-wide association study of nicotine dependence in American populations: identification of novel risk loci in both African-Americans and European-Americans.   Nicotine dependence symptom count   3,529 African American individuals, 4,117 European American individuals NA  6p21.32 6   32383959    intergenic  TSBP1-AS1           ENSG00000225914         rs35794310/rs147955325/rs11415565-TG    rs35794310/rs147955325/rs11415565   ...

Per https://www.ncbi.nlm.nih.gov/snp/rs35794310, rs35794310 was merged into rs11415565, and per https://www.ncbi.nlm.nih.gov/snp/rs147955325, rs147955325 was merged into rs11415565 as well.

We should model this similarly to how we model deprecated identifiers in ontologies, but it's unclear from this row alone which identifier is the current one (is it always the last in the list?)

See https://github.com/monarch-initiative/monarch-ui/issues/383
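
A minimal sketch of splitting a value like the one in the row above, assuming forward slashes separate the rs numbers and a trailing `-ALLELE` suffix (here `-TG`) applies to the whole string. The function name `split_snp_field` is hypothetical and not part of dipper:

```python
def split_snp_field(value):
    """Split a slash-delimited SNP field into (rs_ids, allele).

    Assumption: the risk allele, when present, follows the last hyphen.
    """
    value = value.strip()
    allele = None
    if "-" in value:
        value, allele = value.rsplit("-", 1)
    # Drop empty parts so stray or doubled slashes do not yield blank IDs.
    rs_ids = [part.strip() for part in value.split("/") if part.strip()]
    return rs_ids, allele


rs_ids, allele = split_snp_field("rs35794310/rs147955325/rs11415565-TG")
print(rs_ids, allele)
```

This only separates the identifiers; it does not say which one is current.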

kshefchek commented 3 years ago

According to the docs, if MERGED == 1, we should be using the SNP_ID_CURRENT column.

Looks like we already have some support for this: https://github.com/monarch-initiative/dipper/blob/254242e2/dipper/sources/GWASCatalog.py#L450

From the gwas catalog docs:

SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)

MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)

SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)
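
A rough sketch of how these three columns could be combined, not the existing dipper logic. It assumes the slash-separated entries in SNPS are the merged identifiers and that SNP_ID_CURRENT may come without the `rs` prefix. The function name `resolve_current_snp` and the input values in the usage lines are hypothetical; the example row above elides its MERGED and SNP_ID_CURRENT values.

```python
def resolve_current_snp(snps, merged, snp_id_current):
    """Return (current_ids, deprecated_ids) for one catalog row.

    Assumption: when MERGED is 1, SNP_ID_CURRENT names the current record and
    every other slash-separated entry in SNPS is a deprecated identifier.
    """
    listed = [s.strip() for s in snps.split("/") if s.strip()]
    current = str(snp_id_current).strip()
    if str(merged).strip() != "1" or not current:
        # Not merged (or no current ID given): keep the listed IDs as they are.
        return listed, []
    # Assumption: the column may hold a bare number, so add the prefix if absent.
    if not current.startswith("rs"):
        current = "rs" + current
    deprecated = [s for s in listed if s != current]
    return [current], deprecated


# Hypothetical MERGED and SNP_ID_CURRENT values, for illustration only.
current, deprecated = resolve_current_snp(
    "rs35794310/rs147955325/rs11415565", "1", "11415565"
)
print(current, deprecated)
```

The deprecated list could then be modeled the same way as deprecated identifiers in ontologies, each pointing at the current record.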