monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Add a custom map for alliance-genes #204

Closed kevinschaper closed 1 year ago

kevinschaper commented 2 years ago

The Alliance Gene to Phenotype ingest needs to keep only the gene records, because the source file also contains phenotype associations for other object types (alleles at least). Currently, we handle this by running a jq command as a hardcoded step after the download, which produces the list of Alliance gene IDs used for filtering.

We can remove this step either by adding yaml-only support to Koza for building maps from json files, or by using Koza's custom mapping feature. Since this is the only instance we've seen so far of wanting to make a map from a json file, and since we're mostly de-emphasizing the use of mapping files, I think it makes more sense to tackle this as a custom mapping file.

The shell command that we're currently running to make the mapping file is zcat data/alliance/BGI_*.gz | jq '.data[].basicGeneticEntity.primaryId' | gzip > data/alliance/alliance_gene_ids.txt.gz

I believe this means we'd simply need to make a custom map that only adds keys.
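For illustration, here is a rough Python sketch of what the jq pipeline above does, as a starting point for the custom map code. It is plain standard-library Python and does not use Koza's custom map API, which this issue doesn't spell out. The function names `collect_gene_ids` and `build_key_only_map` are hypothetical, and using an empty dict as the value for each key is an assumption about what a "keys only" map could look like.

```python
import glob
import gzip
import json


def collect_gene_ids(pattern):
    """Collect basicGeneticEntity.primaryId values from gzipped BGI JSON files.

    Mirrors: zcat BGI_*.gz | jq '.data[].basicGeneticEntity.primaryId'
    """
    gene_ids = set()
    for path in sorted(glob.glob(pattern)):
        # Each BGI file is a gzipped JSON document with a top-level "data" list
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            document = json.load(handle)
        for record in document["data"]:
            gene_ids.add(record["basicGeneticEntity"]["primaryId"])
    return gene_ids


def build_key_only_map(gene_ids):
    """Build a map that only adds keys (hypothetical shape: ID -> empty dict)."""
    return {gene_id: {} for gene_id in gene_ids}
```

Something like `build_key_only_map(collect_gene_ids("data/alliance/BGI_*.gz"))` would then give a lookup where membership alone says whether an ID is a gene.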

The Koza repository has an example of creating a custom map. I think that if a .py file exists with the same name as the map config's .yaml, the Python code will be executed to create the map. For a (non-JSON) example, check out:

https://github.com/monarch-initiative/koza/blob/main/examples/maps/custom-entrez-2-string.yaml

kevinschaper commented 1 year ago

This step really hasn't caused us enough trouble to make me think we need to add Koza code to avoid it, so I'm going to close this.