This PR contains a module to manage syncing data between Metamist and seqr. The idea is to abstract the stages of the seqr sync process so that the data synced to a seqr project can be highly customized and tailored.
Metamist already has a seqr db layer. This module replaces that layer with a more robust and abstracted implementation while reusing existing code where possible. Much of the code in this module was lifted from the `sync_seqr.py` script in `/scripts`.
To break down the module:
`seqr_sync.py`: The main part of the module, which takes the transformed data and posts it to seqr. Any script that syncs data to seqr should instantiate the class defined here and call its `sync_dataset` methods.
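As a rough sketch, a sync script might drive the module like this. Only `sync_dataset` is named in this PR; the class name, constructor parameters, and URL below are illustrative assumptions, not the PR's actual API.

```python
# Hypothetical shape of the seqr_sync.py entry point; everything except
# the sync_dataset method name is an illustrative assumption.

class SeqrSync:
    """Illustrative stub: posts transformed Metamist data to a seqr project."""

    def __init__(self, seqr_url: str, project_guid: str):
        self.seqr_url = seqr_url
        self.project_guid = project_guid
        self.synced: list[str] = []

    def sync_dataset(self, dataset: str) -> bool:
        # A real implementation would fetch, transform, then POST to seqr.
        self.synced.append(dataset)
        return True
```

A sync script would then do something like `SeqrSync(url, project).sync_dataset("my-dataset")`.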
`data_fetchers.py`: Contains the `MetamistFetcher` and `FileFetcher` classes, whose methods fetch data from Metamist and from files respectively. The fetched data then needs to be transformed into seqr's expected formats before being loaded.
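One way to keep fetchers like these testable is to inject the query callable rather than calling the Metamist client directly. This is a design sketch under that assumption; `MetamistFetcher` is named in the PR, but the method, query name, and constructor shown here are hypothetical.

```python
from typing import Callable

class MetamistFetcher:
    """Illustrative fetcher: the query callable is injected so tests can stub it."""

    def __init__(self, run_query: Callable[[str, dict], dict]):
        self._run_query = run_query

    def get_pedigree(self, project: str) -> list[dict]:
        # A real fetcher would issue a GraphQL (or REST) query against Metamist.
        result = self._run_query("pedigree", {"project": project})
        return result.get("rows", [])
```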
`data_transformers.py`: Contains the `SeqrTransformer` class, which converts the data formats output by Metamist into the formats expected by seqr, e.g. processing ped sex and affected values, processing HPO terms, and formatting the es-index JSON post.
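For context, PED files encode sex and affected status numerically (sex: 1=male, 2=female, 0=unknown; affected: 2=affected, 1=unaffected, 0/-9=unknown), while seqr expects letter codes. A minimal sketch of that conversion follows; the function name and the exact seqr letter codes are my assumptions, not taken from this PR.

```python
# Hypothetical conversion from PED numeric codes to seqr-style letter codes.
SEX_MAP = {"1": "M", "2": "F", "0": "U"}
AFFECTED_MAP = {"2": "A", "1": "N", "0": "U", "-9": "U"}

def transform_ped_row(row: dict) -> dict:
    """Return a copy of a pedigree row with seqr-style sex/affected values."""
    out = dict(row)
    out["sex"] = SEX_MAP.get(str(row.get("sex", "0")), "U")
    out["affected"] = AFFECTED_MAP.get(str(row.get("affected", "0")), "U")
    return out
```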
`config.py`: The definitions and global variables that clutter the top of the `sync_seqr.py` Metamist script have been moved into this file for cleaner access.
`utils.py`: Contains helper methods needed by parts of the sync process, e.g. writing the SG-to-PID map to the bucket and diffing sequencing groups when loading a new es-index.
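Diffing sequencing groups against a new es-index can be as simple as set arithmetic. A sketch of such a helper (the function name and return shape are hypothetical):

```python
# Hypothetical helper: report how sequencing group IDs differ between the
# currently loaded set and an incoming es-index.
def diff_sequencing_groups(existing: set[str], incoming: set[str]) -> dict[str, set[str]]:
    """Return which sequencing group IDs are added, removed, or unchanged."""
    return {
        "added": incoming - existing,
        "removed": existing - incoming,
        "unchanged": existing & incoming,
    }
```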
`logging_config.py`: Neatly contains the logging initialization for simple import and use.
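A minimal shape for such a module, purely illustrative of the "initialize once, import everywhere" pattern; the actual file's contents may differ.

```python
# Illustrative logging_config.py: one call configures logging, and every
# importer just asks for a named logger.
import logging

def get_logger(name: str = "seqr_sync") -> logging.Logger:
    """Configure root logging (a no-op if already configured) and return a logger."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger(name)
```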
Still TODO:
- [ ] Formalise how and where exactly this module belongs in the Metamist repo.
- [ ] Many of the methods added in this PR use GraphQL queries; however, the convention seems to be that db layers should use the standard REST APIs rather than the GQL API. Is it OK to add a module that leverages GQL like this? Why / why not?
- [ ] `generate_seqr_auth_token`, `send_slack_notification`