opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org

V2D enhancements #2065

Closed · ireneisdoomed closed 1 year ago

ireneisdoomed commented 2 years ago

This is an attempt to collect all the steps of the Genetics Variant-to-Disease (V2D) pipeline, so as to have an overall description of how it works and to identify points where the pipeline can be improved.

2 overall notes:

Workflow DAG

Graph of dependencies between the different scripts called in the pipeline.

[image: workflow DAG]

Workflow description

The whole pipeline consists of the creation of 4 tables: study, fine-mapping, LD, and top loci.

Top loci table

The result of merging the associations from GWASCat and the summary statistics (merge_gwascat_and_sumstat_toploci).

GWASCat:

  1. Download GWAS Catalog associations (All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)
  2. Filter down the variant index to only include variants collected in the GWAS Catalog.
    • (Problem: this is parsed and joined by reading the inputs line by line; see the sketch after this list)
    • ~Bug: the code cannot run without hitting an encoding issue: ValueError: invalid literal for int() with base 10: '\ufeff79254949'~ Fixed by @Jeremy37
  3. Populate variant annotation on the GWAS Catalog data.
  4. Join GWAS Catalog studies with a table of new studies and format.
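
A hedged sketch of how step 2 could avoid the line-by-line parsing: read both inputs into Spark and use a semi-join. The paths and the chrom/pos column names are illustrative assumptions, not the pipeline's actual schema.

```python
# Illustrative sketch only: paths and column names (chrom, pos) are
# assumptions, not the pipeline's actual identifiers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

variant_index = spark.read.parquet("variant_index/")
gwascat = spark.read.csv("gwascat_associations.tsv", sep="\t", header=True)

# Keep only variant-index rows whose coordinates appear in the GWAS
# Catalog associations; broadcasting the (much smaller) association
# table avoids shuffling the full variant index.
filtered = variant_index.join(
    gwascat.select("chrom", "pos").distinct().hint("broadcast"),
    on=["chrom", "pos"],
    how="left_semi",
)
filtered.write.mode("overwrite").parquet("variant_index_filtered/")
```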

Summary statistics:

  1. Associations come from the output of the fine-mapping pipeline
    • Problem: this file path is hard-coded in the pipeline config as 'gs://genetics-portal-dev-staging/finemapping/merged_210515/top_loci.json.gz' (see the sketch after this list); is this the most up-to-date data? Yes, as noted by @Jeremy37, the config is updated per release.
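
As a small robustness enhancement, all release-stamped paths could be derived from a single release field, so only one value changes per release. A sketch under assumed key names, not the pipeline's real config schema; only the GCS root and the release stamp come from the path above.

```python
# Hypothetical config handling: key names and path template are
# assumptions made for illustration.
import yaml

config = yaml.safe_load("""
release: "210515"
finemapping_root: "gs://genetics-portal-dev-staging/finemapping"
""")

top_loci = f"{config['finemapping_root']}/merged_{config['release']}/top_loci.json.gz"
credset = f"{config['finemapping_root']}/merged_{config['release']}/credset/"
```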

Study table - TBC

Built by independently generating 3 tables: GWASCat, UKBB, and FinnGen.

GWASCat:

  1. Download GWAS Catalog studies (All studies v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology).
  2. Split studies with multiple traits (see the sketch after this list) and merge this table with:
    1. The table containing GWAS associations enriched with variant information (almost the top loci table)
    2. Ancestry annotation
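
A sketch of the trait split in step 2, assuming the GWAS Catalog download format in which MAPPED_TRAIT_URI packs multiple trait URIs into one comma-separated field; the input path and the use of Spark here are illustrative.

```python
# Illustrative only: the input path is an assumption; MAPPED_TRAIT_URI
# follows the GWAS Catalog studies download format.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

studies = spark.read.csv("gwascat_studies.tsv", sep="\t", header=True)

# One output row per study/trait pair.
split_studies = studies.withColumn(
    "trait_uri", f.explode(f.split(f.col("MAPPED_TRAIT_URI"), ", *"))
)
```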

Fine mapping table

  1. Download the credible set from the fine-mapping pipeline.
    • Problem: this file path is hard-coded in the pipeline config as 'gs://genetics-portal-dev-staging/finemapping/merged_210515/credset/_SUCCESS'; is this the most up-to-date data? Yes, as noted by @Jeremy37, the config is updated per release.
    • Enhancement: download_credible_set_directory has some convoluted logic to download the credible set; apparently the complication is that it is a directory. This should be straightforward to handle (see the sketch after this list).
  2. Format the fine-mapping table.
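
A sketch of how download_credible_set_directory could handle the directory case directly: in GCS a "directory" is just a key prefix, so listing blobs by prefix and downloading each one suffices. Bucket and prefix are taken from the path above; the function name is hypothetical.

```python
# Hypothetical replacement for download_credible_set_directory: list
# the blobs under the prefix and download each to a local mirror.
from pathlib import Path
from google.cloud import storage

def download_prefix(bucket_name: str, prefix: str, dest: str) -> None:
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip directory placeholder objects
            continue
        target = Path(dest) / Path(blob.name).relative_to(prefix)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))

download_prefix(
    "genetics-portal-dev-staging",
    "finemapping/merged_210515/credset/",
    "credset/",
)
```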

LD table

  1. Create the study/variant/population LUT: the result of inner-joining the study table, the top loci table, and the population map (which maps GWAS ancestries to 1000 Genomes superpopulations)
  2. Create a text file with all variants contained in the above LUT.
    • Issue: instead of using bash to get the list of variant IDs in chrom:pos:ref:alt form, implement this step in calculate_r_using_plink and use Spark (see the sketch after this list).
  3. Calculate LD for the variant IDs in the LD LUT. This process is parallelized with up to 300 threads.
    • Problem: PLINK is called once per variant. Is it possible to input a file, as explained here?
  4. Generation of the LD dataset:
    • The above LUT is joined with the results of the LD analysis.
    • Study/population-weighted LD is calculated.
    • Credible set analysis with PICS is conducted using the top loci table; this is joined with the above table (LUT + LD).
    • The output is a table of the studies enriched with the LD and PICS results.
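
A hedged sketch combining the two suggestions above: build the variant list with Spark (step 2) and call PLINK once with a file of query variants instead of once per variant (step 3). The LUT path and column names are assumptions, as is the panel name; PLINK 1.9's --ld-snp-list flag reads query variants from a file, provided the IDs match those in the panel's .bim.

```python
# Sketch only: LUT path, column names, and the panel fileset name are
# assumptions made for illustration.
import subprocess
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

lut = spark.read.parquet("ld_lut/")

# chrom:pos:ref:alt variant IDs, deduplicated, written to a local file.
variant_ids = (
    lut.select(f.concat_ws(":", "chrom", "pos", "ref", "alt").alias("vid"))
    .distinct()
    .collect()
)
with open("variants.txt", "w") as fh:
    fh.writelines(row.vid + "\n" for row in variant_ids)

# One PLINK invocation per population panel replaces the per-variant calls.
subprocess.run(
    ["plink", "--bfile", "1000G_EUR",
     "--r", "--ld-snp-list", "variants.txt",
     "--ld-window-kb", "1000",
     "--out", "ld_results"],
    check=True,
)
```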
Jeremy37 commented 2 years ago

FYI, I fixed the bug that caused the pipeline to crash on invalid input ('\ufeff79254949') from a small number of studies. (https://github.com/opentargets/genetics-v2d-data/commit/d58d0d4c6c7e6822fe566ee129890ba82045f2b7) I informed GWAS Catalog of the studies with the data problem.

Re: the fine-mapping location, this has also been updated. (And in general it would need to be updated for each release.)

Regarding the LD table, I think that the best enhancement would be to get rid of PICS altogether; then you don't need any LD. There is very little benefit to the PICS method, since it assumes a single causal variant. You might as well just do standard WTCCC-style approximate Bayes factor fine-mapping, which wouldn't require an LD panel and so would be much more robust, and would be much faster.
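
For context, the ABF approach needs only per-variant betas and standard errors. A minimal sketch following Wakefield (2009), as used in e.g. coloc; the prior effect-size variance W = 0.15² is a conventional default for binary traits and an assumption here.

```python
# Minimal single-causal-variant ABF fine-mapping from summary stats;
# no LD panel required. W is the prior variance on the effect size
# (coloc's 0.15^2 default for binary traits is assumed here).
import numpy as np

def abf_posteriors(beta: np.ndarray, se: np.ndarray, W: float = 0.15**2) -> np.ndarray:
    """Posterior probability that each variant in a region is causal,
    assuming exactly one causal variant."""
    V = se**2
    r = W / (V + W)
    log_bf = 0.5 * (np.log(1 - r) + r * (beta / se) ** 2)  # log ABF vs. null
    log_bf -= log_bf.max()  # stabilise exponentiation
    bf = np.exp(log_bf)
    return bf / bf.sum()
```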

d0choa commented 1 year ago

@DSuveges shall we close this one?

DSuveges commented 1 year ago

Not all of v2d has been implemented so far, but we are so far out of the scope of this ticket that we should close it.