opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Incorporate scQTLs from Kaur #3332

Closed ireneisdoomed closed 3 weeks ago

ireneisdoomed commented 3 weeks ago

As a developer I want to process credible sets derived from single cell QTLs because this will enable more precise identification of eQTLs specific to particular cell types, improving our understanding of genetic regulation and its implications in disease and development.

Some context

scRNA-seq offers significant advantages over bulk RNA sequencing by allowing the study of gene expression at the resolution of individual cells. This helps in identifying eQTLs specific to particular cell types, capturing the diversity of gene expression patterns within and between cell types, and observing temporal dynamics of gene expression changes in specific cell types.

Data availability

We have credible sets from 11 studies: Aygun_2021, PISA, Walker_2019, Sun_2018, Randolph_2021, Perez_2022, OneK1K, Jerber_2021, Nathan_2022, Cytoimmgen, Kim-Hellmuth_2017.

Although initially available only through the Sanger farm, the latest credible sets have been public on the FTP since March 31st. This is convenient because we can ingest all results from a single location.

The results are served as compressed files containing summary statistics and SuSiE results, split into datasets that represent different quantification methods.

Data inspection

Each dataset includes:

QTD000564.lbf_variable.txt.gz

-RECORD 0-------------------------------------
 molecular_trait_id | ENSG00000182362
 region             | chr21:45286342-47286342
 variant            | chr21_45287004_G_A
 chromosome         | 21
 position           | 45287004
 lbf_variable1      | -1.9982103201721
 lbf_variable2      | -0.0529486746200503
 lbf_variable3      | -0.0497269339290685
 lbf_variable4      | -0.0248066525814772
 lbf_variable5      | -0.00783420001459678
 lbf_variable6      | -0.0017594415663138
 lbf_variable7      | -0.000282958881470341
 lbf_variable8      | -5.33928563872799e-06
 lbf_variable9      | 2.34509269450012e-05
 lbf_variable10     | 1.49791941113087e-05
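To make the layout of the record above concrete, here is a minimal sketch that reshapes one wide row into one row per credible set index. It assumes the `lbf_variable` file is tab-separated with a header row, which is not stated in this issue. The function name `melt_lbf_rows` and the output keys `credible_set_index` and `log_bayes_factor` are hypothetical and are not taken from the gentropy codebase.

```python
import csv
import io

# Column names and values copied from the record shown in the issue.
# ASSUMPTION: the real file is tab-separated with a header line.
FIELDS = ["molecular_trait_id", "region", "variant", "chromosome", "position"] + [
    f"lbf_variable{i}" for i in range(1, 11)
]
VALUES = [
    "ENSG00000182362", "chr21:45286342-47286342", "chr21_45287004_G_A",
    "21", "45287004",
    "-1.9982103201721", "-0.0529486746200503", "-0.0497269339290685",
    "-0.0248066525814772", "-0.00783420001459678", "-0.0017594415663138",
    "-0.000282958881470341", "-5.33928563872799e-06",
    "2.34509269450012e-05", "1.49791941113087e-05",
]
SAMPLE_TSV = "\t".join(FIELDS) + "\n" + "\t".join(VALUES) + "\n"

PREFIX = "lbf_variable"


def melt_lbf_rows(handle):
    """Yield one dict per (variant, credible set index) pair.

    Hypothetical helper: turns the wide lbf_variable1..N columns
    into a long format that is easier to group by credible set.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for column, value in row.items():
            if column.startswith(PREFIX):
                yield {
                    "molecular_trait_id": row["molecular_trait_id"],
                    "variant": row["variant"],
                    "credible_set_index": int(column[len(PREFIX):]),
                    "log_bayes_factor": float(value),
                }


long_rows = list(melt_lbf_rows(io.StringIO(SAMPLE_TSV)))
```

For a real file, the same function should accept the handle returned by `gzip.open(path, "rt")`, provided the tab-separated assumption holds.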



The study metadata is maintained in this [metadata table](https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/1994b5ba66b2e88e89ba91f5638e40ccd3985306/data_tables/dataset_metadata_upcoming.tsv#L720).

## Tasks
- [x] Rerun the job to sync between the FTP folder containing all results and our Google Cloud Bucket
- [x] Identify particularities between the single cell and the bulk derived results -> None in terms of input data
- [x] Ensure the current pipeline is adaptable to the new data -> In terms of schema, I have just renamed the column that referred to the tissue to `biosample`, a more generic name that works for both levels
- [ ] Add a literature reference per study. I've opened a [PR](https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/pull/40) to the eQTL Catalogue data so that it's easier to compare between releases by looking at the respective PMIDs. It is still pending approval
- [x] Rerun the QC to check that provided PIPs are correctly calculated
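The issue does not describe how the PIP QC is implemented, so the following is only a generic sketch of one way such a check could work. It assumes a uniform prior across the variants in a region, turns each single effect's log Bayes factors into posterior probabilities with a softmax, and combines the effects as PIP = 1 - prod(1 - alpha). All function names and the toy numbers are hypothetical.

```python
import math


def posterior_from_lbf(lbfs):
    """Softmax over log Bayes factors (ASSUMPTION: uniform prior over variants)."""
    max_lbf = max(lbfs)  # subtract the max for numerical stability
    weights = [math.exp(x - max_lbf) for x in lbfs]
    total = sum(weights)
    return [w / total for w in weights]


def pips_from_lbf_matrix(lbf_by_effect):
    """Combine single effects into per-variant PIPs.

    lbf_by_effect[l][j] is the log Bayes factor of variant j under effect l.
    """
    n_variants = len(lbf_by_effect[0])
    prob_not_causal = [1.0] * n_variants
    for lbfs in lbf_by_effect:
        alphas = posterior_from_lbf(lbfs)
        prob_not_causal = [p * (1.0 - a) for p, a in zip(prob_not_causal, alphas)]
    return [1.0 - p for p in prob_not_causal]


def flag_mismatches(provided, recomputed, tolerance=1e-6):
    """Return indices of variants whose provided PIP differs from the recomputed one."""
    return [
        i
        for i, (p, r) in enumerate(zip(provided, recomputed))
        if abs(p - r) > tolerance
    ]


# Toy input (hypothetical): 2 single effects x 3 variants.
toy_lbf = [
    [10.0, 0.0, 0.0],  # effect 1 strongly favours variant 0
    [0.0, 0.0, 0.0],   # effect 2 is uninformative
]
recomputed_pips = pips_from_lbf_matrix(toy_lbf)

# Pretend the upstream file reported these PIPs; the last one is perturbed on purpose.
provided_pips = [recomputed_pips[0], recomputed_pips[1], recomputed_pips[2] + 0.1]
mismatched = flag_mismatches(provided_pips, recomputed_pips)
```

Fine-mapping software may exclude some effects before computing PIPs, for example those with a negligible estimated prior variance. A mismatch flagged by a check like this is therefore a prompt for inspection and not necessarily an error in the provided values.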

More details on the results are available in the [PR](https://github.com/opentargets/gentropy/pull/630).