populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Add Ability to Remove Related Individuals from Background Dataset for PCA Analysis #807

Open michael-harper opened 5 days ago

michael-harper commented 5 days ago

Description:

This change targets the preprocessing step in which we prepare the background dataset. A common challenge in genetic studies, especially those involving PCA, is the influence of related individuals on the analysis: their presence can skew the results and lead to inaccurate interpretations. To address this, we've implemented a feature that allows related individuals to be removed explicitly from the background dataset before PCA is run (in addition to the existing option to remove related individuals from the dataset in question).

Key Changes:

Related Individuals Removal: Using a precomputed table of related individuals, relateds_to_drop_ht, we now filter these samples out of the background dataset. This ensures that PCA is run on a dataset free of relatedness bias.

Configurable Removal List: The list of individuals to remove is configurable and depends on the background datasets used. The background datasets MUST already have been run through the Relatedness stage.
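
For reviewers, here is a minimal sketch of the filtering logic described above. It is plain Python over sample ID lists, not the actual pipeline code. The function names, the dictionary layout, and the error raised for a background dataset with no drop list are all assumptions made for illustration. In Hail the analogous step would typically be an anti-join on the column key, something like `mt.filter_cols(hl.is_missing(relateds_to_drop_ht[mt.col_key]))`, but that is also an assumption about how this change is implemented.

```python
from typing import Dict, Iterable, List


def drop_related_samples(sample_ids: Iterable[str], relateds_to_drop: Iterable[str]) -> List[str]:
    """Return sample_ids in their original order, minus any ID in relateds_to_drop."""
    drop = set(relateds_to_drop)
    return [s for s in sample_ids if s not in drop]


def filter_background_datasets(
    background_samples: Dict[str, List[str]],
    relateds_by_dataset: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Apply each background dataset's own drop list (hypothetical layout).

    A dataset with no drop list is treated as not having been run through
    the Relatedness stage, so we fail loudly instead of silently keeping
    related samples in the PCA input.
    """
    filtered = {}
    for name, samples in background_samples.items():
        if name not in relateds_by_dataset:
            raise KeyError(f"No relateds-to-drop list for background dataset {name!r}")
        filtered[name] = drop_related_samples(samples, relateds_by_dataset[name])
    return filtered
```

Failing on a missing drop list is one way to enforce the Relatedness-stage requirement; a warning plus skip would be the more lenient alternative.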

KatalinaBobowik commented 5 days ago

This looks great. The only thing I would suggest is renaming relateds_to_drop to background_relateds_to_drop, to differentiate the background related samples/ht from the dataset relateds.