Add Functionality to Filter PCA Background Population in Large Cohort Pipeline Ancestry Stage

michael-harper commented 3 weeks ago

This PR enhances the ancestry stage of the large cohort pipeline. It allows a more granular PCA by subsetting the background population to individuals of a specific ancestry.

Changes include:

A new configuration option under [large_cohort.pca_background] named superpopulation_to_filter. This option lets users specify which superpopulation(s) of the background population to use for the PCA. If no population name is given, the option defaults to False.

Additional lines of code in ancestry_pca.py that filter the background_mt matrix table based on the superpopulation specified in the configuration.
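The filtering described above can be sketched in plain Python. This is a minimal illustration, not the code in ancestry_pca.py: the helper names (normalise_filter, filter_background), the record layout, and the sample IDs are all hypothetical, and plain dictionaries stand in for the columns of background_mt. It also assumes that a False value means the background is left unfiltered.

```python
from __future__ import annotations

# Hypothetical config value. In the pipeline's TOML config this would sit
# under [large_cohort.pca_background], for example:
#   superpopulation_to_filter = ["Africa", "South Asia"]
# The PR states that the option defaults to False when no name is given.


def normalise_filter(value) -> set[str]:
    """Turn the config value into a set of superpopulation names.

    False, None, or an empty list yields an empty set, which this sketch
    treats as "do not filter" (an assumption, not confirmed by the PR).
    """
    if not value:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def filter_background(samples, superpopulation_to_filter=False):
    """Keep only samples whose superpopulation is in the requested set."""
    wanted = normalise_filter(superpopulation_to_filter)
    if not wanted:
        return list(samples)
    return [s for s in samples if s["superpopulation"] in wanted]


# Illustrative records only; the field name "superpopulation" is assumed.
background = [
    {"s": "sample_1", "superpopulation": "Europe"},
    {"s": "sample_2", "superpopulation": "Africa"},
    {"s": "sample_3", "superpopulation": "South Asia"},
    {"s": "sample_4", "superpopulation": "Europe"},
]

kept = filter_background(background, ["Africa", "South Asia"])
print([s["superpopulation"] for s in kept])  # ['Africa', 'South Asia']
```

On a Hail matrix table the analogous step would likely be a column filter such as background_mt.filter_cols(hl.literal(wanted).contains(background_mt.superpopulation)), where the column field name superpopulation is again an assumption about the background dataset's schema.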

This lets users run the PCA against a specific superpopulation, giving a more detailed and targeted analysis.

michael-harper commented 3 weeks ago

Successful batch run here: https://batch.hail.populationgenomics.org.au/batches/455689/jobs/2

INFO:root:Superpopulations before filtering ['Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe',...]
INFO:root:Filtering background samples by ['Africa', 'South Asia']
INFO:root:Finished filtering background, kept samples that are ['Africa', 'South Asia']
INFO:root:Superpopulations after filtering ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa',...]

michael-harper commented 3 weeks ago

The cramqc was an oversight by me! I was working on a different branch and that snuck its way in.

populationgenomics / production-pipelines #789