sanger-tol / variantcalling

Nextflow DSL2 pipeline to call variants on long read alignment.
https://pipelines.tol.sanger.ac.uk/variantcalling
MIT License

Assess Popgen48/scalepopgen #80

Closed muffato closed 2 months ago

muffato commented 5 months ago

We need to review how many of our population genomics ideas Popgen48/scalepopgen can cover, to determine:

Links: poster

Summary

  1. All the different tools and analyses can be independently enabled.
  2. There were a couple of things to do to the input VCF files, but then the pipeline runs fine.
  3. We'd want to clarify whether it's going to be part of nf-core or not.
  4. We need to decide in which pipeline (scalepopgen or a new pipeline) the ROH and population size analyses should go.

Next developments

Based on the tests above, to use scalepopgen, we would want to:

hangxue-wustl commented 4 months ago

Requirements for input files:

  1. All VCF files need to be split by chromosome and indexed with tabix.
  2. The sample map has two tab-delimited columns and no header line: individual IDs in the first column and population IDs in the second.

vcf_input.csv:

    chrom,vcf,vcf_idx
    chr1,chrom1.vcf.gz,chrom1.vcf.gz.tbi
    chr2,chrom2.vcf.gz,chrom2.vcf.gz.tbi

sample.map:

    ind1	pop1
    ind2	pop1
    ind3	pop2
    ind4	pop2
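As a rough illustration of the two input files described above, here is a minimal Python sketch that writes the `vcf_input.csv` text and checks a sample map. The `split.<chrom>.vcf.gz` naming and the function names `vcf_input_csv` and `read_sample_map` are assumptions made for this example, not part of scalepopgen; whether the pipeline wants relative or absolute paths is not covered here.

```python
import csv
import io
import re
from pathlib import Path


def vcf_input_csv(vcf_paths):
    """Return the text of vcf_input.csv for a list of per-chromosome VCFs.

    Assumption: files are named split.<chrom>.vcf.gz (hypothetical convention)
    and each tabix index sits next to its VCF as <vcf>.tbi.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["chrom", "vcf", "vcf_idx"])
    for path in vcf_paths:
        name = Path(path).name
        match = re.fullmatch(r"split\.(.+)\.vcf\.gz", name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        writer.writerow([match.group(1), str(path), f"{path}.tbi"])
    return out.getvalue()


def read_sample_map(text):
    """Parse a header-less, tab-delimited sample map into {individual: population}."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(f"line {number}: expected 2 tab-delimited columns")
        individual, population = (field.strip() for field in fields)
        if individual in mapping:
            raise ValueError(f"line {number}: duplicate individual {individual}")
        mapping[individual] = population
    return mapping
```

Rejecting space-separated lines in `read_sample_map` is deliberate: the map is described as tab-delimited, so a copy-paste that turned tabs into spaces is caught before the pipeline is launched.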

Splitting the VCF file by chromosome:

    bcftools index -s mLutLut_renamed_autosomes_bisnps.vcf.gz | cut -f 1 | while read -r C; do
        bcftools view -O z -o "split.${C}.vcf.gz" mLutLut_renamed_autosomes_bisnps.vcf.gz "${C}"
    done
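The split step above does not index its outputs, although the input requirements ask for tabix indexes. A minimal Python sketch of the same split plus indexing could look like this. The function names and the `split` prefix are assumptions for illustration; the bcftools calls used are `bcftools view -O z -o` and `bcftools index --tbi`, and the input VCF is assumed to be indexed already.

```python
import subprocess


def split_commands(vcf, chromosomes, prefix="split"):
    """Build the bcftools commands that split `vcf` by chromosome and index each part.

    Returns a list of argv lists; nothing is executed here.
    """
    commands = []
    for chrom in chromosomes:
        out = f"{prefix}.{chrom}.vcf.gz"
        # Extract one chromosome into a bgzip-compressed VCF.
        commands.append(["bcftools", "view", "-O", "z", "-o", out, vcf, chrom])
        # Write a tabix (.tbi) index next to the new file.
        commands.append(["bcftools", "index", "--tbi", out])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to print and review the commands before running them with `run_all`.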

hangxue-wustl commented 4 months ago

Downloaded the supplementary data from https://doi.org/10.1093/molbev/msad207 and followed EurasianOtter_PopGen.html to obtain the vcf.gz files, rename the samples, and select only autosomes and biallelic SNPs for analysis. Split the VCF file by chromosome using bcftools, then ran:

    nextflow run scalepopgen -profile singularity -params-file /global/scratch/users/hangxue/otter/vcf_publication/jul4_parameters.yml -qs 10

See the output graphs at https://docs.google.com/presentation/d/1O8vFmYImrJd6p4pvSLyzwiMsf9fTAZSTaG_FJGLz8t8/edit#slide=id.p

hangxue-wustl commented 4 months ago

Tested PCA, Admixture, pairwise Fst and TreeMix in scalepopgen. These run successfully with minor modifications. Scalepopgen can also compute Tajima's D and search for selective sweeps (SweepFinder2), but plotting these two results requires the chromosome names to be integers. Of these analyses, SweepFinder2 takes the longest (~7 hr for the otter data), followed by Admixture (~1 hr). Additional potential analyses:

  1. ROH identification (e.g. RZooRoH)
  2. Population-size inference (e.g. GONE)
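
One possible way around the integer chromosome name requirement for the Tajima's D and SweepFinder2 plots is to rename the chromosomes before running the pipeline. The sketch below only generates the mapping file text. The function name is hypothetical, and numbering chromosomes 1..N in input order is an assumption; I have not checked that the plots accept the result.

```python
def integer_rename_map(chromosomes):
    """Return rename-file text mapping each chromosome name to 1..N in input order.

    Each line is "<old_name> <new_name>", the whitespace-separated format read by
    `bcftools annotate --rename-chrs`.
    """
    seen = set()
    lines = []
    for name in chromosomes:
        if name in seen:
            raise ValueError(f"duplicate chromosome name: {name}")
        seen.add(name)
        lines.append(f"{name} {len(lines) + 1}\n")
    return "".join(lines)
```

The text can be written to a file (say `chr_map.txt`, a placeholder name) and applied with `bcftools annotate --rename-chrs chr_map.txt -O z -o renamed.vcf.gz input.vcf.gz` before splitting by chromosome.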
muffato commented 3 months ago

Regarding the otter data, here is more information about the sample confusion that occurred during that project.

The label swaps were very visible on the admixture plots: see left (labels corrected) vs right (wrong labels).

[Image: Admixture]

In your pipeline run, only k=2 is a bit messy; all the other k are clean. I think you may have the correct labels, and the differences are due to different methods / parameters?

hangxue-wustl commented 3 months ago

I have double-checked the labels. I think the ones I am working with are labeled correctly. Yeah, I think the difference might be due to different software / parameters.