acorn_poolseq_pipeline

This pipeline is designed for the ingestion of short-read sequencing data and performs all the necessary steps (quality assessment, quality trimming, mapping to reference genome, variant calling) for the creation of standard VCF files which hold the information of all detected variants in the provided data set. VCF files can then be used for the exploration of adaptive genetic diversity through the outlier detection method provided by pcadapt R package.

Version management - container

The recommended way to run the pipeline is by using the poolseq_tools software container (https://github.com/nikostourvas/poolseq_tools/tree/main). This ensures that the scripts are run with the same software versions that were used during the development of the pipeline.

How to run this pipeline

In the following sections we describe the scripts that make up the bioinformatics pipeline and how we ran them. Where relevant, we describe the tests that were necessary to determine settings/thresholds.

Raw reads

Upon receiving raw reads, we converted any files compressed with bzip2 (.bz2) to gzip (.gz) format. The script bz2gzip.sh is handy for this, but it can also be done manually without much headache.
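
For reference, the manual route boils down to a one-liner like the following (file naming is assumed; bz2gzip.sh automates this step):

```bash
# Recompress bzip2 FASTQ files as gzip, keeping the base file name.
for f in *.fastq.bz2; do
    bzcat "$f" | gzip > "${f%.bz2}.gz"
done
```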

All files were placed in a single directory. A .txt file was prepared with one sample name per line. GNU parallel was used to run the fastp_dedup.sh script once per sample. The script runs fastp in two passes. The first pass generates a unique hash for each read and then removes duplicates with identical hashes, so that only one read remains per unique sequence. It also performs base correction based on paired-end read overlap analysis: mismatched bases with low quality on one of the strands are corrected. The second pass is a straightforward quality-trimming step. The trimming parameters in this script are set according to commonly accepted best practices: to pass, a read needs an average Phred score of 20 across sliding windows of 4 base pairs, and reads shorter than 25 bp are removed altogether.
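
The exact flags live in fastp_dedup.sh; a minimal sketch of what the two fastp passes might look like is given below (file names, intermediate naming and the assumption that the script takes the sample name as its only argument are illustrative):

```bash
# Not the actual fastp_dedup.sh; illustrative two-pass layout only.
SAMPLE=$1

# Pass 1: hash-based deduplication plus overlap-based base correction.
fastp --dedup --correction \
      -i ${SAMPLE}_R1.fastq.gz -I ${SAMPLE}_R2.fastq.gz \
      -o ${SAMPLE}_dedup_R1.fastq.gz -O ${SAMPLE}_dedup_R2.fastq.gz \
      --json ${SAMPLE}_dedup.json --html ${SAMPLE}_dedup.html

# Pass 2: sliding-window quality trimming (Q20 over 4 bp windows, min length 25 bp).
fastp --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
      --length_required 25 \
      -i ${SAMPLE}_dedup_R1.fastq.gz -I ${SAMPLE}_dedup_R2.fastq.gz \
      -o ${SAMPLE}_trim_R1.fastq.gz -O ${SAMPLE}_trim_R2.fastq.gz \
      --json ${SAMPLE}_trim.json --html ${SAMPLE}_trim.html
```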

The fastp_dedup.sh script generates a ton of reports. These are all gathered using the multiqc tool.
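
The per-sample iteration and the report aggregation can look roughly like this (sample-list and output names are assumptions, and the script is assumed to take the sample name as its only argument):

```bash
# Run fastp_dedup.sh once per sample listed in samples.txt (one name per line).
parallel -j 8 bash fastp_dedup.sh {} :::: samples.txt

# Gather all fastp reports into a single overview.
multiqc . -o multiqc_report
```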

Mapping

Provided nothing looks wrong in the reports generated by fastp and multiqc, the reads are mapped with bwa-mem2 using the script map.sh. We map to the Q. robur reference constructed by Plomion et al. (2018), with organelle sequences (mitochondrion, chloroplast) included in the reference.
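
map.sh handles the details; conceptually the mapping step boils down to something like the following (reference path, read-group string and thread counts are placeholders, not the script's actual settings):

```bash
# Index the reference (including organelle sequences); done once.
bwa-mem2 index reference/Qrobur_with_organelles.fa

# Map trimmed paired-end reads and coordinate-sort the alignments.
bwa-mem2 mem -t 16 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" \
    reference/Qrobur_with_organelles.fa \
    ${SAMPLE}_trim_R1.fastq.gz ${SAMPLE}_trim_R2.fastq.gz |
    samtools sort -@ 4 -o ${SAMPLE}.raw.bam -
```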

After mapping, we used Qualimap to assess the mapping success rate and the distribution of mapping quality. From the report generated by Qualimap, we observed a steep drop in mapping quality for reads aligned to shorter contigs.
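
A typical per-sample Qualimap invocation (output directory name is a placeholder):

```bash
# Per-sample mapping QC report.
qualimap bamqc -bam ${SAMPLE}.raw.bam -outdir qualimap/${SAMPLE}
```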

The output of map.sh is our set of 'raw' BAMs. These BAMs were filtered using the script bam_filtering_no_markdup.sh, which applies some basic hard filters (see script documentation) and removes any reads mapped to reference contigs shorter than 500 Kbp (a threshold based on inspecting the Qualimap reports). The script also indexes the BAMs for easier downstream processing.
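
The actual hard filters are documented inside bam_filtering_no_markdup.sh; the sketch below only illustrates the general shape of the step, with assumed MAPQ and flag cut-offs and the 500 Kbp contig-length threshold mentioned above:

```bash
# Keep only contigs >= 500 kbp, using the lengths in the reference .fai index.
awk '$2 >= 500000 {print $1"\t0\t"$2}' reference/Qrobur_with_organelles.fa.fai \
    > keep_contigs.bed

# Illustrative hard filters: MAPQ >= 20, drop unmapped/secondary/supplementary
# reads (0x904); the real values are documented in the script.
samtools view -b -q 20 -F 0x904 -L keep_contigs.bed ${SAMPLE}.raw.bam \
    > ${SAMPLE}.filtered.bam
samtools index ${SAMPLE}.filtered.bam
```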

Variant calling

The variant calling step is not performed on the entire genome at once, but on arbitrarily defined regions, or 'chunks', of the genome. The chunks are defined using the Simplified_Chunks.sh script. The user chooses an approximate chunk size (arbitrarily) and the script outputs chunks of the genome of that size or smaller, written as single-line BED files that the variant calling script can use to locate the corresponding region. The files describing the chunks are stored with the reference.
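
Simplified_Chunks.sh is the actual implementation; an equivalent sketch with bedtools looks like this (the 10 Mbp window size is an arbitrary example):

```bash
# Cut the genome into windows of at most 10 Mbp, then write one
# single-line BED file per chunk.
bedtools makewindows \
    -g <(cut -f1,2 reference/Qrobur_with_organelles.fa.fai) -w 10000000 \
    > all_chunks.bed
split -l 1 -d -a 4 --additional-suffix=.bed all_chunks.bed chunk_
```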

The variant calling is performed using the script variant_calling.sh. It runs across all pool-sequenced samples, starting one computing job per chunk of the genome. In practical terms this means that variant calling can be sped up in proportion to the number of cores available on your system. The script outputs two VCFs per chunk: one for SNPs and one for INDELs.
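
Since VarScan is mentioned further down, the per-chunk calling presumably follows the familiar mpileup-plus-VarScan pattern; the sketch below is illustrative only (the BAM list, quality cut-offs, VarScan.jar path and default VarScan options are assumptions, not the settings of variant_calling.sh):

```bash
CHUNK=chunk_0001.bed

# Multi-sample pileup restricted to one chunk of the genome.
samtools mpileup -f reference/Qrobur_with_organelles.fa \
    -l ${CHUNK} -b pool_bams.txt -q 20 -Q 20 -B \
    > ${CHUNK%.bed}.pileup

# Call SNPs and INDELs separately, producing the two per-chunk VCFs.
java -jar VarScan.jar mpileup2snp   ${CHUNK%.bed}.pileup --output-vcf 1 > ${CHUNK%.bed}.snp.vcf
java -jar VarScan.jar mpileup2indel ${CHUNK%.bed}.pileup --output-vcf 1 > ${CHUNK%.bed}.indel.vcf
```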

A separate variant calling step is performed for the individually sequenced samples with the script variant_calling_individuals.sh. It outputs one SNP VCF and one INDEL VCF per chunk, just like variant_calling.sh. This step is repeated for different subsetting scenarios.

After calling variants, we run post_variant_calling.sh. This script manages several processing steps by initiating three separate bash scripts that each perform a single task. It can handle the output of either variant_calling.sh or variant_calling_individuals.sh. This step is repeated for different subsetting scenarios. In order, the scripts initiated by post_variant_calling.sh are:

1. Concatenate_VCFs.sh takes all chunk-level VCFs and concatenates them into one file, for both the SNP and the INDEL VCFs.
2. snp_indel_rm.sh removes all SNPs in the SNP VCF that are found in close proximity to INDELs in the INDEL VCF, and also performs some basic quality filtering (see script).
3. vcf_biallelic.sh removes all SNPs that are not biallelic and outputs some basic statistics about the VCF.
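
The three scripts encapsulate these operations; a rough bcftools-based equivalent is sketched below (the 10 bp gap, file names and the absence of extra quality filters are assumptions, not the scripts' actual settings):

```bash
# 1) Concatenate chunk-level VCFs, separately for SNPs and INDELs.
bcftools concat -f snp_vcf_list.txt   -Oz -o all.snp.vcf.gz
bcftools concat -f indel_vcf_list.txt -Oz -o all.indel.vcf.gz

# 2) Remove SNPs close to INDELs: build +/-10 bp flanks around every INDEL
#    and exclude SNPs that fall inside them (10 bp is an example value).
bcftools query -f '%CHROM\t%POS0\t%END\n' all.indel.vcf.gz |
    bedtools slop -i stdin -b 10 \
        -g <(cut -f1,2 reference/Qrobur_with_organelles.fa.fai) \
    > indel_flanks.bed
bcftools view -T ^indel_flanks.bed all.snp.vcf.gz -Oz -o all.snp.noindel.vcf.gz

# 3) Keep only biallelic SNPs.
bcftools view -m2 -M2 -v snps all.snp.noindel.vcf.gz -Oz -o all.snp.biallelic.vcf.gz
```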

SNP filtering

Before further filtering of the VCFs, we constructed a mask using the HDplot method. This mask is used to remove sites that show excess heterozygosity (i.e. paralogs mapping to the same site). This can only be done on the individually sequenced samples. The first step in constructing this mask is to extract read depths from the individuals VCF using a modified Python script: vcf_to_depth_VarScanCompatible_HDplot_ACORN.py. The read depths are then used to calculate heterozygosity with the script HDplot_python_ACORN.py, which also outputs several diagnostic graphs.

We found that not many sites display excess heterozygosity. We implemented a stringent threshold of D = 10 and a maximum H of 0.6. The mask we construct does not use a windowed approach; it only removes the specific sites that display excess heterozygosity, so as to avoid removing too many usable sites. Since excess heterozygosity does not appear to be our biggest problem, we are happy to remove only the sites that we are certain show excess heterozygosity.
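
How the mask file is assembled depends on the output format of the HDplot script; purely as a hypothetical illustration, if it wrote a table with CHROM, POS, H and D columns, the flagged sites could be collected into a positions mask like this:

```bash
# Hypothetical: keep positions with H > 0.6 or |D| > 10 in a two-column mask
# (the column order is an assumption about the HDplot output, not its real format).
awk 'NR > 1 && ($3 > 0.6 || $4 > 10 || $4 < -10) {print $1"\t"$2}' \
    hdplot_results.tsv > hdplot_mask.txt
```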

All downstream VCF filtering is managed by VCF_Basic_filtering.sh. This step is repeated for different subsetting scenarios. The script takes an unfiltered VCF and user-specified parameters for minimum read depth (20), maximum degree of missingness per SNP (7.5%) and a minimum threshold for the frequency of the minor allele (5%). The script also calculates the maximum read depth threshold (mean RD + 2 x standard deviation of RD, which translates to removing the top ~2.45% of sites based on RD) and applies it. Finally, the script applies the HDplot mask.
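
The real logic lives in VCF_Basic_filtering.sh; a compressed sketch of the same filters using vcftools is shown below (file names are placeholders, and applying the minimum read depth per genotype rather than per site is an assumption):

```bash
VCF=all.snp.biallelic.vcf.gz

# Per-site mean depth, used to derive the max-depth cutoff (mean RD + 2 x SD).
vcftools --gzvcf "$VCF" --site-mean-depth --out depth
MAXDP=$(awk 'NR>1 {s+=$3; ss+=$3*$3; n++}
             END {m=s/n; sd=sqrt(ss/n - m*m); print m + 2*sd}' depth.ldepth.mean)

# Apply depth, missingness, MAF and HDplot-mask filters (threshold values as above).
vcftools --gzvcf "$VCF" \
    --minDP 20 --max-meanDP "$MAXDP" \
    --max-missing 0.925 \
    --maf 0.05 \
    --exclude-positions hdplot_mask.txt \
    --recode --recode-INFO-all --out final_filtered

# Per-sample missingness report (.imiss), used to check against the 10% limit.
vcftools --vcf final_filtered.recode.vcf --missing-indv --out final_filtered
```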

The threshold for each of the user-specified parameters was determined by trial and error. Each candidate output was evaluated using a combination of the script AF_validation.Rmd and the per-sample missingness output of VCF_Basic_filtering.sh (.imiss files).

The thresholds that were used to construct final VCFs were as follows:

min RD = 20: Ideally this threshold would be set to 40, to account for the on average 20 x 2 = 40 alleles that are present in a pool and can theoretically contribute reads. A lack of reads forces us to set the threshold much lower. We see little meaningful increase in AF accuracy as the minimum read depth is increased to, for instance, 25 or 30, but we do see a significant drop in the number of SNPs that are retained. Since our downstream analyses are exploratory in nature, trading marginal gains in accuracy for large losses in breadth of coverage is not worthwhile.

max per-SNP missingness = 0.075 (7.5%): This threshold was informed by the degree of missingness per sample. For imputation in LFMM, per-sample missingness is ideally kept under 10%. Since we cannot afford to lose samples (which would reduce the power of our downstream analyses), we have to allow for some missingness at the per-SNP level. 7.5% was the lowest threshold we could apply across all relevant VCFs such that per-sample missingness never exceeded 10%.

min frequency of the minor allele = 0.05 (5%): This threshold is purely mathematical: since the minimum per-sample number of reads at any locus is 20, the smallest contribution a real allele can make is a single read, i.e. 5% of the reads. Any lower and it becomes very likely that the call is a sequencing error rather than an actual allele. In general, a MAF filter of 5% is frequently used in outlier tests.

The script Make_Frequency_Table.sh was used to generate allele frequency tables corresponding to each of the VCFs. This script is entirely custom to our application of poolseq/VarScan. These allele frequency tables can be used for downstream analyses using e.g. LFMM.
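
Make_Frequency_Table.sh is entirely custom, but conceptually it boils down to pulling the per-sample allele frequency field that VarScan writes into its VCFs; something in this spirit (not the actual script, file names are placeholders):

```bash
# Per-sample allele frequencies; VarScan stores them in the FORMAT/FREQ field.
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT[\t%FREQ]\n' \
    final_filtered.recode.vcf > allele_frequencies.tsv
```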

Validation of allele frequencies recovered by Pool-seq

To evaluate whether allele frequencies (AF) were accurately recovered from Pool-seq data, the trees comprising populations 305 and 312 were sequenced individually in addition to being pool-sequenced. These two populations were selected as representative examples of a population with samples of typical DNA concentration (305, ~40 ng/μL) and of low DNA concentration (312, ~8 ng/μL). All the filters described in the SNP filtering section were applied to both the individuals VCF and the pool VCF, and no missing data were retained. AF for both data sets were extracted using a custom script (Make_Frequency_Table.sh). The accuracy of the Pool-seq-derived AF was assessed by calculating correlation coefficients and mean absolute errors. To graphically validate the results, we plotted (i) 2D histograms of AF from individual and pool sequencing, (ii) pool-seq minor allele frequency vs AF differences (Pool AF - Ind AF), and (iii) unfolded AF spectra. To investigate whether increasing the read depth threshold was associated with a substantial increase in AF accuracy, the above comparisons were performed for SNPs called with minimum read depths of 20 and 30.
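
The comparison itself was done in AF_validation.Rmd; purely as a hypothetical command-line illustration of the two accuracy metrics, given two sorted two-column tables (site ID, AF) for pool-seq and individual sequencing, the Pearson correlation and mean absolute error could be computed like this:

```bash
# Hypothetical sketch, not AF_validation.Rmd: join the two AF tables on site ID
# and compute Pearson's r and the mean absolute error of the pool-seq AF.
join pool_af.sorted.txt ind_af.sorted.txt | awk '
    { x = $2; y = $3; n++
      sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y
      mae += (x > y ? x - y : y - x) }
    END {
      r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
      printf "r = %.3f  MAE = %.4f\n", r, mae/n
    }'
```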