ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

invariant vcf with ALT "*" instead of "." #105

Open macurqcron opened 4 months ago

macurqcron commented 4 months ago

I have an allsites vcf, containing invariant and variant sites, that I have properly filtered (e.g., missing data, depth). Since I used vcftools at a certain stage [to update, I have tried without using vcftools and using only bcftools and the invariant sites still are transformed to "*"], the "." symbols in the ALT column (what bcftools recognizes as an invariant site) have been changed to "*". When running pixy, I got the warning/solution "ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'."

My question is, if I move forward with --bypass_invariant_check 'yes' will that impact the algorithms ability to differentiate between variant and invariant sites? And if so, is the best way forward to figure out how to change instances of "*" to "." in my allsite vcf ALT column.

macurqcron commented 4 months ago

Just to follow up - I was unaware github also had "closed issues" before posting. Seems like similar questions have been asked and answered there about --bypass_invariant_check, but my question does still seem unique so I will leave it up

AliBasuony2022 commented 1 month ago

Hi,

Sorry, I can't help with this issue. I'm just wondering how did you filtered invariants? Did you split the all sites vcf to variant and invariants, and then filtered each one separately followed by concatenate in one vcf file? Filtering of variants (SNPs) seems straitforward, but I'm just confused how to filter invariants. I have an invariants file with many missing data and when filtered it dropped more than 99% of the data ( from 116,000,000 to 1,000,000 lines).

Any help will be appreciable

thanks Ali

macurqcron commented 1 week ago

Hi Ali,

I did split the allsites vcf to variant and invariant flies. I already had a variant file I made previously, so I focussed on creating the invariant file from the allsites vcf. Once I had the two files, I did concatenate them to create the filtered allsites vcf to use with pixy.

First, I had to create and allsites vcf by calling allsites (variant and invariant SNPs) during the SNP calling step. This is the code I used:

module load bwa/0.7.17
module load samtools/1.16.1
module load gatk/4.2.4.0
module load vcftools/0.1.16

name=$( awk "NR==$SLURM_ARRAY_TASK_ID" ./ref_genome/chromosomelist_vcf.txt)

gatk GenotypeGVCFs \
    -R ./ref_genome/LF10g_v2.0.fa \
    -V gendb://db/ch3_"$name" \
    -all-sites -L "$name" -O ./vcf/allsites_"$name".vcf.gz

I parallelized my code on SLURM, the first line after loading the modules just refers to a text file that has the individual chromosome names from my reference genome. The important addition here is -all-sites -L "$name" without that the code would only keep variant sites (perhaps explaining why your invariant file was missing so much data?).

To make my invariant vcf, I ended up creating a series of lists and then filtering based on those lists since trying to directly filter invariant sites using bcftools or vcftools wasn't working. The key thing I used was invariant sites are denoted by "*" in the ALT column.

First, I filtered the allsites vcf based on quality (also tried to filter MAF = 0, but didn't work to remove variant sites)

module load bcftools/1.16

cd vcf

    bcftools view \
    -S ^filtered_indv_mean_depth.txt
    -i 'MAF = 0 && F_PASS(FORMAT/DP>0) > 0.8' \
    -m 2 \
    -M 2 \
    -O z \
    allsites_ch3_full_genome.vcf.gz > allsites_ch3_filtered_md.vcf.gz

Then, to create my lists, I used my already made variant vcf file as a guideline for different filtering thresholds

bcftools view -H allsites_ch3_filtered_md.vcf.gz  | wc -l 
# 338166
# count # of sites in variant vcf
bcftools view -H ch3_full_genome_filtered_ql.vcf.gz | wc -l
# 186746

# extracting chromosome, position, REF, ALT column data
bcftools query -f '%CHROM %POS %REF %ALT\n' allsites_ch3_filtered_md.vcf.gz -o allsites_ch3_filtered_md.txt

# creating text file filtered to only include invariant sites (defined by ALT = "*")
awk '$4 == "*"' allsites_ch3_filtered_md.txt > allsites_invariantsites_ch3_filtered_md.txt

# 
wc -l allsites_invariantsites_ch3_filtered_md.txt
# 98407 sites are invariant (after filtering for missing data)

# created a list with specific sites that are invariant to filter by 
awk '{print $1 "\t" $2}' allsites_invariantsites_ch3_filtered_md.txt > invariant_sites_md_filter.txt

then, I used bcftools to create my invariant vcf by filtering my allsites vcf for those sites specifically, plus removing individuals with missing data that I knew based on my previous work creating my variant vcf

bcftools view \
    -T invariant_sites_md_filter.txt \
    -S ^filtered_indv_mean_depth.txt \
    -O z \
     allsites_ch3_filtered_md.vcf.gz > invariant_ch3_filtered_md_indv.vcf.gz

now I have an invariant vcf that has been filtered for missing data per site (-T command), missing data and low depth per individual (-S command). I will examine what individual and site depth looks like, to further filter the data by site depth, and then I will concatenate the invariant file with my previously filtered variant file (that has been used for all other analyses)

module load StdEnv/2020
module load vcftools/0.1.16

## Calculate mean depth per individual 
vcftools --gzvcf invariant_ch3_filtered_md_indv --depth --out invariant_ch3_filtered_md_indv

## Calculate mean depth per site
vcftools --gzvcf invariant_ch3_filtered_md_indv --site-mean-depth --out invariant_ch3_filtered_md_indv

will generate a list of sites within a certain threshold (see what I used to filter variant data OR look at distributions in R) and will filter the invariant data (.ldepth.mean text file) based on that information to have my final invariant vcf (filtered by depth) that I will then concatenate to my filtered vcf (depth, quality, missing data and missing/low depth individuals)

filtering values by column 3, selecting only the first two columns to print, and removing the header. Here I am filtering by a minimum depth of 10 and a maximum depth of 55, and this was selected based on my mean depth (~25x)

awk 'NR==1 {print $1 "\t" $2}; NR>1 && $3 > 10 && $3 < 55 {print $1 "\t" $2}' <(tail -n +2 invariant_ch3_filtered_md_indv.ldepth.mean) > invariant_sites_depth_filter.txt

# filter invariant vcf to specific depth thresholds using list from above
bcftools view \
    -T invariant_sites_depth_filter.txt \
    -O z \
     invariant_ch3_filtered_md_indv.vcf.gz > invariant_ch3_filtered_md_indv_depth.vcf.gz

 # index vcf files
# invariant vcf
     bcftools index invariant_ch3_filtered_md_indv_depth.vcf.gz 
# variant vcf
     bcftools index ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz

# final step, concatenating the two files for a filtered, allsites vcf! 

bcftools concat \
--allow-overlaps \
invariant_ch3_filtered_md_indv_depth.vcf.gz ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz \
-O z -o pixy_invariant_variant_filtered.vcf.gz

# index your allsites filtered vcf
tabix pixy_invariant_variant_filtered.vcf.gz

now you have a file ready for pixy here is my code for running pixy, note you will need the line --bypass_invariant_check 'yes' since the invariant sites are not coded the way pixy expects, but they exist.

pixy --stats pi dxy \
--vcf pixy_ch3_invariant_variant_filtered.vcf.gz \
--populations pixy_PopFile_ch3.txt \
--window_size 1000000 \
--n_cores 8 \
--bypass_invariant_check 'yes' \
--chromosomes 'LF10_chr1,LF10_chr2,LF10_chr3,LF10_chr4,LF10_chr5,LF10_chr6,LF10_chr7,LF10_chr8'

Hope this helps, I understand this might be overwhelming. But, it is how I accomplished generating an allsites vcf