ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
120 stars 14 forks source link

invariant vcf with ALT "*" instead of "." #105

Open macurqcron opened 7 months ago

macurqcron commented 7 months ago

I have an allsites vcf, containing invariant and variant sites, that I have properly filtered (e.g., missing data, depth). Since I used vcftools at a certain stage [to update, I have tried without using vcftools and using only bcftools and the invariant sites still are transformed to "*"], the "." symbols in the ALT column (what bcftools recognizes as an invariant site) have been changed to "*". When running pixy, I got the warning/solution "ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'."

My question is, if I move forward with --bypass_invariant_check 'yes' will that impact the algorithms ability to differentiate between variant and invariant sites? And if so, is the best way forward to figure out how to change instances of "*" to "." in my allsite vcf ALT column.

macurqcron commented 7 months ago

Just to follow up - I was unaware github also had "closed issues" before posting. Seems like similar questions have been asked and answered there about --bypass_invariant_check, but my question does still seem unique so I will leave it up

AliBasuony2022 commented 4 months ago

Hi,

Sorry, I can't help with this issue. I'm just wondering how did you filtered invariants? Did you split the all sites vcf to variant and invariants, and then filtered each one separately followed by concatenate in one vcf file? Filtering of variants (SNPs) seems straitforward, but I'm just confused how to filter invariants. I have an invariants file with many missing data and when filtered it dropped more than 99% of the data ( from 116,000,000 to 1,000,000 lines).

Any help will be appreciable

thanks Ali

macurqcron commented 3 months ago

Hi Ali,

I did split the allsites vcf to variant and invariant flies. I already had a variant file I made previously, so I focussed on creating the invariant file from the allsites vcf. Once I had the two files, I did concatenate them to create the filtered allsites vcf to use with pixy.

First, I had to create and allsites vcf by calling allsites (variant and invariant SNPs) during the SNP calling step. This is the code I used:

module load bwa/0.7.17
module load samtools/1.16.1
module load gatk/4.2.4.0
module load vcftools/0.1.16

name=$( awk "NR==$SLURM_ARRAY_TASK_ID" ./ref_genome/chromosomelist_vcf.txt)

gatk GenotypeGVCFs \
    -R ./ref_genome/LF10g_v2.0.fa \
    -V gendb://db/ch3_"$name" \
    -all-sites -L "$name" -O ./vcf/allsites_"$name".vcf.gz

I parallelized my code on SLURM, the first line after loading the modules just refers to a text file that has the individual chromosome names from my reference genome. The important addition here is -all-sites -L "$name" without that the code would only keep variant sites (perhaps explaining why your invariant file was missing so much data?).

To make my invariant vcf, I ended up creating a series of lists and then filtering based on those lists since trying to directly filter invariant sites using bcftools or vcftools wasn't working. The key thing I used was invariant sites are denoted by "*" in the ALT column.

First, I filtered the allsites vcf based on quality (also tried to filter MAF = 0, but didn't work to remove variant sites)

module load bcftools/1.16

cd vcf

    bcftools view \
    -S ^filtered_indv_mean_depth.txt
    -i 'MAF = 0 && F_PASS(FORMAT/DP>0) > 0.8' \
    -m 2 \
    -M 2 \
    -O z \
    allsites_ch3_full_genome.vcf.gz > allsites_ch3_filtered_md.vcf.gz

Then, to create my lists, I used my already made variant vcf file as a guideline for different filtering thresholds

bcftools view -H allsites_ch3_filtered_md.vcf.gz  | wc -l 
# 338166
# count # of sites in variant vcf
bcftools view -H ch3_full_genome_filtered_ql.vcf.gz | wc -l
# 186746

# extracting chromosome, position, REF, ALT column data
bcftools query -f '%CHROM %POS %REF %ALT\n' allsites_ch3_filtered_md.vcf.gz -o allsites_ch3_filtered_md.txt

# creating text file filtered to only include invariant sites (defined by ALT = "*")
awk '$4 == "*"' allsites_ch3_filtered_md.txt > allsites_invariantsites_ch3_filtered_md.txt

# 
wc -l allsites_invariantsites_ch3_filtered_md.txt
# 98407 sites are invariant (after filtering for missing data)

# created a list with specific sites that are invariant to filter by 
awk '{print $1 "\t" $2}' allsites_invariantsites_ch3_filtered_md.txt > invariant_sites_md_filter.txt

then, I used bcftools to create my invariant vcf by filtering my allsites vcf for those sites specifically, plus removing individuals with missing data that I knew based on my previous work creating my variant vcf

bcftools view \
    -T invariant_sites_md_filter.txt \
    -S ^filtered_indv_mean_depth.txt \
    -O z \
     allsites_ch3_filtered_md.vcf.gz > invariant_ch3_filtered_md_indv.vcf.gz

now I have an invariant vcf that has been filtered for missing data per site (-T command), missing data and low depth per individual (-S command). I will examine what individual and site depth looks like, to further filter the data by site depth, and then I will concatenate the invariant file with my previously filtered variant file (that has been used for all other analyses)

module load StdEnv/2020
module load vcftools/0.1.16

## Calculate mean depth per individual 
vcftools --gzvcf invariant_ch3_filtered_md_indv --depth --out invariant_ch3_filtered_md_indv

## Calculate mean depth per site
vcftools --gzvcf invariant_ch3_filtered_md_indv --site-mean-depth --out invariant_ch3_filtered_md_indv

will generate a list of sites within a certain threshold (see what I used to filter variant data OR look at distributions in R) and will filter the invariant data (.ldepth.mean text file) based on that information to have my final invariant vcf (filtered by depth) that I will then concatenate to my filtered vcf (depth, quality, missing data and missing/low depth individuals)

filtering values by column 3, selecting only the first two columns to print, and removing the header. Here I am filtering by a minimum depth of 10 and a maximum depth of 55, and this was selected based on my mean depth (~25x)

awk 'NR==1 {print $1 "\t" $2}; NR>1 && $3 > 10 && $3 < 55 {print $1 "\t" $2}' <(tail -n +2 invariant_ch3_filtered_md_indv.ldepth.mean) > invariant_sites_depth_filter.txt

# filter invariant vcf to specific depth thresholds using list from above
bcftools view \
    -T invariant_sites_depth_filter.txt \
    -O z \
     invariant_ch3_filtered_md_indv.vcf.gz > invariant_ch3_filtered_md_indv_depth.vcf.gz

 # index vcf files
# invariant vcf
     bcftools index invariant_ch3_filtered_md_indv_depth.vcf.gz 
# variant vcf
     bcftools index ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz

# final step, concatenating the two files for a filtered, allsites vcf! 

bcftools concat \
--allow-overlaps \
invariant_ch3_filtered_md_indv_depth.vcf.gz ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz \
-O z -o pixy_invariant_variant_filtered.vcf.gz

# index your allsites filtered vcf
tabix pixy_invariant_variant_filtered.vcf.gz

now you have a file ready for pixy here is my code for running pixy, note you will need the line --bypass_invariant_check 'yes' since the invariant sites are not coded the way pixy expects, but they exist.

pixy --stats pi dxy \
--vcf pixy_ch3_invariant_variant_filtered.vcf.gz \
--populations pixy_PopFile_ch3.txt \
--window_size 1000000 \
--n_cores 8 \
--bypass_invariant_check 'yes' \
--chromosomes 'LF10_chr1,LF10_chr2,LF10_chr3,LF10_chr4,LF10_chr5,LF10_chr6,LF10_chr7,LF10_chr8'

Hope this helps, I understand this might be overwhelming. But, it is how I accomplished generating an allsites vcf

AliBasuony2022 commented 3 weeks ago

Hi Mackenzie,

First of all, I'm sorry for late response and thanks so much for the detailed expalanition.

I'm still stucking in the first step and I think I'm not fully understand how to parallelize my code on SLURM. I tried alot of things, but all of them gave an errors. Please see some attached examples of error files.

The errors seem to be related to two main things:

1) Using of awk, (I have googled about this but didn't find a solution, sorry).

error: awk: cmd. line:2: NR== awk: cmd. line:2: ^ unexpected newline or end of string

2) linking the chromosome names to those in the database (my-database_12) directory.

error: A- A USER ERROR has occurred: Badly formed genome unclippedLoc: Failed to parse Genome Location string: : intervalQueryString should not be empty.

When using "-V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12 \"

B- A USER ERROR has occurred: GenomicsDB workspace drivingVariantFile:gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/mydatabase12 does not exist

When using " -V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/mydatabase12"$name"

I think the most non-understandable for me is 'ch3' in " V gendb://db/ch3"$name" \". Is it the prefix? similar to NC_ in 'NC_051844.1$1$3937623' in my-database-12 file.

What is the correct .txt file format of the four attached files to be linked to those in "my-database_12"?

I have attached my bash script and other files for your reference.

Once agian, thanks for your help and feel free to refuse to help if you are busy.

Yours sincerely, Ali

to macurqcron.zip