samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
662 stars 240 forks source link

bcftools reference #2181

Closed felipeiani closed 4 months ago

felipeiani commented 4 months ago

For genomic analyses, primarily but not exclusively viral, we have been using a pipeline called Viralunity (https://github.com/filiperomero2/ViralUnity). However, I have noticed that depending on the reference used, the results vary significantly. For example, at some sites, I may have the nucleotide "a" sequenced at a depth of 1000X and "c" only 5X. If the reference is "c", the consensus will come out as "C". The command we are using within the script is:

# Get preliminary consensus sequence
echo "Inferring consensus sequences..."
bcftools consensus -f $reference $name.calls.norm.flt-indels.vcf.gz > $name.temp.consensus.fa

# Calculate coverage (depth) statistics
bedtools genomecov -bga -ibam $name.sorted.bam > $name.table_cov.txt
bedtools genomecov -d -ibam $name.sorted.bam > $name.table_cov_basewise.txt

# Mask sites if necessary
awk '$4 < '$minimum_depth'' $name.table_cov.txt > $name.table_coverage_min-cov-$minimum_depth.txt
sites_to_mask=$(wc -l $name.table_coverage_min-cov-$minimum_depth.txt | sed 's/\s.*//')
if [ $sites_to_mask -gt 0 ]
then
    bedtools maskfasta -fi $name.temp.consensus.fa -fo masked.$name.consensus.fa -bed $name.table_coverage_min-cov-$minimum_depth.txt
else
    cp $name.temp.consensus.fa masked.$name.consensus.fa
fi

# Rename consensus sequence with sample name
sed -i "s+$reference_name+$name+" masked.$name.consensus.fa

# Compute statistics and write to the stats report
writestats

# Compress fastq and move files to their respective directories
gzip *fastq
clean

echo ""
date
date >> ../timereport.txt

# Go to the previous directory
cd ../

Is that so? Is there any additional command we could add to set a minimum depth or frequency? Or perhaps specify that the consensus should have a minimum proportion, such as 70% for the most frequent base and 30% for the least frequent, to be considered true?

Sinceraly,

pd3 commented 4 months ago

In order to keep the program relatively simple, consensus merely applies variants from the VCF regardless of their depth, VAF, AF, or other properties. It does support the -i/-e filtering options, but if more more advanced logic is required, the VCF has to be preprocessed with other tools, such as the +setGT plugin (http://samtools.github.io/bcftools/howtos/plugin.setGT.html). I suggest you use that to enforce the desired genotype.

To cross reference, this issue is similar to #1488 or https://github.com/samtools/bcftools/issues/1516#issuecomment-875179854.