samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Help separating alleles for 'merge' when ploidy >2 #2119

Closed IsaacTollestrup closed 4 months ago

IsaacTollestrup commented 7 months ago

I am a beginner.

I want to merge my annotated .vcf files to look at alll SNPs in Variant Effect Predictor. Some of my .vcf files are ploidy 3, 4 and 5. They were separated out before variant calling.

bcftools merge doesn't work when ploidy is greater than 2. So I'm trying to separate out all haplotypes with the workaround on the FAQ page:

"Note that the program only works with ploidy 1 or 2, so if defined as Number=G and the ploidy is bigger, the program will fail. If the tag is not important for your analysis, a quick and dirty workaround is to remove the tag from the VCF completely (bcftools annotate -x TAG)."

I was hoping to use this with my list file (is there a way to do this?) but I couldn't find a way. Instead I used ChatGPT to help write a script (I'm sorry):

#!/bin/bash

# Change to the working directory for output .vcf.gz files
cd /nesi/nobackup/vuw03939/combined_vcf/

# Display the current directory path for troubleshooting
pwd

# Path to the directory containing your .vcf.gz files
vcf_dir="../nextflow/liti_3iploid/sarek_anno_3ploid_snpeff/annotation"

# Output directory for modified files
output_dir="./combined_3ploid"
mkdir -p "$output_dir"

# List of .vcf.gz files
file_list="3ploid.list"

# Display the file list for troubleshooting
echo "Files to process:"
cat "$file_list"

# Loop through each file in the list and remove all tags
while IFS= read -r vcf_file || [ -n "$vcf_file" ]; do
# Extract just the filename without the directory path
filename=$(basename "$vcf_file")

# Generate the output file name by removing the ".vcf.gz" extension and adding "_removed.vcf.gz"
output_file="${filename%.vcf.gz}_removed.vcf.gz"

# Use BCFTools to remove all tags
bcftools annotate -x INFO -O z -o "${output_dir}/${output_file}" "${vcf_dir}/${vcf_file}"

# Construct the path to the corresponding TBI file using the directory of the VCF file
tbi_file="${vcf_dir}/${vcf_file}.tbi"

# Check if the corresponding TBI file exists and then copy it
if [ -e "$tbi_file" ]; then
    cp "$tbi_file" "${output_dir}/${output_file}.tbi"
else
    echo "Warning: ${tbi_file} file not found"
fi
done < "$file_list"

This achieves the removal of, in this case, INFO - which works. I have tried other options like TAG (get error: No matching tag in -x TAG) and FORMAT/PL, which also seems to work okay. The last part copies over the index files to the new directory.

But when I go to merge these .vcf files:

bcftools merge combined_3ploid/*.gz -Ov -o combined_3ploid.vcf

I get the error: [E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?

Followed by The offending line was:

lots of "moved.vcf.gz ../nextflow/liti_3iploid/sarek_anno_3ploid_snpeff/annotation/./ABV/ABV.haplotypecaller_snpEff.ann.vcf.gz; Date=Thu Mar 7 15:56:42 2024"

and a few "G GTTA 3397.98 . AC=1;AF=0.333;AN=3;BaseQRankSum=-1.37;DP=333;FS=2.861;MLEAC=1;MLEAF=0.333;MQ=60;MQRankSum=0;QD=13.17;ReadPosRankSum=-1.148;SOR=0.822;ANN=GTTA|disruptive_inframe_insertion|MODERATE|URE2|YNL229C|transcript|YNL229C_mRNA|protein_coding|1/1|c.144_146dupTAA|p.Asn49dup|146/1065|146/1065|49/354||,GTTA|upstream_gene_variant|MODIFIER|PDR16|YNL231C|transcript|YNL231C_mRNA|protein_coding||c.-3016_-3014dupTAA|||||3014|,GTTA|upstream_gene_variant|MODIFIER|ELA1|YNL230C|transcript|YNL230C_mRNA|protein_coding||c.-1396_-1394dupTAA|||||1394|,GTTA|upstream_gene_variant|MODIFIER|YNL228W|YNL228W|transcript|YNL228W_mRNA|protein_coding||c.-590_-589insTTA|||||589|,GTTA|upstream_gene_variant|MODIFIER|YNL226W|YNL226W|transcript|YNL226W_mRNA|protein_coding||c.-2184_-2183insTTA|||||2183|,GTTA|downstream_gene_variant|MODIFIER|CSL4|YNL232W|transcript|YNL232W_mRNA|protein_coding||c.*4254_*4255insTTA|||||4255|,GTTA|downstream_gene_variant|MODIFIER|JJJ1|YNL227C|transcript|YNL227C_mRNA|protein_coding||c.*601_*603dupTAA|||||603|,GTTA|downstream_gene_variant|MODIFIER|CNM67|YNL225C|transcript|YNL225C_mRNA|protein_coding||c.*2666_*2668dupTAA|||||2668|,GTTA|downstream_gene_variant|MODIFIER|SQS1|YNL224C|transcript|YNL224C_mRNA|protein_coding||c.*4738_*4740dupTAA|||||4740| GT:AD:DP:GQ 0/0/1:165,93:258:99"

Besides the part about the .vcf being moved, the other "offending lines" seem quite random.

Here's my dirty laundry, and again I am pretty new to all of this - I'm sure there are many errors in my approach, so I look forward to your input, and/or ideas about alternative workarounds.

pd3 commented 6 months ago

Please check the manual page and the howtos to get more familiar with the program http://samtools.github.io/bcftools/bcftools.html http://samtools.github.io/bcftools/howtos/index.html

Regarding the ploidy, you can review the VCF header and see if there are any tags that are defined as Number=G, them remove them prior to merging with bcftools -x TAG_NAME.

For merging you'll need index files. Indexing (bcftools index) works only on VCF or BCF files compressed with bgzip (bcftools view -o out.vcf.gz in.vcf or bcftools view -o out.bcf in.vcf).

It seems all pieces are there, just need to connect them together. I am not going to review your chatGPT script, sorry.