Closed IsaacTollestrup closed 4 months ago
Please check the manual page and the howtos to get more familiar with the program http://samtools.github.io/bcftools/bcftools.html http://samtools.github.io/bcftools/howtos/index.html
Regarding the ploidy, you can review the VCF header and see if there are any tags that are defined as Number=G
, them remove them prior to merging with bcftools -x TAG_NAME
.
For merging you'll need index files. Indexing (bcftools index
) works only on VCF or BCF files compressed with bgzip (bcftools view -o out.vcf.gz in.vcf
or bcftools view -o out.bcf in.vcf
).
It seems all pieces are there, just need to connect them together. I am not going to review your chatGPT script, sorry.
I am a beginner.
I want to merge my annotated .vcf files to look at alll SNPs in Variant Effect Predictor. Some of my .vcf files are ploidy 3, 4 and 5. They were separated out before variant calling.
bcftools merge
doesn't work when ploidy is greater than 2. So I'm trying to separate out all haplotypes with the workaround on the FAQ page:"Note that the program only works with ploidy 1 or 2, so if defined as Number=G and the ploidy is bigger, the program will fail. If the tag is not important for your analysis, a quick and dirty workaround is to remove the tag from the VCF completely (
bcftools annotate -x TAG
)."I was hoping to use this with my list file (is there a way to do this?) but I couldn't find a way. Instead I used ChatGPT to help write a script (I'm sorry):
This achieves the removal of, in this case,
INFO
- which works. I have tried other options likeTAG
(get error:No matching tag in -x TAG
) andFORMAT/PL
, which also seems to work okay. The last part copies over the index files to the new directory.But when I go to merge these .vcf files:
bcftools merge combined_3ploid/*.gz -Ov -o combined_3ploid.vcf
I get the error:
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
Followed by
The offending line was:
lots of
"moved.vcf.gz ../nextflow/liti_3iploid/sarek_anno_3ploid_snpeff/annotation/./ABV/ABV.haplotypecaller_snpEff.ann.vcf.gz; Date=Thu Mar 7 15:56:42 2024"
and a few
"G GTTA 3397.98 . AC=1;AF=0.333;AN=3;BaseQRankSum=-1.37;DP=333;FS=2.861;MLEAC=1;MLEAF=0.333;MQ=60;MQRankSum=0;QD=13.17;ReadPosRankSum=-1.148;SOR=0.822;ANN=GTTA|disruptive_inframe_insertion|MODERATE|URE2|YNL229C|transcript|YNL229C_mRNA|protein_coding|1/1|c.144_146dupTAA|p.Asn49dup|146/1065|146/1065|49/354||,GTTA|upstream_gene_variant|MODIFIER|PDR16|YNL231C|transcript|YNL231C_mRNA|protein_coding||c.-3016_-3014dupTAA|||||3014|,GTTA|upstream_gene_variant|MODIFIER|ELA1|YNL230C|transcript|YNL230C_mRNA|protein_coding||c.-1396_-1394dupTAA|||||1394|,GTTA|upstream_gene_variant|MODIFIER|YNL228W|YNL228W|transcript|YNL228W_mRNA|protein_coding||c.-590_-589insTTA|||||589|,GTTA|upstream_gene_variant|MODIFIER|YNL226W|YNL226W|transcript|YNL226W_mRNA|protein_coding||c.-2184_-2183insTTA|||||2183|,GTTA|downstream_gene_variant|MODIFIER|CSL4|YNL232W|transcript|YNL232W_mRNA|protein_coding||c.*4254_*4255insTTA|||||4255|,GTTA|downstream_gene_variant|MODIFIER|JJJ1|YNL227C|transcript|YNL227C_mRNA|protein_coding||c.*601_*603dupTAA|||||603|,GTTA|downstream_gene_variant|MODIFIER|CNM67|YNL225C|transcript|YNL225C_mRNA|protein_coding||c.*2666_*2668dupTAA|||||2668|,GTTA|downstream_gene_variant|MODIFIER|SQS1|YNL224C|transcript|YNL224C_mRNA|protein_coding||c.*4738_*4740dupTAA|||||4740| GT:AD:DP:GQ 0/0/1:165,93:258:99"
Besides the part about the .vcf being moved, the other "offending lines" seem quite random.
Here's my dirty laundry, and again I am pretty new to all of this - I'm sure there are many errors in my approach, so I look forward to your input, and/or ideas about alternative workarounds.