Open GoogleCodeExporter opened 9 years ago
Hi all-
Just a comment on the floating exceptions we've gotten from Samtools. I think
in this case, and perhaps in a few of the other instances when the error showed
up as well, it may be related to the header lines in our reference genome FASTA
(.fa) file.
See this thread on the samtools mailing list for a description of the problem:
http://sourceforge.net/mailarchive/forum.php?thread_name=Pine.LNX.4.64.110929082
6400.21023%40defender.gpcc.itd.umich.edu&forum_name=samtools-help
After reading this thread, I made a quick edit to the headers of
/projects1/Reference_genomes/hg18_ensembl/gatk-canonical/gatk-hg18_ensembl.fa
to eliminate the redundant text, then tried a few of the problematic samtools
commands again. I didn't get any floating point exceptions this time. Federica
and I will continue testing samtools in various places of the pipelines to make
sure we've truly eradicated the issue, but wanted to share our finding.
Also, the edited/corrected reference FASTA file I mentioned is here:
/projects1/Reference_genomes/hg18_ensembl/gatk-canonical/gatk-hg18_ensembl.corre
cted.fa
Original comment by apcexcha...@gmail.com
on 5 Oct 2011 at 8:18
Thanks for the info. I'm still waiting to be able to test this workflow on my
fgene account.
Looking at the samtools documentation, it looks like the samtools pileup
command is deprecated, replaced by mpileup. Also, their documented protocol for
generating a vcf from a bam is very different:
samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - >
var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
I got that from http://samtools.sourceforge.net/mpileup.shtml
whereas the current workflow just uses:
samtools pileup -vcf ref.fa subject.bam > subject.raw.vcf
Should we switch from pileup to mpileup and change the protocol to the new one
to avoid using deprecated tools? Assuming i can find and build bcftool and
vcfutils, i could could together a wrapper script for this new protocol pretty
quickly.
Any thoughts?
Alex
Original comment by alexge...@gmail.com
on 6 Oct 2011 at 6:29
Yes, I think this may be really helpful...the conversion in vcf with the olp
pileup command is still running (1 day and more) and this timing is not
feasible.
On the annovar website they are not up to date, they are reporting still the
pileup command..I am writing to the ANNOVAR developer to know if ANNOVAR is
compatible with the vcf file created by the new features in samtools.
Even if it is not directly compatible there is a feature in ANNOVAR that
converts vcf files in VCF4 files
(http://www.openbioinformatics.org/annovar/annovar_input.html)...so, in summary
i think that using the newer options is a good idea.
Federica
Original comment by federica...@gmail.com
on 6 Oct 2011 at 6:46
Yes, is compatible so I think you can go ahead. I will update the documentation
accordingly.
federica
Original comment by federica...@gmail.com
on 6 Oct 2011 at 6:56
Sorry it seems like we go through this step with every workflow, but can you
point me to some relevant data as input for the first samtools mpileup step? it
takes two alignment files and a reference fasta file, as well as an integer for
the maximum read depth (I'm using 100).
I've put together the workflow, but it fails, and I want to make sure the
problem isn't in the data i'm using.
Alex
Original comment by alexge...@gmail.com
on 10 Oct 2011 at 9:06
Hi Alex,
sorry for my late response but I am back now from Montreal. No problem, you are
right.
It takes only TWO files in total:
-the alignment file: .bam file
(example:/projects2/USC/canonical-test-data/hg18_ensembl-bwa/mini-bam/s_1.sorted
.bam)
-and the reference genome in fasta format:
( example:
/projects1/Reference_genomes/hg18_ensembl/gatk-canonical/gatk-hg18_ensembl.fa)
Let me know how it goes and if you have any other questions :)
fed
Original comment by federica...@gmail.com
on 17 Oct 2011 at 11:43
Ok, I've run it with this data and it fails on the convert2annovar.pl step:
NOTICE: the default --format argument is set as 'pileup'
NOTICE: the default --snpqual argument for pileup format is set as 20
NOTICE: Column 6-9 in output are heterozygosity status, SNP quality, total
reads, reads with mutation
Error: invalid record found in pileupfile
/projects/pipelineCache/pipeline/2011October18_12h11m34s468ms/SamToolsBamtoVCF_1
.OutputVCF-1.vcf (at least 10 fields expected): <##fileformat=VCFv4.1>
I've attached the workflow if you want to try it. Any ideas?
Alex
Original comment by alexge...@gmail.com
on 18 Oct 2011 at 7:17
Attachments:
We are looking at that right now. Seems that this mpileup gives some issues...
Original comment by federica...@gmail.com
on 21 Oct 2011 at 12:16
Original issue reported on code.google.com by
federica...@gmail.com
on 4 Oct 2011 at 7:59Attachments: