Question about speedseq var

JiongZ commented 8 years ago

When I use speedseq var on my bam file of a whole genome, it seems going to take forever to finish. I stopped it after 43 hrs, a vcf.gz file was made and the vcf format looks decent. Below is the code I used:

sudo sh -c " speedseq var -t 24 -o $working_dir/WGC083268D.vcf.gz $ref_dir/human_g1k_v37.fasta $working_dir/WGC083268D.bam > $working_dir/var.log" &

I also ran it with "-w $annotations/ceph18.b37.include.2014-01-15.bed". And this also not finished yet...

At first it ran with 24 threads, however, after few hours only 2 threads were occupied. Could this be the problem? How should I fix it?

Thanks

cc2qe commented 8 years ago

That sounds like it could be a problem. A few questions:

What is the depth of your BAM file?
Does speedseq create a temporary directory and does it contain multiple .vcf files (one for each chrom by default, or one for each window with ceph18.b37.include.2014-01-15.bed?
Could you run it with the "-v" (verbose) option and show the output here?

I'm wondering whether it's not appropriately calling "parallel." Thanks for posting this

Side note: you should use -o $working_dir/WGC083268D rather than -o $working_dir/WGC083268D.vcf.gz, since it should be a prefix

JiongZ commented 8 years ago

The average depth is about 25
The temporary directory contains multiple .vcf files (one for each chrome)

And here is the output: Sourcing executables from /home/light/bin/speedseq/bin/speedseq.config ... Calling variants...

create temporary directory

/home/light/bin/speedseq//bin/sambamba view -H /media/light/wgs/process/WGC083268D_combined.bam | grep "^@SQ" | cut -f 2- | awk '{ gsub("^SN:","",$1); gsub("^LN:","",$2); print $1"\t0\t"$2; }' > WGC083268D.vcf.gz.nLQKVcMfuVxn/windows.bed

/home/light/bin/speedseq//bin/freebayes \
   -f /media/light/sample/ref/human_g1k_v37.fasta \
   --region $chrom:$start..$end \
   /media/light/wgs/process/WGC083268D_combined.bam \
   --min-repeat-entropy 1 \
   | /home/light/bin/speedseq//bin/vawk --header '$6>=1 && I$RPR>0 && I$RPL>0' \
   > WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.$i.vcf

cat WGC083268D.vcf.gz.nLQKVcMfuVxn/var_command.txt | /home/light/bin/speedseq//bin/parallel -j 24

grep "^#" WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.1:0..249250621.vcf > WGC083268D.vcf.gz.nLQKVcMfuVxn/header.txt

   cat WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.":$start..$end".vcf | grep -v "^#" \
       | sort -k1,1 -k2,2n | cat WGC083268D.vcf.gz.nLQKVcMfuVxn/header.txt - \
       | /home/light/bin/speedseq//bin/bgzip -c > /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

/home/light/bin/speedseq//bin/tabix -f -p vcf /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

rm -r WGC083268D.vcf.gz.nLQKVcMfuVxn

Done

Thanks

cc2qe commented 8 years ago

Ok, based on that output you showed, it looked like speedseq successfully completed. The data should be located at /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

As for the earlier question, by default, speedseq parallelizes by chromosome, which have different lengths and therefore different processing times. Some will finish faster than others but if there are 2 threads active after 43 hours then there should be 2 .vcf files in the temporary directory that continue to be updated before they are merged into the final file.

JiongZ commented 8 years ago

Thanks a lot for your help.

There is another problem. In the ID column of my vcf file, no identifier was given to any variants.I have annotated the vcf file with both VEP and annovar, but only annovar can provide rs number to the variants.

My question is: Is there any method to add the rs number to the vcf file?

cc2qe commented 8 years ago

SpeedSeq doesn't currently have a function to annotate variants with rs_ids, but I think it can be done in GATK. Perhaps annovar can do it too

arq5x commented 8 years ago

check out vcfanno (https://github.com/brentp/vcfanno) from @brentp.

hall-lab / speedseq

Question about speedseq var #85