hall-lab / speedseq

A flexible framework for rapid genome analysis and interpretation
MIT License
311 stars 116 forks source link

Question about speedseq var #85

Closed JiongZ closed 7 years ago

JiongZ commented 8 years ago

When I use speedseq var on my bam file of a whole genome, it seems going to take forever to finish. I stopped it after 43 hrs, a vcf.gz file was made and the vcf format looks decent. Below is the code I used:

sudo sh -c " speedseq var -t 24 -o $working_dir/WGC083268D.vcf.gz $ref_dir/human_g1k_v37.fasta $working_dir/WGC083268D.bam > $working_dir/var.log" &

I also ran it with "-w $annotations/ceph18.b37.include.2014-01-15.bed". And this also not finished yet...

At first it ran with 24 threads, however, after few hours only 2 threads were occupied. Could this be the problem? How should I fix it?

Thanks

cc2qe commented 8 years ago

That sounds like it could be a problem. A few questions:

I'm wondering whether it's not appropriately calling "parallel." Thanks for posting this

Side note: you should use -o $working_dir/WGC083268D rather than -o $working_dir/WGC083268D.vcf.gz, since it should be a prefix

JiongZ commented 8 years ago
  1. The average depth is about 25
  2. The temporary directory contains multiple .vcf files (one for each chrome)
  3. And here is the output: Sourcing executables from /home/light/bin/speedseq/bin/speedseq.config ... Calling variants...

    create temporary directory
    
    /home/light/bin/speedseq//bin/sambamba view -H /media/light/wgs/process/WGC083268D_combined.bam | grep "^@SQ" | cut -f 2- | awk '{ gsub("^SN:","",$1); gsub("^LN:","",$2); print $1"\t0\t"$2; }' > WGC083268D.vcf.gz.nLQKVcMfuVxn/windows.bed
    
    /home/light/bin/speedseq//bin/freebayes \
       -f /media/light/sample/ref/human_g1k_v37.fasta \
       --region $chrom:$start..$end \
       /media/light/wgs/process/WGC083268D_combined.bam \
       --min-repeat-entropy 1 \
       | /home/light/bin/speedseq//bin/vawk --header '$6>=1 && I$RPR>0 && I$RPL>0' \
       > WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.$i.vcf
    
    cat WGC083268D.vcf.gz.nLQKVcMfuVxn/var_command.txt | /home/light/bin/speedseq//bin/parallel -j 24
    
    grep "^#" WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.1:0..249250621.vcf > WGC083268D.vcf.gz.nLQKVcMfuVxn/header.txt
    
       cat WGC083268D.vcf.gz.nLQKVcMfuVxn/WGC083268D.vcf.gz.":$start..$end".vcf | grep -v "^#" \
           | sort -k1,1 -k2,2n | cat WGC083268D.vcf.gz.nLQKVcMfuVxn/header.txt - \
           | /home/light/bin/speedseq//bin/bgzip -c > /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

    /home/light/bin/speedseq//bin/tabix -f -p vcf /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

    rm -r WGC083268D.vcf.gz.nLQKVcMfuVxn

    Done

Thanks

cc2qe commented 8 years ago

Ok, based on that output you showed, it looked like speedseq successfully completed. The data should be located at /media/light/wgs/process/WGC083268D.vcf.gz.vcf.gz

As for the earlier question, by default, speedseq parallelizes by chromosome, which have different lengths and therefore different processing times. Some will finish faster than others but if there are 2 threads active after 43 hours then there should be 2 .vcf files in the temporary directory that continue to be updated before they are merged into the final file.

JiongZ commented 8 years ago

Thanks a lot for your help.

There is another problem. In the ID column of my vcf file, no identifier was given to any variants.I have annotated the vcf file with both VEP and annovar, but only annovar can provide rs number to the variants.

My question is: Is there any method to add the rs number to the vcf file?

cc2qe commented 8 years ago

SpeedSeq doesn't currently have a function to annotate variants with rs_ids, but I think it can be done in GATK. Perhaps annovar can do it too

arq5x commented 8 years ago

check out vcfanno (https://github.com/brentp/vcfanno) from @brentp.