hall-lab / speedseq

A flexible framework for rapid genome analysis and interpretation
MIT License
315 stars 117 forks source link

freebayes runaway process #86

Closed jgbaum closed 8 years ago

jgbaum commented 8 years ago

I'm running into an issue when my input fastq files are extremely large (330G normal, 866G tumor) that produce correspondingly large BAM files (85G normal, 160G tumor). The speedseq align step seems to complete without issue. However, when speedseq somatic is run on these bam files, output seems to be generated for a few hours and then one of the freebayes processes keeps computing for days without printing any output. The process is using CPU, so it doesn't appear to be hung, but I'm concerned that it will never complete.

The particular command that is running, as I can see from a 'ps' is:

/mnt/BioApps/speedseq/20160708//bin/freebayes -f /share/apps/bwa/bwa_references/GRCh37.fa --pooled-discrete --genotype-qualities --min-repeat-entropy 1 --min-alternate-fraction 0.05 --min-alternate-count 2 --region 1:0..249250621 /mnt/BioAdHoc/Groups/Peters/Schoenberger_exome_seq/pipeline_runs/Hu_001_RMLO_other/002-exome_run_speedseq/Hu_001-RMLO_normal.bam /mnt/BioAdHoc/Groups/Peters/Schoenberger_exome_seq/pipeline_runs/Hu_001_RMLO_other/002-exome_run_speedseq/Hu_001-RMLO_tumor.bam | awk -v ONLY_SOMATIC="0" -v MINQUAL="1" -v SSC_THRES="0" 'BEGIN {NORMAL=10; TUMOR=11; GL_IDX=0;} { if ($0~"^#") { print ; next; } if (! GL_IDX) { split($9,fmt,":") ; for (i=1;i<=length(fmt);++i) { if (fmt[i]=="GL") GL_IDX=i } } split($NORMAL,N,":"); split(N[GL_IDX],NGL,","); split($TUMOR,T,":"); split(T[GL_IDX],TGL,","); LOD_NORM=NGL[1]-NGL[2]; LOD_TUMOR_HET=TGL[2]-TGL[1]; LOD_TUMOR_HOM=TGL[3]-TGL[1]; if (LOD_TUMOR_HET > LOD_TUMOR_HOM) { LOD_TUMOR=LOD_TUMOR_HET } else { LOD_TUMOR=LOD_TUMOR_HOM } DQUAL=LOD_TUMOR+LOD_NORM; if (DQUAL>=SSC_THRES && $NORMAL~"^0/0") { $7="PASS" ; $8="SSC="DQUAL";"$8 ; print } else if (!ONLY_SOMATIC && $6>=MINQUAL && $10~"^0/0" && ! match($11,"^0/0")) { $8="SSC="DQUAL";"$8 ; print } }' OFS="?" > Hu_001_RMLO_other.QtRi7pNjAEAB/Hu_001_RMLO_other.1:0..249250621.vcf

This corresponds to the first command in the 'var_commands.txt' file. It seems that all of the other commands have already completed and produced output in the temp directory. The first time we tried to run this sample through, the job ran for a little over 2 weeks without producing any output, all the while using CPU. However, our cluster went down and we were forced to restart. The job has now been running for a couple of days and this step is once again not producing any output.

Any suggestions on pushing this job through? Perhaps somehow breaking up the freebayes job into smaller ones? Or will I need to downsample the reads? Something else entirely?

Any input is appreciated. I'd like to add that I've been a very happy user of speedseq for over a year and have run many samples successfully through the pipeline in question...although none were quite this massive!

Happy to any information you might require to help resolve. Thank you.

jgbaum commented 8 years ago

Any help on this would be much appreciated as it still seems to be running and is yet unresolved. Thanks much!

cc2qe commented 8 years ago

Perhaps try running with the -w annotations/ceph18.b37.include.2014-01-15.bed flag in speedseq somatic. This breaks the human genome (build 37) into regions of approximately even depth.

Note that it excludes regions that are known to have excessive coverage in a panel of normal individuals (presumably due to reference assembly artifacts). If you must include all genomic regions, you can create a BED file of smaller genomic regions for parallel somatic variant calling.

I'm not sure why only chromosome 1 would have an issue with your current scheme, but breaking up should help identify which genomic region is causing problems.

jgbaum commented 8 years ago

This fixed our issue! Thank you very much!