walaj / svaba

Structural variation and indel detection by local assembly

Challenge of running svaba with FFPE tumor/normal pairs #38

Open ipstone opened 6 years ago

ipstone commented 6 years ago

Svaba is awesome! For frozen WGS samples, we got a whole genome done within 4-5 hours with a low CPU/memory footprint.

However, as we run tumor-normal pairs on FFPE samples, it is taking an extra long time and not finishing. Is there a way to handle FFPE tumor/normal pairs with svaba? Suggestions are really appreciated.

Here is some of the log information I have; the first number is the line number in sample.log:

930147 Ran GL000192.1:    441,001-    466,001 | T:  2856 N:   127 C:    94 | R: 49% M:  0% T:  5% C: 12% A: 31% P:  0% | CPU: 10282m53s Wall: 1326m25s
930148 ...BFC attempted correct 5314 train: 25402 kcov: 80.874680 kmer: 17
930149 ...assembled 227 contigs for c_84_465501_490501
930150 ...BFC attempted correct 6433 train: 33261 kcov: 100.696404 kmer: 17
930151 writing contigs etc on thread 47362852042496 with limit hit of 7520
930152 Ran GL000192.1:    465,501-    490,501 | T:  4418 N:   895 C:   159 | R: 51% M:  0% T:  6% C: 13% A: 27% P:  0% | CPU: 10283m01s Wall: 1326m27s
930153 ...assembled 268 contigs for c_84_490001_515001
930154 ...assembled 307 contigs for c_84_514501_539501
930155 writing contigs etc on thread 47362854143744 with limit hit of 7295
930156 Ran GL000192.1:    490,001-    515,001 | T:  4677 N:   637 C:   192 | R: 55% M:  0% T:  5% C: 12% A: 25% P:  0% | CPU: 10283m10s Wall: 1326m29s
930157 writing contigs etc on thread 47362839435008 with limit hit of 10967
930158 Ran GL000192.1:    514,501-    539,501 | T:  5463 N:   970 C:   203 | R: 60% M:  0% T:  4% C:  8% A: 25% P:  0% | CPU: 10283m11s Wall: 1326m29s
930159 writing contigs etc on thread 47362847840000 with limit hit of 20518
930160 ...assembled 1199 contigs for c_83_171501_196501
930161 Ran GL000225.1:    196,001-    211,173 | T:  7289 N:  4674 C:   388 | R: 53% M:  0% T:  7% C: 10% A: 27% P:  0% | CPU: 10283m14s Wall: 1326m31s
930162 writing contigs etc on thread 47362841536256 with limit hit of 51970
930163 Ran GL000225.1:    171,501-    196,501 | T: 12692 N:  8323 C:   732 | R: 58% M:  0% T:  6% C: 15% A: 18% P:  0% | CPU: 10283m24s Wall: 1326m41s

The error log from the job (the one that was killed earlier) looks like this:

...
21858     stopping read lookup at 83:85,871(+) in window 83:73,501-98,501(*) with 50,001 weird reads. Limit: 50,000
21859     stopping read lookup at 83:50,907(+) in window 83:49,001-74,001(*) with 50,001 weird reads. Limit: 50,000
21860     stopping read lookup at 83:131,670(-) in window 83:122,501-147,501(*) with 50,001 weird reads. Limit: 50,000
21861     stopping read lookup at 83:102,433(-) in window 83:98,001-123,001(*) with 50,001 weird reads. Limit: 50,000
21862     stopping read lookup at 83:126,115(+) in window 83:122,501-147,501(*) with 50,001 weird reads. Limit: 50,000
21863     stopping read lookup at 83:82,785(+) in window 83:73,501-98,501(*) with 50,001 weird reads. Limit: 50,000
walaj commented 6 years ago

Thanks! Glad it's usually fast. When it's not, my first suspicion is a large number of discordant reads, which causes svaba to do a lot of mate-region look-ups. It could also be that there are a lot of low-quality reads with enough mismatches to cause heavy clipping or unmapped reads, in which case the assembler takes more time because it is assembling so many reads.
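If you want to check which of those it is before re-running, one rough diagnostic is to count discordant and clipped reads directly with samtools. This is just a sketch; tumor.bam stands in for your actual FFPE tumor BAM:

    # primary, mapped, non-supplementary reads (denominator)
    samtools view -c -F 0x904 tumor.bam
    # same, but excluding reads flagged properly paired (0x2) -- a proxy for discordant reads
    samtools view -c -F 0x906 tumor.bam
    # reads with soft-clips in the CIGAR string
    samtools view -F 0x904 tumor.bam | awk '$6 ~ /S/' | wc -l

If the discordant or clipped fraction is much higher than in your frozen samples, that points to the lookup/assembly load described above.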

You might try raising the minimum number of discordant reads needed to trigger a mate-region lookup. This can be set with -L (introduced less than a year ago, so you'll need a newer version). Setting -L 100000 effectively turns off mate-region lookups, which isn't terrible; it just decreases the number of reads that get assembled together for true variants. You could also lower the maximum number of reads that can be assembled at once (-x). The default is 50,000, which is a very, very high read count, and hitting that limit usually means you're in territory that is very difficult to map.
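To make that concrete, a run with those knobs adjusted might look like the sketch below. The file names, analysis ID (-a), thread count (-p), and the -x value are illustrative placeholders, not recommendations:

    # sketch: effectively disable mate-region lookups (-L very high) and
    # cap the reads assembled per window below the 50,000 default (-x)
    svaba run -t ffpe_tumor.bam -n ffpe_normal.bam \
        -G ref.fa -a ffpe_case -p 8 \
        -L 100000 -x 20000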

Also, the regions you're showing me are in the non-chromosomal regions of the reference; I usually just skip these. You can pass -k a BED file defining just the mappable, chromosomal regions of the genome.
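In case it's useful, one way to build such a BED from the reference's .fai index is sketched below. The chromosome-name regex assumes b37-style names (your GL000192.1 contigs suggest b37) and would need a chr prefix for hg19/hg38:

    # sketch: keep only the primary chromosomes 1-22, X, Y from the FASTA index
    awk 'BEGIN{OFS="\t"} $1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/ {print $1, 0, $2}' \
        ref.fa.fai > chromosomal.bed
    # then pass it to svaba with: -k chromosomal.bed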