walaj / VariantBam

Filtering and profiling of next-generation sequencing data using region-specific rules

VCF parsing error on extra linked regions with long ANN key/value in INFO field #10

Closed by chapmanb 7 years ago

chapmanb commented 7 years ago

Jeremiah; We've been using the awesome linked-region functionality in VariantBam to extract regions supporting structural variant breakends. We ran into an issue using this on larger regions with big ANN fields (from snpEff):

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
run.sh: line 5: 29375 Aborted                 (core dumped) variant Test1-sort.bam -l variant-ann-problem.vcf -o manta_mini.bam -b

Removing the ANN and SIMPLE_ANN fields from the VCF enables it to work cleanly. I put together a small test set that demonstrates the problem and the workaround (a test BAM file is included, but it doesn't matter which one you use):

https://s3.amazonaws.com/chapmanb/testcases/variantbam_ann.tar.gz
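
For readers who want to reproduce the workaround, here is a minimal sketch of dropping selected keys from the INFO column of a VCF record. It assumes tab-separated columns with INFO in the eighth column and `;`-separated entries, as in the VCF format. The helper names `split`, `join`, and `strip_info_keys` are hypothetical and are not part of VariantBam or snpEff.

```cpp
#include <set>
#include <string>
#include <vector>

// Split a string on a single-character delimiter, keeping empty fields.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = s.find(delim, start);
        if (end == std::string::npos) {
            out.push_back(s.substr(start));
            break;
        }
        out.push_back(s.substr(start, end - start));
        start = end + 1;
    }
    return out;
}

// Join fields back together with a single-character delimiter.
std::string join(const std::vector<std::string>& parts, char delim) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) out.push_back(delim);
        out += parts[i];
    }
    return out;
}

// Remove the named keys (hypothetical helper) from the INFO column of one VCF line.
// Header lines (starting with '#') and lines with fewer than 8 columns pass
// through unchanged. An INFO column left empty is written as ".".
std::string strip_info_keys(const std::string& line,
                            const std::set<std::string>& keys) {
    if (line.empty() || line[0] == '#') return line;
    std::vector<std::string> cols = split(line, '\t');
    if (cols.size() < 8) return line;

    std::vector<std::string> kept;
    for (const std::string& entry : split(cols[7], ';')) {
        // The key is everything before '='; flag entries have no '='.
        std::string key = entry.substr(0, entry.find('='));
        if (keys.count(key) == 0) kept.push_back(entry);
    }
    cols[7] = kept.empty() ? std::string(".") : join(kept, ';');
    return join(cols, '\t');
}
```

Calling `strip_info_keys(line, {"ANN", "SIMPLE_ANN"})` on each line shortens the records while keeping the remaining INFO entries, such as the structural variant fields, intact.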

Is this due to the ANN field size or some element of the value itself? Happy to try to pre-process (in ways other than removing all annotations) if it would help. Thanks much.

walaj commented 7 years ago

You picked up on it exactly: it's a buffer overflow problem in the VCF parsing. I had a limit of 4096 characters per VCF line, which is too low. I increased that 16-fold, and that indeed solved the problem. The parser only ever holds one line in memory anyway, just to parse out the chromosome and position, so a larger buffer is reasonable.
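
To illustrate the general failure mode, here is a minimal sketch and not VariantBam's actual code. It assumes the parser needs only the first two tab-separated columns (chromosome and position). `read_sites` reads each line into a `std::string`, so there is no per-line cap. `read_chunks_fixed` shows one plausible way a fixed-size buffer can go wrong: a line longer than the buffer is split, and the leftover text is then treated as a new record whose "position" is not a number, which makes `std::stoi` throw `std::invalid_argument`. All names here (`VcfSite`, `parse_site`, `read_sites`, `read_chunks_fixed`) are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Chromosome and position of one VCF record (hypothetical type).
struct VcfSite {
    std::string chrom;
    int pos;
};

// Parse CHROM and POS from the first two tab-separated columns.
// std::stoi throws std::invalid_argument if POS is not numeric.
VcfSite parse_site(const std::string& line) {
    std::string::size_type t1 = line.find('\t');
    if (t1 == std::string::npos)
        throw std::invalid_argument("VCF record has no tab-separated columns");
    std::string::size_type t2 = line.find('\t', t1 + 1);
    std::string pos_field = (t2 == std::string::npos)
                                ? line.substr(t1 + 1)
                                : line.substr(t1 + 1, t2 - t1 - 1);
    return VcfSite{line.substr(0, t1), std::stoi(pos_field)};
}

// Read records with std::getline into a std::string: no length cap,
// so an arbitrarily long INFO column cannot split a record.
std::vector<VcfSite> read_sites(std::istream& in) {
    std::vector<VcfSite> sites;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;  // skip blanks and headers
        sites.push_back(parse_site(line));
    }
    return sites;
}

// For contrast: read with a fixed-size char buffer of `cap` bytes.
// A line longer than cap - 1 characters comes back in several chunks.
std::vector<std::string> read_chunks_fixed(std::istream& in, std::size_t cap) {
    std::vector<char> buf(cap);
    std::vector<std::string> chunks;
    while (in.getline(buf.data(), static_cast<std::streamsize>(cap)) ||
           in.gcount() > 0) {
        chunks.emplace_back(buf.data());
        if (in.eof()) break;
        in.clear();  // failbit is set when the line did not fit; keep reading
    }
    return chunks;
}
```

Feeding a record with a long ANN value through `read_chunks_fixed` with a small `cap` yields more than one chunk, and passing the second chunk to `parse_site` throws. `read_sites` handles the same record regardless of its length, which is why reading into a growable string is a common alternative to raising a fixed limit.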

Thanks for including the minimal working example; it's immensely helpful when debugging. Good to know, too, that the linked-reads feature is useful for you. We've used it for extracting read pairs around SV breakpoints as well.

chapmanb commented 7 years ago

Jeremiah; Perfect, thank you -- that fix works great and all is working now with our real datasets. Thanks again for the quick turnaround.