mengyao / Complete-Striped-Smith-Waterman-Library

294 stars 112 forks source link

segfaults with linear gap penalties #76

Closed stuckyb closed 2 years ago

stuckyb commented 4 years ago

When aligning a large number of sequencing reads to a genomic reference, I get occasional segmentation faults when attempting to use linear gap penalties (i.e., opening penalty == extension penalty). If I set extension penalty < opening penalty, the problem disappears. I've also confirmed that certain sequences in the reads file always trigger the bug; i.e., it is not caused by sequence order. If you'd like to pursue this, let me know and I can provide you with sequence data that triggers the segfaults.

ksahlin commented 2 years ago

Thanks for the great library, I'm hoping to use it in one of my projects. To do that, It would be good with a fix to this issue. I'm seeing segfault not only for linear penalties but for the default penalties, that from reading the code I understand are 2,2,3,1.

Also, I should note that this is not a rare case, I observed it in all of my 15 datasets (containing 10M short reads each). I provide an example below.

When running example.cpp with the query and ref below I see the segmentation fault. Interestingly when I set the reference size a bit lower, it doesn't segfault (although the alignment is not good).

While in this case, I could tweak the score to remove the segfault, other cases might still fail, it would be good if this bug could be fixed.

  const string ref   = "ACCATTTTTGGTGACAGGGAGGATGACTCCCACCCCTTCCCTTCTCCTGCCAAGGGAGCATAGAAAAGGCTCAGCACTCTCTTGTTAGTCGGGGGACCACTCTAACCCTCCTGCTTGGTTCCAAACTCCATCCGTTCTCACCTGCACTTCTGCAAATAGCACCCTAACCAGACGCCTCACCTCCCTACAAACCATTCTTCCTCCTACAACCAGGACAATTTCCTGAAGGCAGAAGTCAGCACATATATATGTATATATATATATTTTCTTTTTTTTTTTTTTCTCTAGAGGACCGAATAACAAATATTTTAGGCTTTGCAAGACAAGAGGCAAAAATCAAAGATATAATGTAGGCACTTTCAAAACAAAAAAGAAAAAAATCAAAAAATGTTTATTGATAGAATTTAAAATATATTAATAATTGAGTACAATTTTGTCATAATACATGTAAATTTAATAAAAAGGAAGTAATTGTTTTGGAAGGAATAACATTTTGCTTAACTGGGATTCAAGTTAGTGTTTCCAATCCTCAGAACAATTGCAAATGTTAATTTGTTAATGCTGATCTGAAATGGAATTTTATGTTTCATCTTTGAAATTGTTCCTTGCACAATAGTTGACACTGCCAGTGACTACCCATGTACTTACATATATTGAAGCTCAAAATTATTTAAAGTTTTCAAGAGTTGTTACTATGCATTTTACAGTCATTGAGCCATTGAAAATTTGGGGTAACATGTCAGAAACAAGGTATAAAAACTGAAGTAATCACTTTGCACACCTGCTGATATAAACAGAAAATAAACGTGCTATTTTCAAAGTATGCTCAGTAAAGTGTCTTTGATAGAAAATAGGCTCAAAAGATCCTATTTGCAGAAACAATCAAAACATCAAAAATGGAAT";
  const string query = "GAATGATCTCAAATCCTCCACCGATTATTACTTCTCACAGAGTGCGTGAGCGTGTTAGGATGTTAACGTCTGTCTGTTTAGAGCACTGCATATGGGGTGCAAGTTTCTCAGCGTAATCACCTGAGATGCTTCCACCTACTGCAGAAACGTGCTATTTTCAAAGTATGCTCAGTAAGGTGTCTTTGATAGAAAATCGGCTC";

Segfaults:

  // Declares a default Aligner
  StripedSmithWaterman::Aligner aligner;
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size(), filter, &alignment, maskLen);

Doesn't segfault but poor alignment:

  // Declares a default Aligner
  StripedSmithWaterman::Aligner aligner;
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size()**-50**, filter, &alignment, maskLen);

Doesn't segfault good alignment (tweaking score):

  int match_score = 1;
  int mismatch_penalty = 4;
  int gap_opening_penalty = 6;
  int gap_extending_penalty = 1;
  StripedSmithWaterman::Aligner aligner(match_score, mismatch_penalty, gap_opening_penalty, gap_extending_penalty);
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size(), filter, &alignment, maskLen);
ksahlin commented 2 years ago

I did some tinkering and found that setting const int8_t score_size = 1; instead of 2 on two places in ssw_cpp.cpp fixes all my crashed instances.

So it seems that somehow the "I'm unsure about the score size"-flag of 2 doesn't do the job but instead segfaults.

mengyao commented 2 years ago

Thanks for the great library, I'm hoping to use it in one of my projects. To do that, It would be good with a fix to this issue. I'm seeing segfault not only for linear penalties but for the default penalties, that from reading the code I understand are 2,2,3,1.

Also, I should note that this is not a rare case, I observed it in all of my 15 datasets (containing 10M short reads each). I provide an example below.

When running example.cpp with the query and ref below I see the segmentation fault. Interestingly when I set the reference size a bit lower, it doesn't segfault (although the alignment is not good).

While in this case, I could tweak the score to remove the segfault, other cases might still fail, it would be good if this bug could be fixed.

  const string ref   = "ACCATTTTTGGTGACAGGGAGGATGACTCCCACCCCTTCCCTTCTCCTGCCAAGGGAGCATAGAAAAGGCTCAGCACTCTCTTGTTAGTCGGGGGACCACTCTAACCCTCCTGCTTGGTTCCAAACTCCATCCGTTCTCACCTGCACTTCTGCAAATAGCACCCTAACCAGACGCCTCACCTCCCTACAAACCATTCTTCCTCCTACAACCAGGACAATTTCCTGAAGGCAGAAGTCAGCACATATATATGTATATATATATATTTTCTTTTTTTTTTTTTTCTCTAGAGGACCGAATAACAAATATTTTAGGCTTTGCAAGACAAGAGGCAAAAATCAAAGATATAATGTAGGCACTTTCAAAACAAAAAAGAAAAAAATCAAAAAATGTTTATTGATAGAATTTAAAATATATTAATAATTGAGTACAATTTTGTCATAATACATGTAAATTTAATAAAAAGGAAGTAATTGTTTTGGAAGGAATAACATTTTGCTTAACTGGGATTCAAGTTAGTGTTTCCAATCCTCAGAACAATTGCAAATGTTAATTTGTTAATGCTGATCTGAAATGGAATTTTATGTTTCATCTTTGAAATTGTTCCTTGCACAATAGTTGACACTGCCAGTGACTACCCATGTACTTACATATATTGAAGCTCAAAATTATTTAAAGTTTTCAAGAGTTGTTACTATGCATTTTACAGTCATTGAGCCATTGAAAATTTGGGGTAACATGTCAGAAACAAGGTATAAAAACTGAAGTAATCACTTTGCACACCTGCTGATATAAACAGAAAATAAACGTGCTATTTTCAAAGTATGCTCAGTAAAGTGTCTTTGATAGAAAATAGGCTCAAAAGATCCTATTTGCAGAAACAATCAAAACATCAAAAATGGAAT";
  const string query = "GAATGATCTCAAATCCTCCACCGATTATTACTTCTCACAGAGTGCGTGAGCGTGTTAGGATGTTAACGTCTGTCTGTTTAGAGCACTGCATATGGGGTGCAAGTTTCTCAGCGTAATCACCTGAGATGCTTCCACCTACTGCAGAAACGTGCTATTTTCAAAGTATGCTCAGTAAGGTGTCTTTGATAGAAAATCGGCTC";

Segfaults:

  // Declares a default Aligner
  StripedSmithWaterman::Aligner aligner;
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size(), filter, &alignment, maskLen);

Doesn't segfault but poor alignment:

  // Declares a default Aligner
  StripedSmithWaterman::Aligner aligner;
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size()**-50**, filter, &alignment, maskLen);

Doesn't segfault good alignment (tweaking score):

  int match_score = 1;
  int mismatch_penalty = 4;
  int gap_opening_penalty = 6;
  int gap_extending_penalty = 1;
  StripedSmithWaterman::Aligner aligner(match_score, mismatch_penalty, gap_opening_penalty, gap_extending_penalty);
  // Declares a default filter
  StripedSmithWaterman::Filter filter;
  // Declares an alignment that stores the result
  StripedSmithWaterman::Alignment alignment;
  aligner.Align(query.c_str(), ref.c_str(), ref.size(), filter, &alignment, maskLen);

Thank you so much for these example sequences. I finally got time to work on this error now. Please let me know, if you can find even shorter sequences that can trigger this memory error. The shorter the easier for me to debug.

Mengyao