statisticalbiotechnology / maracluster

Matthew The's implementation of MaRaCluster
Apache License 2.0
11 stars 3 forks source link

Infinite while loop prevents maracluster from progressing #17

Closed TimothyOlsson closed 5 years ago

TimothyOlsson commented 5 years ago

Hi Matthew,

I noticed a bug where a large clusterSize in BatchSpectrumClusters.cpp prevents maracluster from progressing.

I ran a data set from cyanobacteria, 10 samples with each file about 300 mb in size. After running quandenser on the data set, I noticed that it froze when it ran maracluster. I debugged the code and found out that it froze on this specific line:

Line 209: BatchSpectrumClusters.cpp while (clusterSize >>= 1 && clusterSizeBin < 9) ++clusterSizeBin;

When the variable "clusterSize" reaches above and equal to 512, the while loop gets stuck forever.

Copy the following code into this website and run it to verify the problem: https://repl.it/languages/cpp

#include <iostream>
#include <chrono>
#include <thread>

int main() {
  std::cout << "Started with clusterSize 511. This will work" << std::endl;  // Prints out start
  size_t clusterSizeBin = 0u;  // Exactly as in BatchSpectrumClusters.cpp
  size_t clusterSize = 511;  // Any value over 512 will make the while loop get stuck. This will be fine
  while (clusterSize >>= 1 && clusterSizeBin < 9) {
    ++clusterSizeBin;
    // The part above should be exactly the same as:
    // while (clusterSize >>= 1 && clusterSizeBin < 9) ++clusterSizeBin;
    std::this_thread::sleep_for(std::chrono::seconds(1));  // wait 1 sec
    std::cout << "clusterSizeBin = " << clusterSizeBin << std::endl;
    std::cout << "clusterSize = " << clusterSize << std::endl;
    // Shift bits right is almost as x/2
  }
  std::cout << "Done" << std::endl;

  std::cout << "Started with clusterSize 512. This will get stuck forever" << std::endl; 
  clusterSizeBin = 0u;  // Restart
  clusterSize = 512;  // This will get stuck forever
  while (clusterSize >>= 1 && clusterSizeBin < 9) {
    ++clusterSizeBin;
    std::this_thread::sleep_for(std::chrono::seconds(1));  // wait 1 sec to not crash
    std::cout << "clusterSizeBin = " << clusterSizeBin << std::endl;
    std::cout << "clusterSize = " << clusterSize << std::endl;
  }
  std::cout << "Done" << std::endl;  // We will never reach this
}

If I have understood what the problematic line is trying to accomplish aka shift clusterSize bits right by one until clusterSize == 1 or until clusterSizeBin is above or equal to 9.

In that case, I suggest rewriting the line to something similar to this:

while (clusterSize >>= 1) {
  ++clusterSizeBin;
  if (clusterSizeBin >= 9) {
    break;
  }
}

This will prevent the bug from happening. I tested it briefly with my data set and it passed through maracluster cleanly.

Best regards,

Timothy Bergström

TimothyOlsson commented 5 years ago

Note: This affects this tree used for quandenser: https://github.com/statisticalbiotechnology/maracluster/tree/f4d48a96b80b45cb511af7c854415f049057ed51

In maracluster 1.0, the file has been renamed to "SpectrumClusters.cpp". The bug should be valid there as well.

MatthewThe commented 5 years ago

Thanks, you're completely right. Could you submit a pull request to the master branch with your fix?

MatthewThe commented 5 years ago

Actually, the problem was simply some missing parentheses, so I fixed it myself.

TimothyOlsson commented 5 years ago

Thanks, you're completely right. Could you submit a pull request to the master branch with your fix?

Since Quandenser uses a detached head, I added a branch of maracluster which adds the fix. https://github.com/statisticalbiotechnology/maracluster/tree/quandenser_branch

Perhaps it is possible to use the branch as a submodule for quandenser instead of the detached head to incorporate the fix. However, I'm not sure how to do it in github.