pritykinlab / guidescan-cli

A gRNA database generation tool.
http://www.guidescan.com

Bug fix for specificity calculation tweaks introduced in PR 14 #18

Closed vineetbansal closed 1 year ago

vineetbansal commented 1 year ago

In PR #14, I introduced a tweak in the way we calculate specificity in the case of no matches at all. This is related to the discussion we had on Slack (where you were seeing discrepancies between your script vs. what guidescan enumerate gave you):

Ok so the Python code is doing 1 / (1 + cfd_sum) in case there are no matches, and 1 / cfd_sum if there is at least one match.
Let me put in the exact same logic in C++

During an unrelated investigation this week, I discovered some discrepancies between the specificity values that are written in the case of a .csv output vs a .sam output, and noticed 2 bugs that I introduced in that PR:

  1. perfect_match = match.mismatches == 0; is not the correct thing to do. We want to determine whether there is any perfect match at all, but a plain assignment overwrites the flag on every subsequent match comparison, so a later match with mismatches resets it to false.
  2. The specificity tweak only took effect when generating the .csv output, not the .sam output. Whatever logic was added to the .csv generation should also have been added to the .sam generation.

This PR fixes both these issues.

It was also getting a bit hairy to debug .sam files (by doing a simple md5sum on what an expected SAM file is vs. what is generated, for example), because we're doing (only for .sam files, not for .csv files):

std::shuffle(off_targets[i].begin(), off_targets[i].end(), std::mt19937{std::random_device{}()});

I've removed this line, so we now get deterministic .sam files as long as the inputs haven't changed. This is helpful for my debugging (especially when comparing off-target hex strings to what they should be). If the shuffling was added for some specific reason (or if its removal should be moved to its own issue/PR), I'm happy to drop that change from this PR.
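If the shuffle does turn out to be needed, one possible alternative (not something this PR does) would be to seed the generator with a fixed value instead of std::random_device, so repeated runs produce the same order. A rough sketch, where `shuffled_with_seed` and its `seed` parameter are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: shuffle a copy of the items using a caller-supplied seed.
std::vector<int> shuffled_with_seed(std::vector<int> items, std::uint32_t seed) {
    std::mt19937 rng{seed};  // fixed seed instead of std::random_device{}()
    std::shuffle(items.begin(), items.end(), rng);
    return items;
}
```

The same seed gives the same order on a given standard library, but the exact permutation is not specified by the C++ standard, so an md5sum comparison could still differ across compilers or library versions.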