waynebhayes / SANA

Simulating Annealing Network Aligner
25 stars 40 forks source link

Fixed pBad bug "Too few Samples for Linear Regression" #139

Closed Reevest29 closed 3 years ago

Reevest29 commented 3 years ago

Implementation Details

waynebhayes commented 3 years ago

I agree with both of Nil's comments.

On Tue, Dec 22, 2020 at 6:17 PM Nil Mamano notifications@github.com wrote:

@nmamano commented on this pull request.

The changes look good. I have 2 comments.

1.

I think there are 2 different constants, which both happen to be 4. One constant is how many samples the Linear Regression utility class needs in order to work correctly. You already named this accurately: "LinearRegression::MIN_NUM_SAMPLES_REQUIRED" (we usually use caps for constants). The other constant is how many extra samples the LinearRegressionVintage method takes near each "knee" of the sigmoid function. Currently, we take 4 at each end. We should call this something like "LinearRegressionVintage::EXTRA_SAMPLES". Even though both constants are 4, it seems to me like the two could change independently/for different reasons, so the changes in one could inadvertently affect the other. 2.

In the line "if (pBadMap.size() < min_samples) {" it is not clear how this could ever be true, since it follows a loop where we add that many elements to the pBadMap map. In fact, the code shows that pBadMap should have at least size 8. Do we understand why this "if" is necessary? If yes, it would be nice to add a comment explaining why. If not, we should make sure that the preceding code is working as intended. Maybe the binary searches are getting stuck at a repeated value and repeatedly overwriting the same key in the pBadMap? It's just a theory, but if we don't understand the underlying issue, the new code will prevent LinearRegressionVintage from crashing but it might still not work properly. (also, if LinearRegressionVintage is sampling the same temperature multiple times and that is intended, then pBadMap should be a multimap instead of a map).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nmamano/SANA/pull/139#pullrequestreview-557269449, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE2SJDCA5II6AE43SM7XS33SWDOPJANCNFSM4VF5LXOQ .

-- Wayne B. Hayes, Ph.D. Associate Professor of Computer Science, University of California, Irvine “The car stopped on Thirty-fourth Street, in front of the escalators leading down to the station. When the car door opened and he stepped out, somebody exclaimed "Holy shit! It's Mister Fucking Rogers!" This was not a bad thing, however, because he was in New York, and in New York it's not an insult to be called Mister Fucking Anything.”—Tom Junod, “Can you say... Hero”? Esquire Magazine, November 1998 https://www.esquire.com/entertainment/tv/a27134/can-you-say-hero-esq1198/.

Reevest29 commented 3 years ago

For the first issue, I see the mistake now. It should be fixed.

Regarding the second comment:

From what I understand, the crash we were observing stems from the bounds of the temperature search range being set to equivalent values. As a result, binary search is thrown off since it's searching an interval of length 0. This causes the for-loops for increasing sample density to unintentionally overwrite the same value repeatedly (like you theorized), leaving the pBadMap with only 1 sample.

Currently, I have it so that if for any reason the pBadMap has only 1 value we increase the number of samples so that linear regression won’t fail. Instead, I could make it so that when the binary search starts, if the upper and lower bounds are equivalent it skips the sample density loops (since those would fail anyway) and adds samples around the current temperature.

Let me know my thinking was incorrect, or if anything else needs changing.

The changes look good. I have 2 comments.

  1. I think there are 2 different constants, which both happen to be 4. One constant is how many samples the Linear Regression utility class needs in order to work correctly. You already named this accurately: "LinearRegression::MIN_NUM_SAMPLES_REQUIRED" (we usually use caps for constants). The other constant is how many extra samples the LinearRegressionVintage method takes near each "knee" of the sigmoid function. Currently, we take 4 at each end. We should call this something like "LinearRegressionVintage::EXTRA_SAMPLES". Even though both constants are 4, it seems to me like the two could change independently/for different reasons, so the changes in one could inadvertently affect the other.
  2. In the line "if (pBadMap.size() < min_samples) {" it is not clear how this could ever be true, since it follows a loop where we add that many elements to the pBadMap map. In fact, the code shows that pBadMap should have at least size 8. Do we understand why this "if" is necessary? If yes, it would be nice to add a comment explaining why. If not, we should make sure that the preceding code is working as intended. Maybe the binary searches are getting stuck at a repeated value and repeatedly overwriting the same key in the pBadMap? It's just a theory, but if we don't understand the underlying issue, the new code will prevent LinearRegressionVintage from crashing but it might still not work properly. (also, if LinearRegressionVintage is sampling the same temperature multiple times and that is intended, then pBadMap should be a multimap instead of a map).
nmamano commented 3 years ago

Looks good to me! I think we should still print a warning to the terminal when the bounds of the temperature search range are set to the same value since I think it's not supposed to happen.

Reevest29 commented 3 years ago

Looks good to me! I think we should still print a warning to the terminal when the bounds of the temperature search range are set to the same value since I think it's not supposed to happen.

Added!