waynebhayes / SANA

Simulated Annealing Network Aligner

Core scores are now printed to .naf instead of console #132

Closed shu-g closed 4 years ago

shu-g commented 4 years ago

Added a saveCoreScores() method to the Report class, which SANA::run() calls if CORES is defined. Made TrimCoreScores public and static. Added pwpBad (pairwise) and 1 - pwpBad weights in addition to the original weights. Modified .gitignore to include *.naf.

To test, uncomment -DCORES in Makefile. Have tried to stick to style guide / surrounding code style.

shu-g commented 4 years ago

Sorry, don't have much experience, thank you for the feedback!

-Yes, will add comment in Report.hpp.

-I agree, will encapsulate them in a struct coreScoreData (or alignmentFreq?).

-pBad is, I think, SANA-specific; LOW_PBAD_LIMIT is a lower limit on the mean pBad value, so I am not sure the variable name should include "core".

-Will modify SANA::TrimCoreScores to SANA::trimCoreScores.

-Moving out the core-score related code will look something like this, after including coreScore.hpp:

In SANA::SANA()
    #ifdef CORES
    initCSData(int n1, int n2);
    #endif
In SANA::performChange()
    #ifdef CORES
    updCSDataChange(uint source, uint betterHole, double meanPBad, double pBad);
    #endif
And so on.

Is this fine, should I try something else, or should I leave the code untouched?

-Getting the coreScoreData to the upper level, so that Report is not called from SANA.cpp, will require modifying the Alignment class to include this struct. Apart from the added complexity, it will also result in a loss of generality, as no alignment method other than SANA produces this struct. I could not come up with a better approach.
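One way the encapsulation discussed in the list above might look is sketched below. This is only an illustration under assumptions: the struct name CoreScoreData, its members, and the update() and frequency() methods are hypothetical and are not taken from the SANA source. It also takes a single pBad value per update, leaving out the meanPBad argument and the LOW_PBAD_LIMIT handling mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: none of these names come from the SANA source.
struct CoreScoreData {
    std::size_t n1 = 0;                     // nodes in the first network
    std::size_t n2 = 0;                     // nodes in the second network
    std::vector<unsigned long> samples;     // samples recorded per source node
    std::vector<unsigned long> pairFreq;    // n1 x n2 counts, row-major
    std::vector<double> weightPBad;         // same layout, weighted by pBad
    std::vector<double> weightOneMinusPBad; // same layout, weighted by 1 - pBad

    CoreScoreData(std::size_t nodes1, std::size_t nodes2)
        : n1(nodes1), n2(nodes2),
          samples(nodes1, 0),
          pairFreq(nodes1 * nodes2, 0),
          weightPBad(nodes1 * nodes2, 0.0),
          weightOneMinusPBad(nodes1 * nodes2, 0.0) {}

    // Flattened index of the (source, hole) pair.
    std::size_t index(std::size_t source, std::size_t hole) const {
        assert(source < n1 && hole < n2);
        return source * n2 + hole;
    }

    // Record that betterHole was the better target for source in this iteration.
    void update(std::size_t source, std::size_t betterHole, double pBad) {
        const std::size_t i = index(source, betterHole);
        ++samples[source];
        ++pairFreq[i];
        weightPBad[i] += pBad;
        weightOneMinusPBad[i] += 1.0 - pBad;
    }

    // Fraction of samples for source in which hole was the better target.
    double frequency(std::size_t source, std::size_t hole) const {
        if (samples[source] == 0) return 0.0;
        return static_cast<double>(pairFreq[index(source, hole)]) / samples[source];
    }
};
```

With something like this, the #ifdef CORES blocks in SANA::SANA() and SANA::performChange() would reduce to constructing the struct and calling update(). The dense n1 x n2 layout is chosen here only for brevity; a sparse container may be more appropriate for large networks.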

waynebhayes commented 4 years ago

The purpose of the core-scores related code seems not to be documented much anywhere, but the following caught my attention:

Ten years ago we used the Hungarian algorithm (H-GRAAL) to enumerate all possible optimal alignments when optimizing the graphlet-based local measure. We found that some pairs of nodes remained constant (always aligned) across all optimal alignments; we dubbed such aligned pairs the "core alignment", and found that the core alignment had better Resnik scores than nodes outside the core.

This idea can be extended to SANA: if we run SANA multiple times on the same network pair, some aligned pairs will appear more frequently than others in the final output alignments. In the stochastic analogy to H-GRAAL's result, we find that pairs appearing more frequently in the final output alignments tend to have higher Resnik scores. Each aligned pair is thus assigned a "Network Alignment Frequency", or NAF, and NAF correlates with functional similarity.

This is great, but it's expensive to run SANA multiple times (say 100) just to get the "core frequencies". The CORES code is intended to get a similar measure in just one run, by taking statistics at every single iteration: which pair of aligned nodes was better, the old or the new? I added the code (basically in C) over a year ago, and it seemed to work well. Shubham has now independently verified it on more recent networks, and in order to reduce technical debt, I asked him to make it more C++-friendly (e.g., my core frequencies went to stdout, which was a terrible place to put several hundred thousand lines of output). Now the output goes to "sana.naf" (or whatever "-o" specifies).
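As a rough illustration of the multi-run NAF idea described above, the sketch below counts how often each aligned pair appears across the final alignments of several runs. It is a minimal sketch under assumptions: computeNAF, Alignment, and NodePair are hypothetical names, an alignment is assumed to be a vector mapping node i of the first network to a node of the second, and nothing here reflects how SANA itself stores alignments or writes the .naf file.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Assumed representation: alignment[i] is the node of the second network
// that node i of the first network is aligned to.
using Alignment = std::vector<unsigned>;
using NodePair = std::pair<unsigned, unsigned>;

// Hypothetical helper: fraction of runs in which each pair is aligned.
std::map<NodePair, double> computeNAF(const std::vector<Alignment>& runs) {
    std::map<NodePair, double> naf;
    if (runs.empty()) return naf;
    for (const Alignment& a : runs) {
        // All runs are assumed to align the same first network.
        assert(a.size() == runs.front().size());
        for (std::size_t i = 0; i < a.size(); ++i) {
            naf[{static_cast<unsigned>(i), a[i]}] += 1.0;  // count one appearance
        }
    }
    // Turn appearance counts into frequencies in [0, 1].
    for (auto& entry : naf) entry.second /= static_cast<double>(runs.size());
    return naf;
}
```

A pair with frequency 1.0 appears in every run, which is the stochastic counterpart of belonging to the "core alignment". The CORES code aims to approximate this kind of frequency from per-iteration statistics within a single run instead.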

But all your comments are valid. Moving it from C to C++, and the output from stdout to sana.naf, is a great improvement, but your suggestions would reduce the technical debt even further.