Gathering code to run `-mm 5` or `-mm 6` for two protein structures

rbdavid commented 5 months ago

I am only focusing on the standard runs of -mm 5 and -mm 6 for two monomeric protein structures. I want to remove all extraneous parameters that are currently implemented in USalign's code base (i_opt, d_opt, etc etc etc). USalign_main is a branching function based on many of these parameters. Instead, I hope to pare down the code to a single path ... alright, not actually a single path ... that can handle the fNS and sNS alignment algorithms for two protein structures. This functional process can then be bound to python handles via ctypes or, more likely, pybind11 and implemented in a python-controlled workflow that incorporates USalign as a single step in the whole process.

No need to loop over all possible structures in a directory, all possible chains in said structures, etc.
No need to print or write results to standard out or storage.
No need to include code that handles various input parameters for non-standard behavior. (i_opt, d_opt, a_opt, etc)

All the while, use diagramming tools to map the processes being used in the code base.

rbdavid commented 5 months ago

Basic design diagram:

---
title: Basic design diagram
---
flowchart TD;
    id1(Python Input)-->id2(SOIalign_main);
    id2-->id3(fNS, -mm 5);
    id2-->id4(sNS, -mm 6);
    id3-->id5(Results);
    id4-->id5(Results);
    id5-->id6(Python Output);

rbdavid commented 5 months ago

Current (as of 2024-06-13) breakdown of the USalign path to run the -mm 5 or -mm 6:

---
title: Current USalign run process
---
flowchart TD;
    id1(CLI input
    ex: USalign pdb1 pdb2 -mm 6)-->id2(`main`, def in USalign.cpp
    parses so so many possible input parameters);
    subgraph USalign.cpp
    id2-->id3(`SOIalign`, def in USalign.cpp
    loop over mobile structures' chains and target structures' chains
    reading each file, gathering cartesian coords for all atoms, gathering sequence information);
    end
    id3-->|for each alignment pair|id4(`SOIalign_main`, def in SOIalign.h);
    subgraph `SOIalign_main`
    id4-->id5(`parameter_set4search`, def in param_set.h
    creates an initial set of d0 and Lnorm values to use for scoring);
    id5-->id6(`CPalign_main`* 
    do an initial alignment calculation that is sequence-order-dependent but can identify circular permutations);
    id6-->|`if mm_opt==6`|id7a(fill the `invmap0` and `fwdmap0` arrays with index values from the CP-aligned sequences );
    id6-->id7(`do_rotation`
    apply the `CPalign_main`'s final translation and rotation matrices to the mobile structure);
    id7a-->id7;
    id7-->id8('SOI_super2score`
    get scores based on this superimposition);
    id8-->id9(`SOI_iter`, def in SOIalign.h
    done twice, one mobile focused the other target focused.
    Up to 30 iterations of `soi_egs` enhanced greedy search for alignment optimization. A convergence value of 1e-6 is hard-coded within the function.);
    id9-->id10(`detailed_search_standard`, def in TMalign.h
    calls `TMscore8_search_standard` which is called elsewhere as well
    unsure how these different calls are being applied);
    id9-->|if `closeK_opt >=3`|id11(`get_SOI_initial_assign`, def in SOIalign.h);
    id11-->id12('SOI_assign2super`, def in SIOalign.h);
    id12-->id13('SOI_iter`, see above);
    id13-->id14('soi_egs`, def in SOIalign.h
    target focused?);
    id13-->|`if mm_opt==6`|id15(`NWDP_TM`, def in NW.h
    calls one form of the NWDP_TM functions defined);
    id15-->id14;
    id14-->id16(`SOI_assign2super` def in SOIalign.h);
    id16-->id17(`SOI_iter`, see above
    target focused);
    id17-->id10;
    id10-->id18(`do_rotation`, def in basic_fun.h
    apply the current translation and rotation matrices to mobile's coordinates);
    id18-->id19(`Kabsch`, def in Kabsch.h
    only used to calculate the new rmsd value);
    id19-->id20(`parameter_set4final`, def in param_set.h
    recalc the scoring parameters d0 and Lnorm);
    id20-->|do it all again with target focus|id18;
    id20-->id21(`if a_opt`; SKIP);
    id21-->id22(`if u_opt`; SKIP);
    id22-->id23(`if d_opt`; SKIP);
    id23-->id24(`do_rotation`, see above
    not sure why this is happening here...);
    id24-->id25(chunk of code to format sequence alignment strings
    and calculate the alignment seq identity);
    end
    id25-->id26(back to `SOIalign` to print results to stdout 
    and/or files on storage and clean up, SKIP);

rbdavid commented 5 months ago

Select set of parameters read into SOIalign_main as named in SOIalign in USalign.cpp:

closeK_opt, an int. Used to create xk array (commented as "k closest residues") of size (len*closeK_opt, 3). Also used in the one boolean test closeK_opt >= 3. Strangely, -closeK is parsed as a possible CLI input parameter but commented out in the help text. In the variable declaration of main, comments say "5 and 0 for -mm 5 and 6". So, if the user does not explicitly define a value via -closeK, then closeK_opt is set to 5 if mm_opt==5 and to 0 for all other values (importantly mm_opt=6).
i_opt, declared in main as an int. If CLI -i is defined, i_opt=1. If CLI -I is defined, i_opt=3. I honestly have no idea why this is fed through the various function calls... Lines 3283-3289 and 3297 in USalign.cpp suggests that if the mm_opt is given a value, then any calls to -i or -I will result in the USalign erroring out.
u_opt, declared in main as an int to be used as the specified length to normalize TMscore values by. I honestly have no idea why this is fed through the various function calls... Line 3298 in USalign.cpp suggests that if the mm_opt is given a value, then any calls using -u or -L will result in the USalign erroring out.
a_opt, declared in main as an int, used as a flag to normalize TMscore values by average length... Strange accepted values though based on line 2976, "-a must be -2, -1, 1, T, or F". The help text only describes -a as taking T or F values.
d_opt, declared in main as a bool, used as a flag to calc TMscore values by a user-specified d0 value (given to d0_scale double variable). Add d0_scale to variables that can be skipped for this implementation.
force_fast_opt, declared in main as a bool, used as a flag to run the fTM-align algorithm. Used in SOIalign_main, to set iteration_max to 2 if fast_opt != 0. Also, passed to the CPalign_main call. Also, if fast_opt != 0, then simplify_step is set to 40 for the first call of detailed_search_standard.
mol_vec1[chain_i]+mol_vec2[chain_j], declared in the SOIalign_main call and assigned to the mol_type variable name. Used in the CPalign_main call. Also, passed to the parameter_set4final calls.

I believe all, except for maybe closeK_opt and force_fast_opt, are either useless (i_opt, u_opt) or should be skipped (a_opt, d_opt, mol_type) for this python binding implementation. Like, there is a whole chunk of code within the SOIalign_main function testing if (u_opt) (lines 889 to 902). But u_opt will always equal zero if the code has already gotten to this point! Maybe the compiler will notice this and skip the code, but that doesn't change the fact that the code is there in the source. Its confusing as hell.

Edit: these parameters are passed into SOIalign_main because CPalign_main is called as part of that function's flow, within which TMalign_main is called which has code that handles the numerous states that the USalign parameters can create. So, I cannot remove these variables from the CPalign_main call but they don't need to be defined explicitly and especially don't need to be read in as arguments to the SOIalign_main function. Just set them to 0 or false or whatever is needed when the functions are called. Add comments to explain why they are set that way. This will make creating the python bindings easier since I can pare down the number of input parameters that need to be explicitly defined.

rbdavid commented 5 months ago

Parameters that maybe should be defined explicitly:

The iteration_max variable is hardcoded as either 2, if fast_opt !=0 , or 30 otherwise. Its unclear to me if the max value of 30 is limiting to the alignment quality. Additionally, the number of iterations may result in an unnecessary bottle neck to the code. Used in SOI_iter.
A convergence criteria for tmscore difference (fabs(tmscore_old-tmscore)) between iterations is hardcoded as 1e-6 in the SIO_iter. Related to above since the number of iterations actually performed is dependent on whether the convergence criteria is too small.

I think profiling of the original USalign code should be done to check whether default behavior is worthwhile changing.

rbdavid commented 5 months ago

Breakdown of the CPalign_main call:

---
title: CPalign_main, def in lines 3414-3598 in TMalign.h
---
flowchart TD;
    id1(Declare and fill variables for Circular Permutation
    mobile coord `xa` and seq `seqx` variables are concatenated twice to create `_cp` versions)-->id2(`TMalign_main` called, 
    uses `xa_cp` and `seqx_cp` variables instead of their singular representatives as well as:
    i_opt= 0 // don't use pre-defined alignment
    a_opt= false // don't use average length to normalize TMscores
    u_opt= true // do use user-defined length to normalize TMscore
    - in relation to u_opt -  Lnorm_ass= Lnorm_tmp - just the min of xlen,ylen - // normalize by min length of the mobile and target...
    d_opt= false // don't use d0_scale even though its read into the function
    fast_opt= true // run a quick alignment );
    id2-->id3(aligned sequences -- `A_cp` variants -- are used to detect ~semi~ sequence-order-independent alignments
    basically, a linear search along the duplicated mobile structure/sequence space allows for alignments with target that don't follow the linear search that `TMalign_main` is usually stuck in);
    id3-->id4(find the circular permutation point that maximizes the number of aligned residues
    save the index for this residue in `cp_point`);
    id4-->id5(`TMalign_main` called to run standard fTM-align, 
    uses `xa` and `seqx` variables instead of the `_cp` variants as well as variables defined above.);
    id5-->id6(check which `TMalign_main` call returned a greater number of aligned residues
    if the 2nd performed better, then `cp_point=0`);
    id6-->id7(`TMalign_main` 
    but this time a full run with user-defined parameters instead of hard coded values as above);
    id6-->|`if cp_point!=0`|id8(refill the `xa_cp`, `seqx_cp`, and `secx_cp` arrays with values relevant to the largest circular permutation point in the original `TMalign_main` call.);
   id8-->id9(`TMalign_main` called, 
   uses the `_cp` variants altered as above. same parameters as previous calls.);
   id9-->id10(once more check if the non-CP `TMalign_main` call aligned fewer residues than this latest `TMalign_main` call
   if it didn't, then go back to the original `xa`, `seqx`, `secx` array values);
   id10-->id7;
   id7-->|`if cp_point>0`|id11(fix sequence alignment strings to account for the circular permutation point that was identified);
   id7-->|else|id12(aligned sequences are unaffected from the last `TMalign_main` call);
   id11-->id13(clean up and return the sequence position where the largest circular permutation happens);
   id12-->id13;

rbdavid / USalign

Gathering code to run `-mm 5` or `-mm 6` for two protein structures #1