Open rbdavid opened 5 months ago
Basic design diagram:
---
title: Basic design diagram
---
flowchart TD;
id1(Python Input)-->id2(SOIalign_main);
id2-->id3(fNS, -mm 5);
id2-->id4(sNS, -mm 6);
id3-->id5(Results);
id4-->id5(Results);
id5-->id6(Python Output);
Current (as of 2024-06-13) breakdown of the USalign path to run the -mm 5
or -mm 6
:
---
title: Current USalign run process
---
flowchart TD;
id1(CLI input
ex: USalign pdb1 pdb2 -mm 6)-->id2(`main`, def in USalign.cpp
parses so so many possible input parameters);
subgraph USalign.cpp
id2-->id3(`SOIalign`, def in USalign.cpp
loop over mobile structures' chains and target structures' chains
reading each file, gathering cartesian coords for all atoms, gathering sequence information);
end
id3-->|for each alignment pair|id4(`SOIalign_main`, def in SOIalign.h);
subgraph `SOIalign_main`
id4-->id5(`parameter_set4search`, def in param_set.h
creates an initial set of d0 and Lnorm values to use for scoring);
id5-->id6(`CPalign_main`*
do an initial alignment calculation that is sequence-order-dependent but can identify circular permutations);
id6-->|`if mm_opt==6`|id7a(fill the `invmap0` and `fwdmap0` arrays with index values from the CP-aligned sequences );
id6-->id7(`do_rotation`
apply the `CPalign_main`'s final translation and rotation matrices to the mobile structure);
id7a-->id7;
id7-->id8('SOI_super2score`
get scores based on this superimposition);
id8-->id9(`SOI_iter`, def in SOIalign.h
done twice, one mobile focused the other target focused.
Up to 30 iterations of `soi_egs` enhanced greedy search for alignment optimization. A convergence value of 1e-6 is hard-coded within the function.);
id9-->id10(`detailed_search_standard`, def in TMalign.h
calls `TMscore8_search_standard` which is called elsewhere as well
unsure how these different calls are being applied);
id9-->|if `closeK_opt >=3`|id11(`get_SOI_initial_assign`, def in SOIalign.h);
id11-->id12('SOI_assign2super`, def in SIOalign.h);
id12-->id13('SOI_iter`, see above);
id13-->id14('soi_egs`, def in SOIalign.h
target focused?);
id13-->|`if mm_opt==6`|id15(`NWDP_TM`, def in NW.h
calls one form of the NWDP_TM functions defined);
id15-->id14;
id14-->id16(`SOI_assign2super` def in SOIalign.h);
id16-->id17(`SOI_iter`, see above
target focused);
id17-->id10;
id10-->id18(`do_rotation`, def in basic_fun.h
apply the current translation and rotation matrices to mobile's coordinates);
id18-->id19(`Kabsch`, def in Kabsch.h
only used to calculate the new rmsd value);
id19-->id20(`parameter_set4final`, def in param_set.h
recalc the scoring parameters d0 and Lnorm);
id20-->|do it all again with target focus|id18;
id20-->id21(`if a_opt`; SKIP);
id21-->id22(`if u_opt`; SKIP);
id22-->id23(`if d_opt`; SKIP);
id23-->id24(`do_rotation`, see above
not sure why this is happening here...);
id24-->id25(chunk of code to format sequence alignment strings
and calculate the alignment seq identity);
end
id25-->id26(back to `SOIalign` to print results to stdout
and/or files on storage and clean up, SKIP);
Select set of parameters read into SOIalign_main
as named in SOIalign
in USalign.cpp:
closeK_opt
, an int. Used to create xk
array (commented as "k closest residues") of size (len*closeK_opt, 3). Also used in the one boolean test closeK_opt >= 3
. Strangely, -closeK
is parsed as a possible CLI input parameter but commented out in the help text. In the variable declaration of main
, comments say "5 and 0 for -mm 5 and 6". So, if the user does not explicitly define a value via -closeK
, then closeK_opt
is set to 5 if mm_opt==5 and to 0 for all other values (importantly mm_opt=6). i_opt
, declared in main
as an int. If CLI -i
is defined, i_opt=1
. If CLI -I
is defined, i_opt=3
. I honestly have no idea why this is fed through the various function calls... Lines 3283-3289 and 3297 in USalign.cpp suggests that if the mm_opt is given a value, then any calls to -i
or -I
will result in the USalign erroring out. u_opt
, declared in main
as an int to be used as the specified length to normalize TMscore values by. I honestly have no idea why this is fed through the various function calls... Line 3298 in USalign.cpp suggests that if the mm_opt is given a value, then any calls using -u
or -L
will result in the USalign erroring out.a_opt
, declared in main
as an int, used as a flag to normalize TMscore values by average length... Strange accepted values though based on line 2976, "-a must be -2, -1, 1, T, or F". The help text only describes -a
as taking T or F values. d_opt
, declared in main
as a bool, used as a flag to calc TMscore values by a user-specified d0 value (given to d0_scale
double variable). Add d0_scale
to variables that can be skipped for this implementation.force_fast_opt
, declared in main
as a bool, used as a flag to run the fTM-align algorithm. Used in SOIalign_main
, to set iteration_max to 2 if fast_opt != 0. Also, passed to the CPalign_main
call. Also, if fast_opt != 0, then simplify_step
is set to 40 for the first call of detailed_search_standard
. mol_vec1[chain_i]+mol_vec2[chain_j]
, declared in the SOIalign_main
call and assigned to the mol_type
variable name. Used in the CPalign_main
call. Also, passed to the parameter_set4final
calls. I believe all, except for maybe closeK_opt
and force_fast_opt
, are either useless (i_opt
, u_opt
) or should be skipped (a_opt
, d_opt
, mol_type
) for this python binding implementation. Like, there is a whole chunk of code within the SOIalign_main
function testing if (u_opt)
(lines 889 to 902). But u_opt
will always equal zero if the code has already gotten to this point! Maybe the compiler will notice this and skip the code, but that doesn't change the fact that the code is there in the source. Its confusing as hell.
Edit: these parameters are passed into SOIalign_main
because CPalign_main
is called as part of that function's flow, within which TMalign_main
is called which has code that handles the numerous states that the USalign parameters can create. So, I cannot remove these variables from the CPalign_main
call but they don't need to be defined explicitly and especially don't need to be read in as arguments to the SOIalign_main
function. Just set them to 0
or false
or whatever is needed when the functions are called. Add comments to explain why they are set that way. This will make creating the python bindings easier since I can pare down the number of input parameters that need to be explicitly defined.
Parameters that maybe should be defined explicitly:
iteration_max
variable is hardcoded as either 2, if fast_opt !=0
, or 30 otherwise. Its unclear to me if the max value of 30 is limiting to the alignment quality. Additionally, the number of iterations may result in an unnecessary bottle neck to the code. Used in SOI_iter
.fabs(tmscore_old-tmscore)
) between iterations is hardcoded as 1e-6 in the SIO_iter
. Related to above since the number of iterations actually performed is dependent on whether the convergence criteria is too small. I think profiling of the original USalign code should be done to check whether default behavior is worthwhile changing.
Breakdown of the CPalign_main
call:
---
title: CPalign_main, def in lines 3414-3598 in TMalign.h
---
flowchart TD;
id1(Declare and fill variables for Circular Permutation
mobile coord `xa` and seq `seqx` variables are concatenated twice to create `_cp` versions)-->id2(`TMalign_main` called,
uses `xa_cp` and `seqx_cp` variables instead of their singular representatives as well as:
i_opt= 0 // don't use pre-defined alignment
a_opt= false // don't use average length to normalize TMscores
u_opt= true // do use user-defined length to normalize TMscore
- in relation to u_opt - Lnorm_ass= Lnorm_tmp - just the min of xlen,ylen - // normalize by min length of the mobile and target...
d_opt= false // don't use d0_scale even though its read into the function
fast_opt= true // run a quick alignment );
id2-->id3(aligned sequences -- `A_cp` variants -- are used to detect ~semi~ sequence-order-independent alignments
basically, a linear search along the duplicated mobile structure/sequence space allows for alignments with target that don't follow the linear search that `TMalign_main` is usually stuck in);
id3-->id4(find the circular permutation point that maximizes the number of aligned residues
save the index for this residue in `cp_point`);
id4-->id5(`TMalign_main` called to run standard fTM-align,
uses `xa` and `seqx` variables instead of the `_cp` variants as well as variables defined above.);
id5-->id6(check which `TMalign_main` call returned a greater number of aligned residues
if the 2nd performed better, then `cp_point=0`);
id6-->id7(`TMalign_main`
but this time a full run with user-defined parameters instead of hard coded values as above);
id6-->|`if cp_point!=0`|id8(refill the `xa_cp`, `seqx_cp`, and `secx_cp` arrays with values relevant to the largest circular permutation point in the original `TMalign_main` call.);
id8-->id9(`TMalign_main` called,
uses the `_cp` variants altered as above. same parameters as previous calls.);
id9-->id10(once more check if the non-CP `TMalign_main` call aligned fewer residues than this latest `TMalign_main` call
if it didn't, then go back to the original `xa`, `seqx`, `secx` array values);
id10-->id7;
id7-->|`if cp_point>0`|id11(fix sequence alignment strings to account for the circular permutation point that was identified);
id7-->|else|id12(aligned sequences are unaffected from the last `TMalign_main` call);
id11-->id13(clean up and return the sequence position where the largest circular permutation happens);
id12-->id13;
I am only focusing on the standard runs of
-mm 5
and-mm 6
for two monomeric protein structures. I want to remove all extraneous parameters that are currently implemented in USalign's code base (i_opt, d_opt, etc etc etc).USalign_main
is a branching function based on many of these parameters. Instead, I hope to pare down the code to a single path ... alright, not actually a single path ... that can handle the fNS and sNS alignment algorithms for two protein structures. This functional process can then be bound to python handles via ctypes or, more likely, pybind11 and implemented in a python-controlled workflow that incorporates USalign as a single step in the whole process.All the while, use diagramming tools to map the processes being used in the code base.