rbdavid / USalign

Universal Structure Alignment of Monomeric and Complex Structure of Nucleic Acids and Proteins
https://zhanggroup.org/US-align/
Other
0 stars 0 forks source link

Gathering code to run `-mm 5` or `-mm 6` for two protein structures #1

Open rbdavid opened 5 months ago

rbdavid commented 5 months ago

I am only focusing on the standard runs of -mm 5 and -mm 6 for two monomeric protein structures. I want to remove all extraneous parameters that are currently implemented in USalign's code base (i_opt, d_opt, etc etc etc). USalign_main is a branching function based on many of these parameters. Instead, I hope to pare down the code to a single path ... alright, not actually a single path ... that can handle the fNS and sNS alignment algorithms for two protein structures. This functional process can then be bound to python handles via ctypes or, more likely, pybind11 and implemented in a python-controlled workflow that incorporates USalign as a single step in the whole process.

All the while, use diagramming tools to map the processes being used in the code base.

rbdavid commented 5 months ago

Basic design diagram:

---
title: Basic design diagram
---
flowchart TD;
    id1(Python Input)-->id2(SOIalign_main);
    id2-->id3(fNS, -mm 5);
    id2-->id4(sNS, -mm 6);
    id3-->id5(Results);
    id4-->id5(Results);
    id5-->id6(Python Output);
rbdavid commented 5 months ago

Current (as of 2024-06-13) breakdown of the USalign path to run the -mm 5 or -mm 6:

---
title: Current USalign run process
---
flowchart TD;
    id1(CLI input
    ex: USalign pdb1 pdb2 -mm 6)-->id2(`main`, def in USalign.cpp
    parses so so many possible input parameters);
    subgraph USalign.cpp
    id2-->id3(`SOIalign`, def in USalign.cpp
    loop over mobile structures' chains and target structures' chains
    reading each file, gathering cartesian coords for all atoms, gathering sequence information);
    end
    id3-->|for each alignment pair|id4(`SOIalign_main`, def in SOIalign.h);
    subgraph `SOIalign_main`
    id4-->id5(`parameter_set4search`, def in param_set.h
    creates an initial set of d0 and Lnorm values to use for scoring);
    id5-->id6(`CPalign_main`* 
    do an initial alignment calculation that is sequence-order-dependent but can identify circular permutations);
    id6-->|`if mm_opt==6`|id7a(fill the `invmap0` and `fwdmap0` arrays with index values from the CP-aligned sequences );
    id6-->id7(`do_rotation`
    apply the `CPalign_main`'s final translation and rotation matrices to the mobile structure);
    id7a-->id7;
    id7-->id8('SOI_super2score`
    get scores based on this superimposition);
    id8-->id9(`SOI_iter`, def in SOIalign.h
    done twice, one mobile focused the other target focused.
    Up to 30 iterations of `soi_egs` enhanced greedy search for alignment optimization. A convergence value of 1e-6 is hard-coded within the function.);
    id9-->id10(`detailed_search_standard`, def in TMalign.h
    calls `TMscore8_search_standard` which is called elsewhere as well
    unsure how these different calls are being applied);
    id9-->|if `closeK_opt >=3`|id11(`get_SOI_initial_assign`, def in SOIalign.h);
    id11-->id12('SOI_assign2super`, def in SIOalign.h);
    id12-->id13('SOI_iter`, see above);
    id13-->id14('soi_egs`, def in SOIalign.h
    target focused?);
    id13-->|`if mm_opt==6`|id15(`NWDP_TM`, def in NW.h
    calls one form of the NWDP_TM functions defined);
    id15-->id14;
    id14-->id16(`SOI_assign2super` def in SOIalign.h);
    id16-->id17(`SOI_iter`, see above
    target focused);
    id17-->id10;
    id10-->id18(`do_rotation`, def in basic_fun.h
    apply the current translation and rotation matrices to mobile's coordinates);
    id18-->id19(`Kabsch`, def in Kabsch.h
    only used to calculate the new rmsd value);
    id19-->id20(`parameter_set4final`, def in param_set.h
    recalc the scoring parameters d0 and Lnorm);
    id20-->|do it all again with target focus|id18;
    id20-->id21(`if a_opt`; SKIP);
    id21-->id22(`if u_opt`; SKIP);
    id22-->id23(`if d_opt`; SKIP);
    id23-->id24(`do_rotation`, see above
    not sure why this is happening here...);
    id24-->id25(chunk of code to format sequence alignment strings
    and calculate the alignment seq identity);
    end
    id25-->id26(back to `SOIalign` to print results to stdout 
    and/or files on storage and clean up, SKIP);
rbdavid commented 5 months ago

Select set of parameters read into SOIalign_main as named in SOIalign in USalign.cpp:

I believe all, except for maybe closeK_opt and force_fast_opt, are either useless (i_opt, u_opt) or should be skipped (a_opt, d_opt, mol_type) for this python binding implementation. Like, there is a whole chunk of code within the SOIalign_main function testing if (u_opt) (lines 889 to 902). But u_opt will always equal zero if the code has already gotten to this point! Maybe the compiler will notice this and skip the code, but that doesn't change the fact that the code is there in the source. Its confusing as hell.

Edit: these parameters are passed into SOIalign_main because CPalign_main is called as part of that function's flow, within which TMalign_main is called which has code that handles the numerous states that the USalign parameters can create. So, I cannot remove these variables from the CPalign_main call but they don't need to be defined explicitly and especially don't need to be read in as arguments to the SOIalign_main function. Just set them to 0 or false or whatever is needed when the functions are called. Add comments to explain why they are set that way. This will make creating the python bindings easier since I can pare down the number of input parameters that need to be explicitly defined.

rbdavid commented 5 months ago

Parameters that maybe should be defined explicitly:

I think profiling of the original USalign code should be done to check whether default behavior is worthwhile changing.

rbdavid commented 5 months ago

Breakdown of the CPalign_main call:

---
title: CPalign_main, def in lines 3414-3598 in TMalign.h
---
flowchart TD;
    id1(Declare and fill variables for Circular Permutation
    mobile coord `xa` and seq `seqx` variables are concatenated twice to create `_cp` versions)-->id2(`TMalign_main` called, 
    uses `xa_cp` and `seqx_cp` variables instead of their singular representatives as well as:
    i_opt= 0 // don't use pre-defined alignment
    a_opt= false // don't use average length to normalize TMscores
    u_opt= true // do use user-defined length to normalize TMscore
    - in relation to u_opt -  Lnorm_ass= Lnorm_tmp - just the min of xlen,ylen - // normalize by min length of the mobile and target...
    d_opt= false // don't use d0_scale even though its read into the function
    fast_opt= true // run a quick alignment );
    id2-->id3(aligned sequences -- `A_cp` variants -- are used to detect ~semi~ sequence-order-independent alignments
    basically, a linear search along the duplicated mobile structure/sequence space allows for alignments with target that don't follow the linear search that `TMalign_main` is usually stuck in);
    id3-->id4(find the circular permutation point that maximizes the number of aligned residues
    save the index for this residue in `cp_point`);
    id4-->id5(`TMalign_main` called to run standard fTM-align, 
    uses `xa` and `seqx` variables instead of the `_cp` variants as well as variables defined above.);
    id5-->id6(check which `TMalign_main` call returned a greater number of aligned residues
    if the 2nd performed better, then `cp_point=0`);
    id6-->id7(`TMalign_main` 
    but this time a full run with user-defined parameters instead of hard coded values as above);
    id6-->|`if cp_point!=0`|id8(refill the `xa_cp`, `seqx_cp`, and `secx_cp` arrays with values relevant to the largest circular permutation point in the original `TMalign_main` call.);
   id8-->id9(`TMalign_main` called, 
   uses the `_cp` variants altered as above. same parameters as previous calls.);
   id9-->id10(once more check if the non-CP `TMalign_main` call aligned fewer residues than this latest `TMalign_main` call
   if it didn't, then go back to the original `xa`, `seqx`, `secx` array values);
   id10-->id7;
   id7-->|`if cp_point>0`|id11(fix sequence alignment strings to account for the circular permutation point that was identified);
   id7-->|else|id12(aligned sequences are unaffected from the last `TMalign_main` call);
   id11-->id13(clean up and return the sequence position where the largest circular permutation happens);
   id12-->id13;