[FEATURE] Custom Secondary Structure Sampling

menoliu commented 2 years ago

Formatting might change depending on how we approach CSSS. Right now I'm thinking about writing a new subsetting function based on prepare_slice_dict() (which will only be used if -csp is activated). This new one will have DSSP codes in the dictionary, so that when we're building we only sample torsion angles probabilistically from certain secondary structure codes.

menoliu commented 2 years ago

I'm thinking about implementing a flag in the csss_[ID].txt file to indicate whether the file came from CheSPI or from a custom user input. The user could add something like L+: (1-40, 0.95)(60-96, 0.34) \n H+: (1-40, 0.05)(41-59, 1.0) \n E+:(60-96, 0.66) without the need for CheSPI if desired. The flag can go under PRIMARY SEQ: as TYPE: (options including "CheSPI" or "Custom").
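A minimal sketch of how the range notation in that example could be turned into per-residue probabilities. The function name, the regexes and the returned layout are assumptions for illustration only, not the idpconfgen parser.

```python
import re
from collections import defaultdict

# Hypothetical parser for the "SS: (start-end, prob)(start-end, prob)" notation.
LINE_RE = re.compile(r"^\s*(?P<ss>[^:\s]+)\s*:\s*(?P<body>.*)$")
RANGE_RE = re.compile(r"\(\s*(\d+)\s*-\s*(\d+)\s*,\s*([0-9.]+)\s*\)")


def parse_custom_csss(text):
    """Return {residue_number: {ss_code: probability}}."""
    per_residue = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank or unrecognised lines
        ss = match.group("ss")
        for start, end, prob in RANGE_RE.findall(match.group("body")):
            # residue ranges are treated as inclusive on both ends
            for resnum in range(int(start), int(end) + 1):
                per_residue[resnum][ss] = float(prob)
    return dict(per_residue)


example = (
    "L+: (1-40, 0.95)(60-96, 0.34)\n"
    "H+: (1-40, 0.05)(41-59, 1.0)\n"
    "E+:(60-96, 0.66)"
)
table = parse_custom_csss(example)
```

With the example above, every residue from 1 to 96 ends up with probabilities that sum to 1 across its listed SS codes.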

No -dr flag would be needed for CSSS, as -csss would indicate which SS to sample from. If the user has -csss instead of -dr, another function like prepare_slice_dict() would be used to subset the main database into a database of dictionaries with an additional SS layer between chunk size and AA + torsion angles (e.g. {1: {'H': {'AA': [slice(X, Y, None)]}}, 2: …}).

To solve the issue of overlapping probabilities for different chunk sizes, an average probability will be calculated for each chunk size and SS. E.g. for {3 : {H : {'MAG' : ..., the average of the 'H' probabilities for M, A and G will be used.
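The averaging idea can be sketched as follows. The function name, the per-residue layout and the probability values are made up for illustration.

```python
def chunk_probability(per_residue, start, size, ss):
    """Average P(ss) over `size` consecutive residues starting at `start`."""
    probs = [per_residue.get(start + i, {}).get(ss, 0.0) for i in range(size)]
    return sum(probs) / size


# Made-up per-residue probabilities for a chunk 'MAG' at residues 1-3.
per_residue = {
    1: {"H": 0.9, "L": 0.1},  # M
    2: {"H": 0.6, "L": 0.4},  # A
    3: {"H": 0.3, "L": 0.7},  # G
}
p_h = chunk_probability(per_residue, start=1, size=3, ss="H")
p_l = chunk_probability(per_residue, start=1, size=3, ss="L")
```

Because each residue's probabilities sum to 1, the averaged chunk probabilities also sum to 1 across SS codes.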

After probabilistically subsetting the main database, when building, check the positions and sample from different DSSP regions accordingly. Allow some flexibility for different chunk sizes and overlap to maintain coherent secondary structures.

menoliu commented 2 years ago

Upon further investigation, I may need to operate at the read_db_to_slices_given_secondary_structure() function: if we can select which SS torsion angles go into the first subsetted database, then we can keep the prepare_slice_dict() function as is, since slices do not indicate which SS is being sampled from.

E.g. instead of having every phi | psi | omega for an AA, we will only save the torsion angles of a specific secondary structure for each AA, thus needing a new column: SS | phi | psi | omega.

menoliu commented 2 years ago

Major changes due to logic:

csssconv will now default to storing CheSPI values as L+, H+, E+, and G+, as that's how the database is organized. However, if the user wants more modification, they can pass the --verbose flag and change any of the H/G/I/E/ /T/S/B codes to a regex, since having only "H" as a dssp_regex flag would restrict chunk sizes to 1 (we don't want this).
CSSS parser in cli_build can now understand ANY combination of regex secondary structure (if the user input is correct), because it no longer relies on positional search.
read_db_to_slices...() will now also return the concatenated secondary structure DSSP codes along with the primary sequence and torsion angles. This is because I plan to do the operations in prepare_slice_dict(), so that different secondary structures can be mixed into the same sequences with different chunk sizes (this is the probabilistic part).
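To illustrate why a bare "H" limits chunk sizes to 1 while "H+" does not, here is a small sketch with a made-up DSSP string. It assumes segments are located by regex matching over the DSSP string; the helper name is hypothetical.

```python
import re

dssp = "LLHHHHLLEE"  # made-up DSSP string for illustration


def segment_lengths(regex, dssp_string):
    """Lengths of all non-overlapping matches of `regex` in the DSSP string."""
    return [m.end() - m.start() for m in re.finditer(regex, dssp_string)]


single = segment_lengths("H", dssp)   # each match covers one residue
runs = segment_lengths("H+", dssp)    # one match covers the whole helix
```

With "H" every matched segment is one residue long, so only chunks of size 1 can be drawn from it; "H+" keeps the four-residue helix together.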

Ideas for the building phase have not changed yet; so far I'm still using positional residue IDs (e.g. S82) to determine which torsion angle to pull. Hopefully our meeting tomorrow will iron out design/implementation flaws.

menoliu commented 2 years ago

Notes regarding latest commit ef886bc:

csssconv will now store CheSPI output in a .JSON dictionary for readability and modularity purposes
cli_build.py will now check that the probabilities for each residue ~= 1.000, if not, normalize
definitions.py fixed to include missing 3-10 helices "G"

Next steps:

Add another layer of DSSP to prepare_slice_dict(), so the return will be { 1 : {'L+' : {'A' : [slices()]}}}, where the first key layer is chunk size, the second key layer is DSSP code (not regex), and the third key layer is primary sequence
Modify the build function func(aidx) to accept dictCSSS, so that when selecting angles it uses (e.g.) angles = db[RC(slice_dict[plen][pcss][pt_sub]), :].ravel(), where pcss = RC(dictCSSS, p=SS_prob)
Create a user-friendly CLI to create csss_[ID].json files if CheSPI data is unavailable or if user wants to try different things
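The first two steps can be sketched with toy data. This is an assumption-laden illustration: `db` is a plain list of (phi, psi, omega) rows rather than the real database array, the standard-library `random` module stands in for RC (presumably np.random.choice), and `sample_angles` is a hypothetical name.

```python
import random

# Toy database: one (phi, psi, omega) row per residue, values made up.
db = [
    (-1.0, -0.7, 3.1),
    (-1.1, -0.8, 3.1),
    (-2.1, 2.3, 3.1),
    (-2.2, 2.4, 3.1),
]

# chunk size -> DSSP code -> primary sequence -> list of slices into db
slice_dict = {
    2: {
        "H+": {"AG": [slice(0, 2, None)]},
        "E+": {"AG": [slice(2, 4, None)]},
    },
}


def sample_angles(db, slice_dict, plen, seq, ss_prob, rng=random):
    """Pick an SS code by probability, then a slice, then flatten its angles."""
    codes = list(ss_prob)
    pcss = rng.choices(codes, weights=[ss_prob[c] for c in codes], k=1)[0]
    chosen = rng.choice(slice_dict[plen][pcss][seq])
    # flatten the selected rows, mirroring the .ravel() in the note above
    angles = [angle for row in db[chosen] for angle in row]
    return pcss, angles


pcss, angles = sample_angles(
    db, slice_dict, 2, "AG", {"H+": 0.7, "E+": 0.3}, random.Random(0)
)
```

A chunk of two residues yields six angles (phi, psi, omega for each), drawn from whichever SS code was sampled.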

joaomcteixeira commented 2 years ago

Thanks for the detailed messages here.

> cli_build.py will now check that the probabilities for each residue ~= 1.000, if not, normalize

And if normalize doesn't work, meaning numbers are not multiples... make it raise an error with an explanatory message.

Good luck with the next steps!
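A minimal sketch of the normalize-or-raise behaviour discussed here. The function name and the exact failure conditions (negative, non-finite, or all-zero values) are assumptions, not what cli_build.py does.

```python
import math


def normalize_probs(ss_prob):
    """Rescale {ss_code: probability} to sum to 1, or raise with an explanation."""
    values = list(ss_prob.values())
    if any(not math.isfinite(v) or v < 0 for v in values):
        raise ValueError(
            f"Cannot normalize {ss_prob!r}: probabilities must be finite and >= 0."
        )
    total = math.fsum(values)
    if total <= 0:
        raise ValueError(
            f"Cannot normalize {ss_prob!r}: probabilities sum to {total}."
        )
    return {ss: v / total for ss, v in ss_prob.items()}


fixed = normalize_probs({"H+": 0.5, "L+": 0.3, "E+": 0.1})
```

Here the input sums to 0.9, so each value is divided by 0.9; an all-zero residue raises instead of silently producing NaN.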

menoliu commented 2 years ago

Notes on e160564: I spent quite a bit of time (~8 hours) trying to implement this in the best way possible. Based on CSSS's DSSP regexes, the prepare_slice_dict() function now returns a 3-layer dictionary with 1st layer = chunk sizes, 2nd layer = dssp_regex, 3rd layer = primary sequence and respective slice. This was super rewarding to figure out in the end.

> And if normalize doesn't work, meaning numbers are not multiples... make it raise an error with an explanatory message.

I will limit-test this tomorrow. I think it should be okay, as I've normalized/reassigned probabilities based on partial sums.

If everything goes well tomorrow I think we can proceed to test with different sequences and CSSS input files by Friday. Fingers crossed!

menoliu commented 2 years ago

Notes on 2907d38:

Sum of probabilities of secondary structures will always equal 1.0 (cannot use near-1.0000 with tolerances, because np.random.choice in the build function requires the probabilities to sum to 1.0)
Theoretically the builder should work: I've checked through the data structures of slice_dict and what the angles are, and they seem okay. Still crashing though; need to debug.

The for loop is there to sum up probabilities for each residue based on secondary structures. I've kept it there in the case that some residues have unique secondary structure combinations

menoliu commented 2 years ago

Notes on 0edf6bc:

Fixed prepare_slice_dict() to return correct data-structure for CSSS
Building with CSSS is now possible

Next steps:

Edge case testing
idpconfgen build -dr ANY flag for nCr of secondary structure sampling
Testing with real CS data (e.g. DrkSH3)
idpconfgen makecsss (CLI function for user to make csss input files if CS data is not available)

menoliu commented 2 years ago

Notes on 563099d:

I've changed the data structure of slice_dict to {1 : {'A' : {"L+": [slices()] ...}, ...} ...}; this solves the issue with accidental popping and breaking of the whole data structure. The way I've implemented it also takes empty SS into consideration, so no further checks need to be done for empty keys.
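A rough sketch of how a structure with that layout (chunk size, then primary sequence, then SS regex) could be built so that SS keys only exist when at least one slice was found. The function name, arguments and toy strings are assumptions; the real prepare_slice_dict() may work differently.

```python
import re
from collections import defaultdict


def build_slice_dict(seq, dssp, chunk_sizes, ss_regexes):
    """Map chunk size -> primary-sequence chunk -> SS regex -> list of slices."""
    out = {}
    for size in chunk_sizes:
        by_seq = defaultdict(lambda: defaultdict(list))
        for i in range(len(seq) - size + 1):
            window = dssp[i:i + size]
            for regex in ss_regexes:
                # keep the chunk only if the whole window matches this SS regex
                if re.fullmatch(regex, window):
                    by_seq[seq[i:i + size]][regex].append(slice(i, i + size))
        # convert to plain dicts; keys are created only on a match
        out[size] = {chunk: dict(ss) for chunk, ss in by_seq.items()}
    return out


sd = build_slice_dict(
    "MAGMAG", "HHHLLL", chunk_sizes=(1, 3), ss_regexes=("H+", "L+", "E+")
)
```

Since "E+" never matches in this toy input, no "E+" key is created anywhere, which is the property that removes the need for later empty-key checks.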

Corrected example for reproducibility
ValueError now recognized when a SS cannot be found in the database
Major cleanup of idpconfgen csssconv based on comments and logic
Optimized prepare_slice_dict() in the case of CSSS

joaomcteixeira commented 2 years ago

Great changes. Looking forward to the next ones :+1:

joaomcteixeira commented 2 years ago

I see the csss.json you sent by email has several decimal places. Do you think you can reduce it to 3 decimal places? round(..., 3)? If it still works, then 3 is better.
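A small sketch of what rounding to 3 decimal places does to one residue's probabilities, and how far the sum can drift from 1 as a result. The function name and values are made up; whether the downstream check tolerates this drift is exactly the "if it still works" question.

```python
import json
import math


def round_probs(ss_prob, ndigits=3):
    """Round each probability and report how far the sum drifts from 1."""
    rounded = {ss: round(p, ndigits) for ss, p in ss_prob.items()}
    drift = 1.0 - math.fsum(rounded.values())
    return rounded, drift


rounded, drift = round_probs({"H+": 1 / 3, "L+": 1 / 3, "E+": 1 / 3})
text = json.dumps(rounded)  # what would be written to the .json file
```

In this worst-case style example each value becomes 0.333, so the residue sums to 0.999 and would need renormalizing when read back.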

joaomcteixeira commented 2 years ago

Great addition @menoliu ! Congratz :clap:

julie-forman-kay-lab / IDPConformerGenerator

[FEATURE] Custom Secondary Structure Sampling #159