I'm thinking about implementing a flag in the csss_[ID].txt file to indicate whether the file came from CheSPI or a custom user, since the user could add something like L+: (1-40, 0.95)(60-96, 0.34) \n H+: (1-40, 0.05)(41-59, 1.0) \n E+: (60-96, 0.66) without needing CheSPI at all. The flag can go under PRIMARY SEQ: as TYPE: (options including "CheSPI" or "Custom").
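For illustration, a hypothetical csss_[ID].txt with this flag might look like the following. Only the TYPE:, PRIMARY SEQ:, and per-code probability lines come from the discussion above; the exact layout is an assumption, not the final format:

```
TYPE: Custom
PRIMARY SEQ: <sequence goes here>
L+: (1-40, 0.95)(60-96, 0.34)
H+: (1-40, 0.05)(41-59, 1.0)
E+: (60-96, 0.66)
```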
No -dr flag would be needed for CSSS, as -csss would already indicate which SS to sample from. If the user passes -csss instead of -dr, another function like prepare_slice_dict() would be used to subset the main database into a database of dictionaries with an additional SS layer between chunk size and AA + torsion angles, e.g. {1: {'H': {'AA': [slice(X, Y, None)]}}, 2: …}.
To solve the issue of overlapping probabilities across chunk sizes, an average probability will be calculated per chunk for each SS. E.g. for {3: {'H': {'MAG': …}}}, the average of the per-residue probabilities of 'H' for M, A, and G will be used (a small sketch follows below).
After probabilistically subsetting the main database, the building step will check residue positions and sample from the corresponding DSSP regions, allowing some flexibility in chunk size and overlap so that coherent secondary structures are maintained.
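A minimal Python sketch of the chunk averaging, assuming a per-residue probability mapping (`csss_prob` and `chunk_probability` are placeholder names for illustration, not idpconfgen code):

```python
# Hypothetical: csss_prob[resid][ss] holds the CheSPI/user probability that
# residue `resid` adopts DSSP code `ss`.
def chunk_probability(csss_prob, start, end, ss):
    """Average the probability of DSSP code `ss` over residues start..end (inclusive)."""
    probs = [csss_prob[i].get(ss, 0.0) for i in range(start, end + 1)]
    return sum(probs) / len(probs)

# e.g. a chunk of size 3 covering residues 41-43, considered as helix:
csss_prob = {41: {"H": 1.0}, 42: {"H": 1.0}, 43: {"H": 0.9, "L": 0.1}}
print(chunk_probability(csss_prob, 41, 43, "H"))  # ~0.967
```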
Upon further investigation, I may need to operate at the read_db_to_slices_given_secondary_structure() function: if we can select which SS torsion angles go into the first subsetted database, then we can keep the prepare_slice_dict() function as is, since slices alone do not indicate which SS is being sampled from. E.g. instead of keeping every phi | psi | omega for an AA, we would only save the torsion angles of a specific secondary structure for that AA, which needs a new column: SS | phi | psi | omega.
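A toy sketch of that SS | phi | psi | omega idea, i.e. keeping only the torsion rows observed in a given secondary structure (the arrays below are dummy placeholders, not the real database):

```python
import numpy as np

ss_codes = np.array(list("HHHLLEE"))          # DSSP code per residue
torsions = np.random.rand(7, 3) * 360 - 180   # dummy phi, psi, omega per residue

helix_only = torsions[ss_codes == "H"]        # subset the angle table to helical residues
print(helix_only.shape)                       # (3, 3)
```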
Major changes due to this logic:
- csssconv will now default to storing CheSPI values as L+, H+, E+, and G+, since that is how the database is organized. However, if the user wants more control, they can pass the --verbose flag and freely change any of H/G/I/E/ /T/S/B to a regex; having only "H" as a dssp_regex flag would restrict chunk sizes to 1, which we don't want (see the regex sketch after this list).
- cli_build can now understand ANY combination of regex secondary structure (if the user input is correct) because it no longer relies on positional search.
- read_db_to_slices...() will now also return the concatenated secondary-structure DSSP codes along with the primary sequence and torsion angles. This is because I plan to operate in prepare_slice_dict() so that different secondary structures can be mixed into the same sequences with different chunk sizes (this is the probabilistic part).
- Ideas for the building phase have not changed yet; so far it still uses positional residue IDs (e.g. S82) to determine which torsion angle to pull. Hopefully our meeting tomorrow will iron out design/implementation flaws.
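To show why a bare "H" regex pins chunk sizes to 1 while "H+" follows the length of the helical run (plain regex behaviour, not the project's matching code):

```python
import re

dssp = "LLLHHHHHLLEEEE"

# "H" matches single residues, so every chunk has length 1:
print([m.span() for m in re.finditer(r"H", dssp)])   # [(3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]

# "H+" matches the whole helical run, so chunk size follows the structure:
print([m.span() for m in re.finditer(r"H+", dssp)])  # [(3, 8)]
```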
Notes regarding latest commit ef886bc:
- csssconv will now store the CheSPI output in a .JSON dictionary for readability and modularity.
- cli_build.py will now check that the probabilities for each residue sum to ~1.000 and normalize them if not.
- definitions.py fixed to include the missing 3-10 helices "G".
Next steps:
- prepare_slice_dict() will return {1: {'L+': {'A': [slices()]}}}, where the first key layer is chunk size, the second key layer is the DSSP code (not regex), and the third key layer is the primary sequence.
- func(aidx) to accept dictCSSS; when selecting for angles it will accept (e.g.) angles = db[RC(slice_dict[plen][pcss][pt_sub]), :].ravel() where pcss = RC(dictCSSS, p=SS_prob).
- Let the user provide csss_[ID].json files if CheSPI data is unavailable or if they want to try different things.
Thanks for the detailed messages here.
> cli_build.py will now check that the probabilities for each residue ~= 1.000, if not, normalize

And if normalizing doesn't work, meaning the numbers are not multiples, make it raise an error with an explanatory message.
Good luck with the next steps!
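A minimal sketch of that normalize-or-raise check, assuming a hypothetical helper (this is not the actual cli_build.py code, and the "numbers are not multiples" case is interpreted loosely as "the probabilities cannot be rescaled"):

```python
import numpy as np

def normalize_residue_probs(probs, tol=1e-3):
    """Return probabilities rescaled to sum to 1.0, or raise if they cannot be fixed."""
    probs = np.asarray(probs, dtype=float)
    total = probs.sum()
    if abs(total - 1.0) <= tol:
        return probs
    if total <= 0:
        raise ValueError(
            f"CSSS probabilities sum to {total:.3f} for this residue "
            "and cannot be normalized; please revise the input."
        )
    return probs / total

print(normalize_residue_probs([0.2, 0.2, 0.2]))  # -> [0.333... 0.333... 0.333...]
```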
Notes on e160564:
I spent quite a bit of time (~8 hours) trying to implement this in the best way possible. Based on the CSSS dssp regexes, the prepare_slice_dict() function now returns a 3-layer dictionary: 1st layer = chunk sizes, 2nd layer = dssp_regex, 3rd layer = primary sequence and the respective slices. This was super rewarding to figure out in the end.
> And if normalizing doesn't work, meaning the numbers are not multiples, make it raise an error with an explanatory message.
I will limit test this tomorrow. I think it should be okay as I've normalized/reassigned probabilities based on partial-sums.
If everything goes well tomorrow I think we can proceed to test with different sequences and CSSS input files by Friday. Fingers crossed!
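A toy sketch of the probabilistic sampling step over that 3-layer dictionary (slice_dict, torsions, and the probabilities below are placeholders, not idpconfgen's real data or helpers):

```python
import numpy as np

# 3-layer structure from the notes above: chunk size -> dssp_regex -> sequence -> slices.
slice_dict = {
    1: {
        "L+": {"A": [slice(0, 1)]},   # points at the loop-like row below
        "H+": {"A": [slice(1, 2)]},   # points at the helix-like row below
    },
}

torsions = np.array([
    [-75.0, 145.0, 180.0],   # dummy loop-like phi/psi/omega
    [-60.0, -45.0, 180.0],   # dummy helix-like phi/psi/omega
])

codes = ["L+", "H+"]
probs = [0.3, 0.7]                       # np.random.choice needs these to sum to 1.0
ss = np.random.choice(codes, p=probs)    # pick a DSSP code for this residue/chunk
chosen = slice_dict[1][ss]["A"][0]
angles = torsions[chosen, :].ravel()     # pull the torsions from the chosen region
print(ss, angles)
```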
Notes on 2907d38:
- np.random.choice requires the probabilities to sum to 1.0.
- Checked slice_dict and what angles are, and they seem okay; still crashing though. Need to debug.
Notes on 0edf6bc:
- prepare_slice_dict() now returns the correct data structure for CSSS.
Next steps:
- idpconfgen build -dr ANY flag for nCr of secondary-structure sampling.
- idpconfgen makecsss (CLI function for the user to make csss input files if CS data is not available).
Notes on 563099d:
- I've changed the data structure of slice_dict to {1: {'A': {'L+': [slices()], ...}, ...}, ...}; this solves the issue of accidental popping breaking the whole data structure. The way I've implemented it also takes empty SS into consideration, so no further checks need to be done for empty keys (see the access sketch below).
- example for reproducibility.
- A ValueError is now raised when a SS cannot be found in the database.
- idpconfgen csssconv updated based on comments and logic.
- prepare_slice_dict() updated for the case of CSSS.
Great changes. Looking forward to the next ones :+1:
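A hypothetical illustration of the reordered slice_dict nesting described in the 563099d notes (chunk size -> sequence -> DSSP regex -> slices), with toy values, showing why empty SS need no extra checks:

```python
# Empty SS are simply absent from the innermost dict, so nothing needs popping.
slice_dict = {
    1: {
        "A": {"L+": [slice(0, 1)], "H+": [slice(1, 2)]},
        "G": {"L+": [slice(2, 3)]},   # no "H+" key here: that SS is empty for "G"
    },
}

available = slice_dict[1]["G"]    # only codes that actually have slices show up
print(list(available))            # ['L+']
```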
I see the csss.json you sent by email has several decimal places. Do you think you can reduce it to 3 decimal places, with round(..., 3)? If it still works, then 3 is better.
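One generic way to get 3-decimal values into the JSON output (a sketch only, not the csssconv code, and the dictionary structure below is illustrative):

```python
import json

# Hypothetical per-residue probabilities keyed by residue number.
data = {"1": {"H": 0.051234, "L": 0.948766}}

rounded = {
    resid: {ss: round(p, 3) for ss, p in probs.items()}
    for resid, probs in data.items()
}
print(json.dumps(rounded, indent=2))   # values now have at most 3 decimal places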
Great addition @menoliu! Congrats :clap:
Formatting might change depending on how we approach CSSS. Right now I'm thinking about making a novel sub-setting function based on prepare_slice_dict() (which will only be used if -csp is activated), but this new one will have DSSP codes in the dictionary, so when we're building we will only sample torsion angles probabilistically from certain secondary structure codes.