MC-SCE SCM Integration Issue Building Longer Proteins

julie-forman-kay-lab / IDPConformerGenerator

Build conformational representations of Intrinsically Disordered Proteins and Regions by a guided sampling of the protein torsion space

https://idpconformergenerator.readthedocs.io/

Apache License 2.0

19 stars 6 forks source link

MC-SCE SCM Integration Issue Building Longer Proteins #191

Closed menoliu closed 2 years ago

menoliu commented 2 years ago

ValueError: shape mismatch when trying to build longer AA proteins such as Asyn (140 AA).
Memory "leak" when undergoing multicore processing. Prepare energy functions runs once for every worker leading to great use of RAM. Should only run once for all workers.

Test line: idpconfgen build -db ./idpconfgen_database.json -seq MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA -etbb 100 -et 'pairs' -scm mcsce --mcsce-n_trials 50 -rs 0 -n 5 -nc 10 -of ./SCM_MCSCE_test/

joaomcteixeira commented 2 years ago

the matrix are not gettnig the expected size. So there's something there that MCSCE is sending to idpconfgen that is different for that protein compared with other examples. Run idpconfgen only backbone, and try MCSCE separately to see what happens.

note that you have a mismatch in parameters (n_trials is not being used)

--mcsce-n_trials MCSCE_N_TRIALS
                        Sampling trials in exhaustive mode. Defaults to 10.

menoliu commented 2 years ago

Regarding the --mcsce-n_trials we've always used a custom number (128) in simple mode running mcsce standalone, maybe I had initial misunderstandings about understood "# trials"... However It seems to run slower though with greater success when I implement a higher number for --mcsce-n_trials than default

joaomcteixeira commented 2 years ago

unless I am missing something critical n_trails will only be used if --mcsce-mode exhaustive. That's what I understood from MCSCE code and wrote in the docs. @JerryJohnsonLee help please :wink:

JerryJohnsonLee commented 2 years ago

unless I am missing something critical n_trails will only be used if --mcsce-mode exhaustive. That's what I understood from MCSCE code and wrote in the docs. @JerryJohnsonLee help please 😉

In simple mode n_trials is also in effect, which is the maximum number of trials allowed before returning None, so increasing n_trials will make the success rate of difficult cases higher, but it also takes more time.

I am looking into the other issues.

JerryJohnsonLee commented 2 years ago

For the second issue, the reason is that the energy function calculator initialization resides inside init_mcsce_sidechains , which is a function executed in parallel by idpconfgen code itself. A solution is to move the mcsce initialization code out of the parallel executing code (maybe go into populate_globals?)

menoliu commented 2 years ago

Regarding the second issue, I believe there also might be an issue in idpconfgen. Running MCSCE standalone does not use up much memory actually, only around 1-3 GB for asyn. However just generating backbones with -n 32, idpconfgen uses up to 60GB of memory, is this intended?

menoliu commented 2 years ago

I can see now that there is another layer to the second issue. When running mcsce standalone, we can specify -w 32 and for e.g. 64 trials, each worker tries building the SC twice, but if we're building side chains on the fly in idpconfgen with -n 32 and e.g. 64 trials, each worker, after building the backbones tries to build the SC 64 times, totally 32*64 = 2048 tries to build side chains at any given time if -nc > -n 32. This is just my observations from my unit tests, let me know if I'm understanding correctly because that might be what's contributing to the memory issue more than just preparing the energy function for -n times

JerryJohnsonLee commented 2 years ago

Your understanding is basically correct. The only thing that I need to add is that by default MCSCE will be run under simple mode, and MCSCE will not use multiprocessing in simple mode.

JerryJohnsonLee commented 2 years ago

As for the first issue, the discrepancy comes from a bug in idpconfgen. For the histidines in the sequence, both idpconfgen and MCSCE build HIP residues, which should have HD1, HD2, HE1, HE2. However in the ALL_ATOM_LABELS created by idpconfgen there is only HD2 and HE1, and that is where the two atoms difference come from. @joaomcteixeira could you fix that?

menoliu commented 2 years ago

I found the issue for the HIP, I am implementing a fix right now. Seems like H is read as "HIS" first but translated to "HIP" later that's why it's missing HD1 and HE2.

EDIT: it works! PRing right now. Will not crash for histidine anymore :)

As for the memory issue, I think it's fine as that's just how both programs function, since it's not the reason for the crash we can optimize RAM later?