kiharalab / DiffModeler

DiffModeler: a diffusion model based protein complex structure modeling tool.
https://em.kiharalab.org/algorithm/DiffModeler

read score issue #2

Closed jadolfbr closed 7 months ago

jadolfbr commented 7 months ago

I am getting this error:

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    fit_structure_chain(diff_trace_map,fitting_dict,fitting_dir,params)
  File "/home/jadolfbr/DiffModeler/modeling/fit_structure_chain.py", line 75, in fit_structure_chain
    read_score(new_score_dict,pdb_dir,output_path)
  File "/home/jadolfbr/DiffModeler/modeling/score_utils.py", line 40, in read_score
    listoldpdb = [x for x in os.listdir(pdb_dir) if ".pdb" in x]
FileNotFoundError: [Errno 2] No such file or directory: '/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/fit_experiment_0/PDB'

The last things printed were:

origin          : (79., 42., 38.)
map             : b'MAP '
machst          : [68 68  0  0]
rms             : 0.4092388451099396
nlabl           : 1
label           : [b'Created by mrcfile.py                                       2024-02-08 17:59:07 '
 b'' b'' b'' b'' b'' b'' b'' b'' b'']
/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_assembling/iterative_B existed
WARNING: Use StructureBlurrer.gaussian_blur_real_space_box()to blured a map with a user defined defined cubic box
wang3702 commented 7 months ago

Thank you for your interest in DiffModeler! I think your VESPER fitting step failed; my guess is that VESPER was not configured correctly. Could you please provide your output results (everything under Predict_Result)? You can send the zipped results to my email, wang3702@uw.edu. Alternatively, you can paste the VESPER output log here: /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/vesper_simuoutput*.out. If you cannot find that output, the VESPER failure is confirmed. To debug, we would also need your command line and output files.
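The check described in the comment above can be scripted. This is a minimal sketch using only the Python standard library; the helper name `diagnose_vesper_fit` is hypothetical and not part of DiffModeler, and the directory layout (`structure_modeling/<chain>/fit_experiment_0/PDB`) is assumed from the paths quoted in this issue. The default log pattern `vesper_simu*` is an assumption chosen to match both names that appear in this thread (`vesper_simuoutput*.out` and `vesper_simu_output_*.txt`).

```python
import glob
import os


def diagnose_vesper_fit(chain_dir, experiment="fit_experiment_0",
                        log_pattern="vesper_simu*"):
    """Report whether VESPER left behind the files that read_score lists.

    Hypothetical helper: the directory layout comes from the paths quoted in
    this issue, not from DiffModeler's source code.
    """
    pdb_dir = os.path.join(chain_dir, experiment, "PDB")
    pdb_dir_exists = os.path.isdir(pdb_dir)
    # Same ".pdb" filter as the failing line in score_utils.read_score,
    # but os.listdir is only called when the directory actually exists.
    pdb_files = [x for x in os.listdir(pdb_dir) if ".pdb" in x] if pdb_dir_exists else []
    # VESPER simulation logs expected next to the fit_experiment_* folders.
    logs = sorted(glob.glob(os.path.join(chain_dir, log_pattern)))
    return {
        "pdb_dir": pdb_dir,
        "pdb_dir_exists": pdb_dir_exists,
        "num_pdb_files": len(pdb_files),
        "vesper_logs": logs,
    }
```

For example, if `diagnose_vesper_fit("Predict_Result/6824/structure_modeling/A")` reports `pdb_dir_exists: False`, that matches the traceback in this issue: read_score calls os.listdir on a PDB directory that VESPER never created.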

wang3702 commented 7 months ago

Also, feel free to use our server at https://em.kiharalab.org/algorithm/DiffModeler. I have not seen such errors on our server yet.

jadolfbr commented 7 months ago

Thanks for the quick reply! I cloned VESPER, but when I ran DiffModeler, it seemed to be setting VESPER up automatically? I didn't see any VESPER instructions on the DiffModeler page; did I miss them somewhere? Maybe I can try to set it up properly and re-run, and if there are still problems, I will send the results?

jadolfbr commented 7 months ago

Alright, the output is 1.9 GB, so we may need to grab specific components of it. I have the logs (including the vesper_simu output) and pretty much everything up to the point where it crashed.

What would be most helpful to send?

jadolfbr commented 7 months ago

Alright, I see all the setup in VESPER_CUDA. Part of that setup installs it in a separate environment (conda activate vesper_cuda). Are you activating this environment in the DiffModeler script, or are you using the same DiffModeler environment to call both?

wang3702 commented 7 months ago

It is configured automatically, so you do not need to configure it again. Then please share this file with us: /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/vesper_simuoutput*.out.

wang3702 commented 7 months ago

Also, could you please list all the generated files under /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/? And could you provide the command line you used to run DiffModeler?

jadolfbr commented 7 months ago

Here is the full error output. I ran it multiple times and confirmed that the GPU is blocking: the error occurs every time (4 separate runs), and only one person (me) is using this GPU, as confirmed by nvidia-smi.

cmd: python3 main.py --mode=0 -F=example/6824.mrc -P=example -M=example/input_info.txt --config=config/diffmodeler.json --contour=2 --gpu=0 --resolution=5.8

Full error:

sampling loop time step: 100%|██████████| 100/100 [00:19<00:00,  5.04it/s]
Traceback (most recent call last):
  File "/home/jadolfbr/DiffModeler/VESPER_CUDA/main.py", line 183, in <module>
    fitter = MapFitter(
  File "/home/jadolfbr/DiffModeler/VESPER_CUDA/fitter.py", line 147, in __init__
    self.ldp_atoms = torch.from_numpy(np.array(ldp_atoms)).to(self.device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    fit_structure_chain(diff_trace_map,fitting_dict,fitting_dir,params)
  File "/home/jadolfbr/DiffModeler/modeling/fit_structure_chain.py", line 75, in fit_structure_chain
    read_score(new_score_dict,pdb_dir,output_path)
  File "/home/jadolfbr/DiffModeler/modeling/score_utils.py", line 40, in read_score
    listoldpdb = [x for x in os.listdir(pdb_dir) if ".pdb" in x]
FileNotFoundError: [Errno 2] No such file or directory: '/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/fit_experiment_0/PDB'

Attachments: vesper_simu_output_0.txt, [vesper_simu_output_1.txt](https://github.com/kiharalab/DiffModeler/files/14223400/vesper_simu_output_1.txt)

Contents of structure modeling:

-rw-r--r--  1 jadolfbr  staff    19K Feb  8 18:22 vesper_simu_output_1.txt
-rw-r--r--  1 jadolfbr  staff   365K Feb  8 18:22 top1.pdb
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_0_tmp.mrc
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_0.mrc
drwxr-xr-x  3 jadolfbr  staff    96B Feb  8 18:22 fit_experiment_0
-rw-r--r--  1 jadolfbr  staff    38K Feb  8 18:22 vesper_log
-rw-r--r--  1 jadolfbr  staff    21K Feb  8 18:22 score.pkl
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_1_tmp.mrc
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_1.mrc
drwxr-xr-x  3 jadolfbr  staff    96B Feb  8 18:22 fit_experiment_1
-rw-r--r--  1 jadolfbr  staff    19K Feb  8 18:23 vesper_simu_output_0.txt
AntiMatter568 commented 7 months ago

It is possible that your GPU's compute mode is set to Exclusive Process. You can check this on the right-hand side of the nvidia-smi output panel: if it shows "E. Process" for that GPU, VESPER_CUDA will not work, because it shares the CUDA context across threads. I think the temporary fix would be either to change the compute mode to Default, or to change "thread: 6" to "thread: 1" in the "vesper" section of DiffModeler's config/diffmodeler.json file. I have no way to validate the latter, though.
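Both checks suggested in the comment above can be scripted. This is a sketch under assumptions: it reads the compute mode with `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader` (in that query the mode is reported as Default, Exclusive_Process, or Prohibited rather than the abbreviated "E. Process" shown in the panel), and it assumes config/diffmodeler.json stores the setting as a top-level "vesper" object with an integer "thread" key, as the comment describes. The helper names are hypothetical, not part of DiffModeler or VESPER_CUDA.

```python
import json
import subprocess


def parse_compute_modes(csv_text):
    """Map GPU index -> compute mode from 'index, compute_mode' CSV lines."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Run nvidia-smi and return {gpu_index: compute_mode}."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_modes(out)


def set_vesper_threads(config_path, threads=1):
    """Rewrite the (assumed) "vesper" -> "thread" entry in the JSON config."""
    with open(config_path) as f:
        config = json.load(f)
    config["vesper"]["thread"] = threads
    # json.dump rewrites the whole file, so the original key order and
    # indentation may not be preserved exactly.
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

A possible usage: `if query_compute_modes().get(0) == "Exclusive_Process": set_vesper_threads("config/diffmodeler.json", 1)`. Keep in mind that the single-thread workaround is itself unvalidated, per the comment; changing the compute mode to Default is the other option, where the platform allows it.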

wang3702 commented 7 months ago

The comment above is from our VESPER_CUDA developer. Please see if his suggestion works for you; if it does, we will update the instructions.

jadolfbr commented 7 months ago

Thanks, both. Yes, it is in exclusive mode, and because this is an AWS VPCx, there is no way to change it (and changing it would not be good anyway, because of spot instances, etc.).

I will try the latter and get the code up and running on EC2/SageMaker, where I have more control over exclusive modes. Usually you want exclusive mode, so at first I thought someone else was trying to use the GPU! I will try this next week. Thanks for your timely input; it is very much appreciated!

wang3702 commented 7 months ago

Let us know if you still encounter problems. Glad to help.

jadolfbr commented 7 months ago

This was indeed the problem. Running on a SageMaker instance that allows concurrency on the GPU fixed it, though I do wish I could run it on other systems as well. Thanks for the help!