newtonjoo / deepfold

Protein 3D Structure Prediction with DeepFold
Apache License 2.0

Not able to run run_from_fasta.py #2

Open rmagesh148 opened 4 months ago

rmagesh148 commented 4 months ago

I am running into a "could not find CIFs" error when running run_from_fasta.py; please find the command below.

Command I am running:

python run_from_fasta.py --fasta_paths ./example_data/fasta/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /data-dir/ --output_dir ./out

E0530 15:38:56.284935 140079475242816 templates.py:849] Could not find CIFs in ./pdb_mmcif/mmcif_files
Traceback (most recent call last):
  File "run_from_fasta.py", line 269, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_from_fasta.py", line 186, in main
    obsolete_pdbs_path=FLAGS.obsolete_pdbs_path)
  File "/app/deepfold/deepfold/data/templates.py", line 850, in __init__
    raise ValueError(f'Could not find CIFs in {self._mmcif_dir}')
ValueError: Could not find CIFs in ./pdb_mmcif/mmcif_files
root@24862c4675ea:/app/deepfold# 
cy3 commented 4 months ago

Thank you for reporting this error. We have addressed the issue with the path settings and have committed the updated version. Please review the changes and let us know if you encounter any further issues.

rmagesh148 commented 4 months ago

Thank you for the response, but I am seeing the same error again. I pulled the latest code, rebuilt the image, and tried running it again. Please see the attached screenshot.

rmagesh148 commented 4 months ago
(Screenshot of the same error attached.)
cy3 commented 4 months ago

You need to replace /path/to/database with your actual database path.

rmagesh148 commented 4 months ago

I tried with the complete local path, /Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir, and I also tried with the actual path inside the Docker container, which is `/app/deepfold/data_dir`.

Unfortunately, both runs are failing. Thanks!
rmagesh148 commented 4 months ago

python run_from_fasta.py --fasta_paths ./example_data/fasta/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /app/deepfold/data_dir --output_dir ./out

python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir --output_dir ./out

cy3 commented 4 months ago

What is the output of python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /app/deepfold/data_dir --output_dir ./out?

rmagesh148 commented 4 months ago
(Screenshot of the output attached.)
cy3 commented 4 months ago

Could you please check whether the database files are properly downloaded and accessible inside the Docker container?

rmagesh148 commented 4 months ago
(Screenshot attached.)

I don't see any files inside the data_dir folder. All I did was build the Docker image and start it with docker run, without --gpus, since I am trying it on my local machine.

(Screenshot attached.)

I see that all the database files are properly loaded.

cy3 commented 4 months ago

Your local machine should have the following database files:

$DOWNLOAD_DIR/                             # Total: ~ 2.62 TB (download: 556 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 120 GB (download: 67 GB)
        mgy_clusters_2022_05.fa
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 238 GB (download: 43 GB)
        mmcif_files/
            # About 199,000 .cif files.
        obsolete.dat
    pdb_seqres/                            # ~ 0.2 GB (download: 0.2 GB)
        pdb_seqres.txt
    small_bfd/                             # ~ 17 GB (download: 9.6 GB)
        bfd-first_non_consensus_sequences.fasta
    uniref30/                              # ~ 206 GB (download: 52.5 GB)
        # 7 files.
    uniprot/                               # ~ 105 GB (download: 53 GB)
        uniprot.fasta
    uniref90/                              # ~ 67 GB (download: 34 GB)
        uniref90.fasta

(Check the download_all_data.sh script to download these files.)

You can use the -v option to mount this folder to your Docker container.
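
For example, if the databases were downloaded to a folder on the host (the host path, container path, and image tag below are placeholders for your own setup):

docker run -it -v /path/to/downloaded/data:/app/deepfold/data deepfold:latest bash

and then pass --data_dir /app/deepfold/data to run_from_fasta.py inside the container.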

rmagesh148 commented 4 months ago

May I know why I need to download these database files onto my machine/server just to get the features of a FASTA file?

cy3 commented 4 months ago

The standard pipeline requires an extensive database search to obtain the features of a FASTA file.

rmagesh148 commented 2 months ago

Hi again! I am running into an issue where I am not able to install the mpi-jax dependency; the installation is failing. Could you please help me with that?

(Screenshot of the installation error attached.)

Thanks!

rmagesh148 commented 1 month ago

Hi @cy3: Could you please respond? I ran the download_all_data.sh script and these are the folders I have. I believe some of the folders are missing. Could you please take a look? Thanks!

(Screenshot of the downloaded folders attached.)
rmagesh148 commented 1 month ago

Command Executed:

python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths ./data/params/model1.npz --data_dir ./data/ --output_dir ./out

I ran the above command and it failed with the error below. Please take a look and help me out with this.

root@597d3d2c0192:/app/deepfold# python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths ./data/params/model1.npz --data_dir ./data/ --output_dir ./out
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --data_dir has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --uniref90_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --mgnify_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --pdb70_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --template_mmcif_dir has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --max_template_date has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --obsolete_pdbs_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
I0819 18:26:00.632040 139699742209856 templates.py:869] Using precomputed obsolete pdbs ./data/pdb_mmcif/obsolete.dat.
E0819 18:26:00.635931 139699742209856 hhblits.py:82] Could not find HHBlits database ./data/uniclust30/UniRef30_2020_06/UniRef30_2020_06
Traceback (most recent call last):
  File "run_from_fasta.py", line 280, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_from_fasta.py", line 210, in main
    use_small_bfd=False)
  File "/app/deepfold/deepfold/data/pipeline.py", line 110, in __init__
    databases=[bfd_database_path, uniclust30_database_path])
  File "/app/deepfold/deepfold/data/tools/hhblits.py", line 83, in __init__
    raise ValueError(f'Could not find HHBlits database {database_path}')
ValueError: Could not find HHBlits database ./data/uniclust30/UniRef30_2020_06/UniRef30_2020_06
root@597d3d2c0192:/app/deepfold# 
rmagesh148 commented 1 month ago

@newtonjoo @cy3 Could you please take a look at it and help me with it as soon as possible? Thanks!

cy3 commented 1 month ago

Could you verify the subfolder name under data/uniclust30? It may have been updated to a more recent version of the database. If that's the case, please adjust the following line to reflect the correct folder path:

Line to be updated: run_from_fasta.py#L73

rmagesh148 commented 1 month ago

Thank you for the response. This is what the uniclust30 subfolder looks like; there is no UniRef30_2020_06/UniRef30_2020_06 in it:

(deepfold-2) magesh@lambda-ai:~$ cd /media/exxact1/deepfold/data/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data$ cd uniclust30/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30$ ls -lrt
total 4
drwxr-xr-x 2 root root 4096 Aug 20 15:17 uniclust30_2018_08
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30$ cd uniclust30_2018_08/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30/uniclust30_2018_08$ ls -lrth
total 87G
-rw------- 1 528745 9100 3.6G Oct 11  2018 uniclust30_2018_08_cs219.ffdata
-rw------- 1 528745 9100 341M Oct 11  2018 uniclust30_2018_08_cs219.ffindex
-rw------- 1 528745 9100  65G Oct 11  2018 uniclust30_2018_08_a3m.ffdata
-rw------- 1 528745 9100 359M Oct 11  2018 uniclust30_2018_08_a3m.ffindex
-rw------- 1 528745 9100  14G Oct 11  2018 uniclust30_2018_08_hhm.ffdata
-rw------- 1 528745 9100 7.8M Oct 11  2018 uniclust30_2018_08_hhm.ffindex
-rw------- 1 528745 9100   19 Oct 11  2018 uniclust30_2018_08.cs219.sizes
-rw------- 1 528745 9100 3.8G Oct 11  2018 uniclust30_2018_08.cs219
-rw------- 1 528745 9100 417M Oct 11  2018 uniclust30_2018_08_a3m_db.index
lrwxrwxrwx 1 528745 9100   29 Oct 11  2018 uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
-rw------- 1 528745 9100 9.0M Oct 11  2018 uniclust30_2018_08_hhm_db.index
lrwxrwxrwx 1 528745 9100   29 Oct 11  2018 uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
-rw------- 1 528745 9100  767 Oct 11  2018 uniclust30_2018_08_md5sum
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30/uniclust30_2018_08$ 
cy3 commented 1 month ago

Changing UniRef30_2020_06/UniRef30_2020_06 to uniclust30_2018_08/uniclust30_2018_08 will resolve this issue.
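
In other words, the hard-coded uniclust30 path in the script has to match the folder you actually downloaded. A hypothetical sketch of the change around run_from_fasta.py#L73 (the exact code in the script may differ):

import os

data_dir = './data'  # placeholder for the --data_dir value

# Path the script currently expects (not present in this download):
old_path = os.path.join(data_dir, 'uniclust30', 'UniRef30_2020_06', 'UniRef30_2020_06')

# Path matching the uniclust30_2018_08 folder listed above:
new_path = os.path.join(data_dir, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')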

rmagesh148 commented 1 month ago

Hi @cy3: Thank you for your inputs. I am now able to get features.pkl for a given FASTA file, and I would like to understand a few more things about the output.

root@a88ef311693f:/app/deepfold# cd ./out/
root@a88ef311693f:/app/deepfold/out# ls
prot_00000
root@a88ef311693f:/app/deepfold/out# cd prot_00000/
root@a88ef311693f:/app/deepfold/out/prot_00000# ls
features.pkl  msas  ranked_0.pdb  ranking_debug.json  relaxed_model1.pdb  res_plddt_model1.txt  result_model1.pkl  timings.json  unrelaxed_model1.pdb
root@a88ef311693f:/app/deepfold/out/prot_00000# cd msas/
root@a88ef311693f:/app/deepfold/out/prot_00000/msas# ls
bfd_uniclust_hits.a3m  mgnify_hits.sto  pdb70_hits.hhr  uniref90_hits.sto
root@a88ef311693f:/app/deepfold/out/prot_00000/msas# cd ..
root@a88ef311693f:/app/deepfold/out/prot_00000# python   

>>> import pandas as pd
>>> a = pd.read_pickle('features.pkl')

>>> type(a)
<class 'dict'>

>>> a.keys()
dict_keys(['aatype', 'between_segment_residues', 'domain_name', 'residue_index', 'seq_length', 'sequence', 'deletion_matrix_int', 'msa', 'num_alignments', 'template_aatype', 'template_all_atom_masks', 'template_all_atom_positions', 'template_domain_names', 'template_sequence', 'template_sum_probs'])

>>> atom_pos = a['template_all_atom_positions']

>>> atom_pos.shape
(20, 105, 37, 3) 

In this features.pkl output, I am not sure which key gives me the embeddings of the protein. I assume template_all_atom_positions is the one that would give me the embeddings. If so, when I check the shape of template_all_atom_positions it is (20, 105, 37, 3), and I am not sure where this 20 is coming from. Could you please help me with that? Thank you! Much appreciated!

cy3 commented 1 month ago

The features.pkl file is used as the input for the model (the 20 is the number of templates). For embeddings, you can use result_model1.pkl.
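
To make the 20 concrete, the shape can be unpacked like this (a sketch; 37 is the fixed atom37 set of heavy-atom positions per residue, and 105 is the length of this particular chain):

>>> import pandas as pd
>>> a = pd.read_pickle('features.pkl')
>>> num_templates, num_res, num_atoms, num_coords = a['template_all_atom_positions'].shape
>>> (num_templates, num_res, num_atoms, num_coords)
(20, 105, 37, 3)
>>> # 20 templates, 105 residues, 37 atom positions per residue, xyz coordinates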

rmagesh148 commented 1 month ago
>>> a = pd.read_pickle('result_model1.pkl')
>>> a.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'shape'
>>> type(a)
<class 'dict'>
>>> a.keys()
dict_keys(['distogram', 'experimentally_resolved', 'masked_msa', 'predicted_lddt', 'structure_module', 'plddt'])

Which key gives me the representation/embeddings of the protein? Thanks!

cy3 commented 1 month ago

In this case, use data['representations']['structure_module'], which contains a 384-dimensional vector representation per amino acid.

rmagesh148 commented 1 month ago
>>> data = pd.read_pickle('result_model1.pkl')
>>> data.keys()
dict_keys(['distogram', 'experimentally_resolved', 'masked_msa', 'predicted_lddt', 'structure_module', 'plddt'])

There is no 'representations' key for the embeddings; I only see the keys shown above.

>>> data['structure_module'].keys()
dict_keys(['final_atom_mask', 'final_atom_positions', 'sidechains'])
cy3 commented 1 month ago

We have made a slight update to the FASTA run file to return representations as well. I apologize for the mistake.
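
Once you pull the update and re-run, reading the embeddings should look roughly like this (a sketch; it assumes the updated result_model1.pkl gains a 'representations' entry holding the structure-module output, one 384-dimensional vector per residue):

>>> import pandas as pd
>>> data = pd.read_pickle('result_model1.pkl')
>>> emb = data['representations']['structure_module']  # assumed location after the update
>>> emb.shape
(105, 384)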

rmagesh148 commented 1 month ago

Thank you for the response. When I try to run DeepFold, it runs on the CPU instead of my GPU servers.

I0822 21:53:06.209507 139760043013952 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
2024-08-22 21:53:06.237648: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_SYSTEM_NOT_READY: system not yet initialized
I0822 21:53:06.238188 139760043013952 xla_bridge.py:353] Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices.
I0822 21:53:06.238405 139760043013952 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter Host CUDA
I0822 21:53:06.238757 139760043013952 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0822 21:53:06.238858 139760043013952 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
W0822 21:53:06.238930 139760043013952 xla_bridge.py:360] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
/app/deepfold/deepfold/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/app/deepfold/deepfold/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/app/deepfold/deepfold/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
2024-08-22 21:57:20.910848: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:65] 
********************************
[Compiling module jit_apply_fn] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.

Do you have any suggestions as to why it is not picking up my GPUs?

cy3 commented 1 month ago
  1. Did you run Docker with GPU support using commands such as docker run --gpus all?

  2. What is the result of running nvidia-smi? Is the GPU properly recognized?
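
For a quick check, running nvidia-smi through Docker with GPU support should list your GPUs (the image tag here is a placeholder for whatever you built):

docker run --rm --gpus all deepfold:latest nvidia-smi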

rmagesh148 commented 1 month ago

Hey @cy3: I was able to fix the GPU issue and it is using the GPU now. But I have a few questions.

I0822 21:26:18.952993 139760043013952 inference_pipeline.py:54] processing file ./example_data/fasta/aa/1aac_1_A.fasta...
I0822 21:26:18.953632 139760043013952 jackhmmer.py:130] Launching subprocess "jackhmmer -o /dev/null -A /tmp/tmpzdixdawj/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./example_data/fasta/aa/1aac_1_A.fasta ./data/uniref90/uniref90.fasta"
I0822 21:26:18.974498 139760043013952 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0822 21:36:51.096411 139760043013952 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 632.121 seconds

The above run with 8 CPUs took ~600 seconds to finish, and when I ran it with 64 CPUs it took almost the same time; I am not sure how to improve the performance.

I0830 05:49:55.808584 139870974736192 run_from_fasta.py:244] Using random seed 181129 for the data pipeline
I0830 05:49:55.809070 139870974736192 inference_pipeline.py:54] processing file ./example_data/fasta/aa/ASDSF.fasta...
I0830 05:49:55.809468 139870974736192 jackhmmer.py:130] Launching subprocess "jackhmmer -o /dev/null -A /tmp/tmpvljhqrgk/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 64 -N 1 ./example_data/fasta/aa/ASDSF.fasta ./data/uniref90/uniref90.fasta"
I0830 05:49:55.828207 139870974736192 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0830 06:00:16.147896 139870974736192 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 620.319 seconds

So the run over these two FASTA files took about an hour to finish even with GPUs. Is there any way to improve the performance further?

rmagesh148 commented 1 month ago

Is DeepFold CPU-bound? Why are these subprocesses running on the CPU instead of the GPU? @cy3

cy3 commented 1 month ago

Jackhmmer, HHblits, and HHsearch are part of the feature-search pipeline for FASTA files. They run on the CPU, and these searches take a lot of time. The process can be improved by generating the features.pkl file on a separate CPU machine and then providing it to the pipeline.

rmagesh148 commented 1 month ago

@cy3: I have about 20k proteins to run; how long do you think it might take just to create the features.pkl files? In order to create those files, I need the ~5 TB database to be set up, right?

cy3 commented 1 month ago

If the pickle file is ready, the GPU machine doesn't need the entire database (only the PDB files for templates are required). Additionally, use as many CPU cores as possible; with a single process, processing 20k proteins would take approximately 15k hours. Another workaround is to utilize OpenProteinSet, which provides precomputed MSAs and templates.
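
If you do run the searches yourself, one way to use many cores is to process several FASTA files concurrently. Below is a rough sketch (the FASTA directory, worker count, and flag values are placeholders based on the commands earlier in this thread; each concurrent job needs enough RAM and I/O bandwidth for the databases):

import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

FASTA_DIR = "./fastas"  # placeholder: directory holding the FASTA files
N_PARALLEL = 8          # number of pipelines to run at once

def run_pipeline(fasta_path):
    # One CPU-bound feature-search run per FASTA file, each with its own output directory.
    name = os.path.splitext(os.path.basename(fasta_path))[0]
    cmd = [
        "python", "run_from_fasta.py",
        "--fasta_paths", fasta_path,
        "--model_names", "model1",
        "--model_paths", "./data/params/model1.npz",
        "--data_dir", "./data/",
        "--output_dir", f"./out/{name}",
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    fastas = sorted(glob.glob(os.path.join(FASTA_DIR, "*.fasta")))
    with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
        for fasta, rc in zip(fastas, pool.map(run_pipeline, fastas)):
            print(fasta, "exit code:", rc)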