rmagesh148 opened 6 months ago
Thank you for reporting this error. We have addressed the issue with the path settings and have committed the updated version. Please review the changes and let us know if you encounter any further issues.
Thank you for the response, but I am seeing the same error again. I pulled the latest code, rebuilt it, and tried running it. Please see the attached screenshot.
You need to replace `/path/to/database` with your actual database path: `/Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir`.
I tried with the complete path, and I also tried with the actual path inside the Docker container, which is `/app/deepfold/data_dir`. Unfortunately both runs fail. Thanks!
```
python run_from_fasta.py --fasta_paths ./example_data/fasta/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /app/deepfold/data_dir --output_dir ./out
python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir --output_dir ./out
```
What is the output of `python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /app/deepfold/data_dir --output_dir ./out`?
Could you please check whether the database files are properly downloaded and accessible inside the Docker container?
I don't see any files inside the data_dir folder. What I did was simply build the Docker image and run `docker run` without `--gpus`, since I am trying it on my local machine.
I see — please make sure all the database files are properly loaded.
Your local machine should have the following database files:

```
$DOWNLOAD_DIR/                             # Total: ~ 2.62 TB (download: 556 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 120 GB (download: 67 GB)
        mgy_clusters_2022_05.fa
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 238 GB (download: 43 GB)
        mmcif_files/
            # About 199,000 .cif files.
        obsolete.dat
    pdb_seqres/                            # ~ 0.2 GB (download: 0.2 GB)
        pdb_seqres.txt
    small_bfd/                             # ~ 17 GB (download: 9.6 GB)
        bfd-first_non_consensus_sequences.fasta
    uniref30/                              # ~ 206 GB (download: 52.5 GB)
        # 7 files.
    uniprot/                               # ~ 105 GB (download: 53 GB)
        uniprot.fasta
    uniref90/                              # ~ 67 GB (download: 34 GB)
        uniref90.fasta
```

(Check the `download_all_data.sh` script to download these files.)
You can use the `-v` option to mount this folder into your Docker container.
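For example, a minimal sketch of such a mount, assuming the databases live under `/Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir` on the host and the image is tagged `deepfold` (adjust both to your setup):

```bash
# Mount the host database folder into the container at /app/deepfold/data_dir.
# The host path and image tag here are illustrative, not canonical values.
docker run -it \
  -v /Users/rmagesh/GradSchool/Research-Phd/deepfold/data_dir:/app/deepfold/data_dir \
  deepfold /bin/bash
```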
May I know why I need to download these database files onto my machine/server just to get the features of a FASTA file?
The standard pipeline requires an extensive database search to obtain the features of a FASTA file.
Hi again! I am running into an issue where I am not able to install the mpi-jax dependency; the installation keeps failing. Could you please help me with that? Thanks!
Hi @cy3: Could you please respond? I ran the download_all_data.sh script, and these are the folders I have. I believe some of the folders are missing. Could you please take a look? Thanks!
Command executed:

```
python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths ./data/params/model1.npz --data_dir ./data/ --output_dir ./out
```
I ran the above command and it failed with the error below. Please take a look and help me out with this.
```
root@597d3d2c0192:/app/deepfold# python run_from_fasta.py --fasta_paths ./example_data/fasta/aa/1aac_1_A.fasta --model_names model1 --model_paths ./data/params/model1.npz --data_dir ./data/ --output_dir ./out
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --data_dir has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --uniref90_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --mgnify_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --pdb70_database_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --template_mmcif_dir has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --max_template_date has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
/opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:233: UserWarning: Flag --obsolete_pdbs_path has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  mark_flag_as_required(flag_name, flag_values)
I0819 18:26:00.632040 139699742209856 templates.py:869] Using precomputed obsolete pdbs ./data/pdb_mmcif/obsolete.dat.
E0819 18:26:00.635931 139699742209856 hhblits.py:82] Could not find HHBlits database ./data/uniclust30/UniRef30_2020_06/UniRef30_2020_06
Traceback (most recent call last):
  File "run_from_fasta.py", line 280, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_from_fasta.py", line 210, in main
    use_small_bfd=False)
  File "/app/deepfold/deepfold/data/pipeline.py", line 110, in __init__
    databases=[bfd_database_path, uniclust30_database_path])
  File "/app/deepfold/deepfold/data/tools/hhblits.py", line 83, in __init__
    raise ValueError(f'Could not find HHBlits database {database_path}')
ValueError: Could not find HHBlits database ./data/uniclust30/UniRef30_2020_06/UniRef30_2020_06
root@597d3d2c0192:/app/deepfold#
```
@newtonjoo @cy3 Could you please take a look at it and help me with it as soon as possible? Thanks!
Could you verify the subfolder name under data/uniclust30? It may have been updated to a more recent version of the database. If that's the case, please adjust the following line to reflect the correct folder path:
Line to be updated: run_from_fasta.py#L73
Thank you for the response. This is what the uniclust30 subfolder looks like; there is no UniRef30_2020_06/UniRef30_2020_06 in it.
```
(deepfold-2) magesh@lambda-ai:~$ cd /media/exxact1/deepfold/data/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data$ cd uniclust30/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30$ ls -lrt
total 4
drwxr-xr-x 2 root root 4096 Aug 20 15:17 uniclust30_2018_08
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30$ cd uniclust30_2018_08/
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30/uniclust30_2018_08$ ls -lrth
total 87G
-rw------- 1 528745 9100 3.6G Oct 11  2018 uniclust30_2018_08_cs219.ffdata
-rw------- 1 528745 9100 341M Oct 11  2018 uniclust30_2018_08_cs219.ffindex
-rw------- 1 528745 9100  65G Oct 11  2018 uniclust30_2018_08_a3m.ffdata
-rw------- 1 528745 9100 359M Oct 11  2018 uniclust30_2018_08_a3m.ffindex
-rw------- 1 528745 9100  14G Oct 11  2018 uniclust30_2018_08_hhm.ffdata
-rw------- 1 528745 9100 7.8M Oct 11  2018 uniclust30_2018_08_hhm.ffindex
-rw------- 1 528745 9100   19 Oct 11  2018 uniclust30_2018_08.cs219.sizes
-rw------- 1 528745 9100 3.8G Oct 11  2018 uniclust30_2018_08.cs219
-rw------- 1 528745 9100 417M Oct 11  2018 uniclust30_2018_08_a3m_db.index
lrwxrwxrwx 1 528745 9100   29 Oct 11  2018 uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
-rw------- 1 528745 9100 9.0M Oct 11  2018 uniclust30_2018_08_hhm_db.index
lrwxrwxrwx 1 528745 9100   29 Oct 11  2018 uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
-rw------- 1 528745 9100  767 Oct 11  2018 uniclust30_2018_08_md5sum
(deepfold-2) magesh@lambda-ai:/media/exxact1/deepfold/data/uniclust30/uniclust30_2018_08$
```
Changing `UniRef30_2020_06/UniRef30_2020_06` to `uniclust30_2018_08/uniclust30_2018_08` will resolve this issue.
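For reference, a rough sketch of what that one-line change in run_from_fasta.py might look like, assuming the path is set as the default of an absl `flags.DEFINE_string` (the exact flag definition in the repo may differ):

```python
from absl import flags

# Hypothetical before/after of the uniclust30 default in run_from_fasta.py.
# Old default (the path that could not be found):
#   './data/uniclust30/UniRef30_2020_06/UniRef30_2020_06'
flags.DEFINE_string(
    'uniclust30_database_path',
    './data/uniclust30/uniclust30_2018_08/uniclust30_2018_08',  # directory that exists locally
    'Path to the Uniclust30 database used by HHblits.')
```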
Hi @cy3: Thank you for your input; I am now able to get the features.pkl for a given FASTA file. I would like to understand a few more things about the output.
```
root@a88ef311693f:/app/deepfold# cd ./out/
root@a88ef311693f:/app/deepfold/out# ls
prot_00000
root@a88ef311693f:/app/deepfold/out# cd prot_00000/
root@a88ef311693f:/app/deepfold/out/prot_00000# ls
features.pkl  msas  ranked_0.pdb  ranking_debug.json  relaxed_model1.pdb  res_plddt_model1.txt  result_model1.pkl  timings.json  unrelaxed_model1.pdb
root@a88ef311693f:/app/deepfold/out/prot_00000# cd msas/
root@a88ef311693f:/app/deepfold/out/prot_00000/msas# ls
bfd_uniclust_hits.a3m  mgnify_hits.sto  pdb70_hits.hhr  uniref90_hits.sto
root@a88ef311693f:/app/deepfold/out/prot_00000/msas# cd ..
root@a88ef311693f:/app/deepfold/out/prot_00000# python
>>> import pandas as pd
>>> a = pd.read_pickle('features.pkl')
>>> type(a)
<class 'dict'>
>>> a.keys()
dict_keys(['aatype', 'between_segment_residues', 'domain_name', 'residue_index', 'seq_length', 'sequence', 'deletion_matrix_int', 'msa', 'num_alignments', 'template_aatype', 'template_all_atom_masks', 'template_all_atom_positions', 'template_domain_names', 'template_sequence', 'template_sum_probs'])
>>> atom_pos = a['template_all_atom_positions']
>>> atom_pos.shape
(20, 105, 37, 3)
```
In this features.pkl output, I am not sure which key gives the embeddings of the protein. I assume `template_all_atom_positions` is the one that would give me the embeddings. If so, when I check the shape of `template_all_atom_positions` it gives me (20, 105, 37, 3), and I am not sure where this 20 is coming from. Could you please help me with that? Thank you! Much appreciated!
The features.pkl file is used as the input for the model (with 20 being the number of templates). For embeddings, you can use result_model1.pkl.
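As a quick illustration of that shape, a minimal sketch (assuming the same features.pkl as above, read with pandas): the leading axis indexes templates, so the array is (num_templates, num_residues, 37 atom slots per residue, xyz).

```python
import pandas as pd

feats = pd.read_pickle('features.pkl')
pos = feats['template_all_atom_positions']
# (num_templates, num_residues, 37 atom positions per residue, xyz coordinates)
num_templates, num_residues, num_atoms, _ = pos.shape
print(num_templates, num_residues, num_atoms)   # e.g. 20 105 37
print(feats['template_domain_names'][:5])       # names of the first few templates
```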
```
>>> a = pd.read_pickle('result_model1.pkl')
>>> a.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'shape'
>>> type(a)
<class 'dict'>
>>> a.keys()
dict_keys(['distogram', 'experimentally_resolved', 'masked_msa', 'predicted_lddt', 'structure_module', 'plddt'])
```
Which key gives me the representation/embeddings of the protein? Thanks!
In this case, use `data['representations']['structure_module']`, which gives a 384-dimensional vector representation per amino acid.
```
>>> data = pd.read_pickle('result_model1.pkl')
>>> data.keys()
dict_keys(['distogram', 'experimentally_resolved', 'masked_msa', 'predicted_lddt', 'structure_module', 'plddt'])
```

There is no 'representations' key for the embeddings; I can only see the keys above.

```
>>> data['structure_module'].keys()
dict_keys(['final_atom_mask', 'final_atom_positions', 'sidechains'])
```
We have made a slight update to the FASTA run file to return representations as well. I apologize for the mistake.
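After pulling that update and rerunning the model, a minimal sketch of extracting the per-residue embeddings from result_model1.pkl (this assumes the updated output nests them under a 'representations' key holding the structure-module output, as described above; the exact key layout in the repo may differ):

```python
import pandas as pd

data = pd.read_pickle('result_model1.pkl')
# Assumed layout after the update: 'representations' -> 'structure_module',
# one 384-dimensional vector per amino acid.
embeddings = data['representations']['structure_module']
print(embeddings.shape)  # expected: (num_residues, 384), e.g. (105, 384)
```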
Thank you for the response. When I try to run DeepFold, it runs on the CPU instead of my GPU servers.
```
I0822 21:53:06.209507 139760043013952 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
2024-08-22 21:53:06.237648: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_SYSTEM_NOT_READY: system not yet initialized
I0822 21:53:06.238188 139760043013952 xla_bridge.py:353] Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices.
I0822 21:53:06.238405 139760043013952 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter Host CUDA
I0822 21:53:06.238757 139760043013952 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0822 21:53:06.238858 139760043013952 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
W0822 21:53:06.238930 139760043013952 xla_bridge.py:360] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
/app/deepfold/deepfold/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/app/deepfold/deepfold/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/app/deepfold/deepfold/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
2024-08-22 21:57:20.910848: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:65]
********************************
[Compiling module jit_apply_fn] Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
```
Do you have any suggestions as to why it is not picking up my GPUs?
Did you run Docker with GPU support, using a command such as `docker run --gpus all`? What is the result of running `nvidia-smi`? Is the GPU properly recognized?
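For example, a quick check along these lines (the image tag `deepfold` and the mount path are placeholders for your actual setup):

```bash
# Confirm the driver sees the GPUs on the host first.
nvidia-smi

# Then launch the container with GPU access and the database mount,
# and run nvidia-smi again inside it to confirm the GPUs are visible there too.
docker run --gpus all -it \
  -v /media/exxact1/deepfold/data:/app/deepfold/data \
  deepfold /bin/bash -c "nvidia-smi"
```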
Hey @cy3: I was able to fix the GPU issue, and it is now using the GPU. But I have a few questions.
```
I0822 21:26:18.952993 139760043013952 inference_pipeline.py:54] processing file ./example_data/fasta/aa/1aac_1_A.fasta...
I0822 21:26:18.953632 139760043013952 jackhmmer.py:130] Launching subprocess "jackhmmer -o /dev/null -A /tmp/tmpzdixdawj/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./example_data/fasta/aa/1aac_1_A.fasta ./data/uniref90/uniref90.fasta"
I0822 21:26:18.974498 139760043013952 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0822 21:36:51.096411 139760043013952 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 632.121 seconds
```
The above command with `--cpu 8` took ~600 seconds to finish, and when I ran it with 64 CPUs it took about the same time; I am not sure how to improve the performance.
```
I0830 05:49:55.808584 139870974736192 run_from_fasta.py:244] Using random seed 181129 for the data pipeline
I0830 05:49:55.809070 139870974736192 inference_pipeline.py:54] processing file ./example_data/fasta/aa/ASDSF.fasta...
I0830 05:49:55.809468 139870974736192 jackhmmer.py:130] Launching subprocess "jackhmmer -o /dev/null -A /tmp/tmpvljhqrgk/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 64 -N 1 ./example_data/fasta/aa/ASDSF.fasta ./data/uniref90/uniref90.fasta"
I0830 05:49:55.828207 139870974736192 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0830 06:00:16.147896 139870974736192 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 620.319 seconds
```
So the run for the two FASTA files took an hour to finish even with GPUs. Is there any way to improve the performance further?
Is DeepFold CPU-bound? Why are these subprocesses running on the CPU instead of the GPU? @cy3
Jackhmmer, HHblits, and HHsearch are part of the feature-search pipeline for FASTA files. They run on the CPU, and these searches take a lot of time. The process can be improved by generating the features.pkl file on another CPU machine and then providing it to the pipeline.
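A rough sketch of driving many FASTA files through the pipeline on a separate CPU box, assuming one sequence per file under ./example_data/fasta/aa/ and reusing the same run_from_fasta.py invocation shown earlier (if the repo provides a dedicated features-only entry point, substitute it here); each output folder then contains the features.pkl to hand off to the GPU machine:

```bash
# Sequential loop shown for clarity; add '&' plus a job limit (or GNU parallel)
# to run several searches at once on a many-core machine.
for f in ./example_data/fasta/aa/*.fasta; do
  name=$(basename "$f" .fasta)
  python run_from_fasta.py \
    --fasta_paths "$f" \
    --model_names model1 \
    --model_paths ./data/params/model1.npz \
    --data_dir ./data/ \
    --output_dir "./out/$name"
done
```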
@cy3: I have about ~20k proteins to run; how long do you think it might take just to create the features.pkl files? In order to create those files, I need the 5 TB database set up, right?
If the pickle file is ready, the GPU machine doesn't need the entire database (only the PDB files for templates are required). Additionally, use as many CPU cores as possible; with a single process, processing 20k proteins will take approximately 15k hours. Another workaround is to use OpenProteinSet, which provides pre-computed MSAs and templates.
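To put that 15k-hour figure in perspective, a quick back-of-envelope sketch (assuming ~45 minutes of search per protein and near-linear scaling, since each protein is processed independently):

```python
# Rough wall-clock estimate for 20k proteins, assuming ~15,000 single-process
# hours total (~0.75 h per protein) and independent, evenly balanced workers.
TOTAL_SINGLE_PROCESS_HOURS = 15_000
NUM_PROTEINS = 20_000

hours_per_protein = TOTAL_SINGLE_PROCESS_HOURS / NUM_PROTEINS  # ~0.75 h each
for workers in (8, 32, 64, 128):
    wall_clock = TOTAL_SINGLE_PROCESS_HOURS / workers
    print(f"{workers:4d} parallel pipelines -> ~{wall_clock:,.0f} hours (~{wall_clock / 24:.0f} days)")
```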
I am running into "could not find CIFs" error when running
run_from_fasta.py
file and please find the command below.Attached is the screenshot for the same
Command I am running:
python run_from_fasta.py --fasta_paths ./example_data/fasta/1aac_1_A.fasta --model_names model1 --model_paths params/model1.npz --data_dir /data-dir/ --output_dir ./out