kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

database paths #3

Closed ryao-mdanderson closed 3 years ago

ryao-mdanderson commented 3 years ago

Hello author, in run_alphafold.sh, in the database path section, should these lines:

pdb70_database_path="$data_dir/pdb70/pdb70"
uniclust30_database_path="$data_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08"

instead be:

pdb70_database_path="$data_dir/pdb70"
uniclust30_database_path="$data_dir/uniclust30/uniclust30_2018_08"

Thank you! Rong

sanjaysrikakulam commented 3 years ago

Hi @ryao-mdanderson,

Are you sure? During my test run I did not get any errors. run_docker.py from AF2 points to the same paths, and everything seems to work.

Code from run_docker.py:

# Path to the Uniclust30 database for use by HHblits.
uniclust30_database_path = os.path.join(
    DOWNLOAD_DIR, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')

# Path to the PDB70 database for use by HHsearch.
pdb70_database_path = os.path.join(DOWNLOAD_DIR, 'pdb70', 'pdb70')

ryao-mdanderson commented 3 years ago

Hi @sanjaysrikakulam 👍 I haven't tested the non-docker code on our HPC cluster yet. While reviewing run_alphafold.sh, I noticed the database paths. For example, I don't have $data_dir/pdb70/pdb70 in my download directory, only $data_dir/pdb70, so I was confused and am checking.

Thank you! Rong

sanjaysrikakulam commented 3 years ago

Hi @ryao-mdanderson,

Please let me know if you get an error or find that something is not working when you test it. Note that the bash script follows AF2's run_docker.py.

dldereklee commented 3 years ago

I'm a little confused as well. It seems to match what is in run_docker.py, but I get this error when running run_alphafold.sh:

ValueError: Could not find HHBlits database /reference/AlphaFold/uniclust30/uniclust30_2018_08/uniclust30_2018_08

When I check the download, there doesn't seem to be a uniclust30_2018_08 directory:

 > ls /reference/AlphaFold/uniclust30
uniclust30_2018_08_a3m_db.index  uniclust30_2018_08.cs219          uniclust30_2018_08.cs219.sizes   uniclust30_2018_08_hhm.ffindex
uniclust30_2018_08_a3m.ffdata    uniclust30_2018_08_cs219.ffdata   uniclust30_2018_08_hhm_db.index  uniclust30_2018_08_md5sum
uniclust30_2018_08_a3m.ffindex   uniclust30_2018_08_cs219.ffindex  uniclust30_2018_08_hhm.ffdata
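One likely explanation, offered as a sketch rather than a confirmed diagnosis: HH-suite database paths are file prefixes, not directories. The value .../uniclust30_2018_08/uniclust30_2018_08 means "files named uniclust30_2018_08_* inside the directory uniclust30_2018_08/". In the listing above the files sit directly in uniclust30/, so that prefix matches nothing. The helper below (hhsuite_db_exists is a hypothetical name, not part of AlphaFold) illustrates such a prefix check with a glob; the temporary paths are purely illustrative.

```python
import glob
import os
import tempfile


def hhsuite_db_exists(prefix: str) -> bool:
    """Return True if at least one file named '<prefix>_*' exists."""
    # glob.escape guards against special characters in the directory names.
    return bool(glob.glob(glob.escape(prefix) + "_*"))


# Reproduce the flat layout from the listing in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    flat_dir = os.path.join(root, "uniclust30")
    os.makedirs(flat_dir)
    # One database file placed directly under uniclust30/ (no subdirectory).
    open(os.path.join(flat_dir, "uniclust30_2018_08_a3m.ffdata"), "w").close()

    # Prefix that run_alphafold.sh expects: <dir>/uniclust30_2018_08/<prefix>
    expected_prefix = os.path.join(
        flat_dir, "uniclust30_2018_08", "uniclust30_2018_08")
    # Prefix that matches the flat layout: <dir>/<prefix>
    flat_prefix = os.path.join(flat_dir, "uniclust30_2018_08")

    found_with_expected = hhsuite_db_exists(expected_prefix)
    found_with_flat = hhsuite_db_exists(flat_prefix)
    print("expected prefix found:", found_with_expected)
    print("flat prefix found:", found_with_flat)
```

If the check fails only for the nested prefix, either move the files into a uniclust30_2018_08/ subdirectory or point uniclust30_database_path at $data_dir/uniclust30/uniclust30_2018_08.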

sanjaysrikakulam commented 3 years ago

Hi @dldereklee

This is AF2's directory structure:

$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 438 GB)
    bfd/                                   # ~ 1.7 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters_2018_12.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    small_bfd/                             # ~ 17 GB (download: 9.6 GB)
        bfd-first_non_consensus_sequences.fasta
    uniclust30/                            # ~ 86 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 58 GB (download: 29.7 GB)
        uniref90.fasta

I am not sure how you downloaded your data or why it has a different directory structure. If your layout does not match AF2's, you can update the paths in the bash script (run_alphafold.sh).
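A quick way to compare a download against the tree above is to list the entries that are missing. This is a minimal sketch, not part of AF2 or this repository: EXPECTED_ENTRIES and missing_entries are hypothetical names, and the entries are copied from the tree for the full-database layout, so adjust them if you use the small BFD instead.

```python
import os
import tempfile

# Relative entries taken from the directory tree above (full BFD layout).
EXPECTED_ENTRIES = [
    "bfd",
    os.path.join("mgnify", "mgy_clusters_2018_12.fa"),
    "params",
    "pdb70",
    os.path.join("pdb_mmcif", "mmcif_files"),
    os.path.join("pdb_mmcif", "obsolete.dat"),
    os.path.join("uniclust30", "uniclust30_2018_08"),
    os.path.join("uniref90", "uniref90.fasta"),
]


def missing_entries(data_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under data_dir."""
    return [rel for rel in expected
            if not os.path.exists(os.path.join(data_dir, rel))]


# Demonstration on a throwaway directory with a flat uniclust30/ folder.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "pdb70"))
    os.makedirs(os.path.join(data_dir, "uniclust30"))  # no nested subdirectory
    missing = missing_entries(data_dir)
    for rel in missing:
        print("missing:", rel)
```

Pointing missing_entries at your real download directory shows at a glance whether the uniclust30_2018_08/ subdirectory, or anything else, is absent.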

ryao-mdanderson commented 3 years ago

Hi @sanjaysrikakulam

The download structure is really helpful. I just realized I don't have the small_bfd directory downloaded.

Thanks!

ryao-mdanderson commented 3 years ago

Hi @sanjaysrikakulam 👍

I reviewed the scripts directory, which has all the download scripts. download_all_data.sh has no code to download the small_bfd directory. May I ask how you downloaded this directory? How can I get bfd-first_non_consensus_sequences.fasta?

Thanks!

sanjaysrikakulam commented 3 years ago

Hi @ryao-mdanderson

It looks like download_all_data.sh has a conditional download:

if [[ "${DOWNLOAD_MODE}" = full_dbs ]] ; then
  echo "Downloading BFD..."
  bash "${SCRIPT_DIR}/download_bfd.sh" "${DOWNLOAD_DIR}"
else
  echo "Downloading Small BFD..."
  bash "${SCRIPT_DIR}/download_small_bfd.sh" "${DOWNLOAD_DIR}"
fi

I manually downloaded all the data using wget and rsync. I did not use AF2 download scripts.

ryao-mdanderson commented 3 years ago

@sanjaysrikakulam Thank you very much, I see. I cloned the repository on July 19, so my copy of scripts/download_all_data.sh does not have this if-else condition, and there is no download_small_bfd.sh. I will re-clone to get the new version.

sanjaysrikakulam commented 3 years ago

@ryao-mdanderson

I think you don't need the small BFD if you download the full BFD database.
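Put differently, each download mode relies on exactly one of the two BFD directories. The lookup below is only an illustration: required_bfd_dir and BFD_DIR_BY_MODE are hypothetical names, "full_dbs" comes from the script snippet above, and "reduced_dbs" is assumed here as the name of the other mode.

```python
import os

# Directory each download mode relies on. 'full_dbs' appears in the script
# above; 'reduced_dbs' is an assumed name for the small-BFD mode.
BFD_DIR_BY_MODE = {
    "full_dbs": "bfd",
    "reduced_dbs": "small_bfd",
}


def required_bfd_dir(download_dir, mode="full_dbs"):
    """Return the BFD directory that the given download mode relies on."""
    if mode not in BFD_DIR_BY_MODE:
        raise ValueError(f"Unknown download mode: {mode!r}")
    return os.path.join(download_dir, BFD_DIR_BY_MODE[mode])


print(required_bfd_dir("./alphafold_data", "full_dbs"))
print(required_bfd_dir("./alphafold_data", "reduced_dbs"))
```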

ryao-mdanderson commented 3 years ago

@sanjaysrikakulam I am sorry to bother you again.

Example run (Uses the GPU with index id 0 as default)

bash run_alphafold.sh -d ./alphafold_data/ -o ./dummy_test/ -m model_1 -f ./example/query.fasta -t 2020-05-14

May I know what alphafold_data (the -d flag) refers to in this example? Thanks!

sanjaysrikakulam commented 3 years ago

Hi @ryao-mdanderson

It's the download directory that contains all the databases AF2 requires.