saccharis / SACCHARIS_2

CLI and GUI based bioinformatics pipeline to automate phylogenetic analysis of CAZyme families in FASTA sequences.
GNU General Public License v3.0

Databases downloaded to home directory #3

Open sivankij opened 9 months ago

sivankij commented 9 months ago

Hi,

Thank you for updating Saccharis! I've used the older perl version and found the tool useful and fun. Now, when trying to run SACCHARIS_2 after successfully installing it with conda, I noticed that the ncbi and cazy databases are downloaded to my home directory. I think this behavior stems from the AdvancedConfig.py file where you call ~ :

import os

home_dir = os.path.expanduser('~')
folder_saccharis_user = os.path.join(home_dir, "saccharis")
folder_config = os.path.join(folder_saccharis_user, "config")
default_settings_path = os.path.join(folder_config, "advanced_settings.json")
folder_db = os.path.join(folder_saccharis_user, "db")
folder_logs = os.path.join(folder_saccharis_user, "logs")
folder_default_output = os.path.join(folder_saccharis_user, "output")

def get_db_folder():
    return folder_db

def get_log_folder():
    return folder_logs

def get_output_folder():
    return folder_default_output

def get_config_folder():
    return folder_config

The cluster we work on in the lab has strict storage limits for the home directory. Is it possible to specify where the databases are saved?

Thanks! Sivan

AlexSCFraser commented 9 months ago

Hi Sivan,

I haven't yet added the ability to specify a custom database directory (or a custom location for the rest of the config directory), but I think it's a good idea and I can incorporate it into the next update. I don't have an exact date for that update because I am working on multiple projects, but it will likely be between a couple of weeks and a month from now.
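To make the idea concrete, here is a minimal sketch of how such an override might look, based on the AdvancedConfig.py excerpt quoted above. The environment variable name `SACCHARIS_HOME` and the `environ` parameter are hypothetical and only for illustration; SACCHARIS does not read this variable as of this issue, and the eventual implementation may look different.

```python
import os

# Hypothetical variable name, used here only for illustration.
ENV_VAR = "SACCHARIS_HOME"


def get_saccharis_root(environ=None):
    """Return the base folder, preferring the override when it is set."""
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override:
        return os.path.abspath(os.path.expanduser(override))
    # Fall back to the current behaviour: ~/saccharis
    return os.path.join(os.path.expanduser("~"), "saccharis")


def get_db_folder(environ=None):
    """Return the db folder under whichever base folder is in effect."""
    return os.path.join(get_saccharis_root(environ), "db")
```

Under these assumptions, exporting `SACCHARIS_HOME=/project/group-id/username/saccharis` before running the pipeline would move the db folder, and the other folders could follow the same pattern, while leaving the variable unset keeps the existing default.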

For now, one workaround is to download the database files manually to a location with more storage, then create a symlink for the db folder in the expected home directory location that points to the real location. A symlink is essentially a filesystem-level shortcut: programs treat it the same as a real file or folder, but it takes up practically no space. I actually did this on a computing cluster I run SACCHARIS on for the same reason (limited home directory space), and didn't think to make the db directory user configurable.

Instructions for manually downloading the database files and creating a symlink on Linux:

Since you are installing on a cluster, I am assuming familiarity with the terminal.

Make sure the BLAST+, HMMER, and DIAMOND tools are installed and available on the command line (on a cluster, you may need to load them first).

To manually download the database files, the terminal command below should suffice. It is copied from the dbCAN instructions, with the file names updated to the latest versions. All you need to do is navigate to the desired folder before running it.

test -d db || mkdir db
cd db \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/fam-substrate-mapping-08252022.tsv \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/PUL.faa && makeblastdb -in PUL.faa -dbtype prot \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.xlsx \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.txt \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm && hmmpress dbCAN_sub.hmm \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V12/CAZyDB.07262023.fa && diamond makedb --in CAZyDB.07262023.fa -d CAZy \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V12/dbCAN-HMMdb-V12.txt && mv dbCAN-HMMdb-V12.txt dbCAN.txt && hmmpress dbCAN.txt \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V12/tcdb.fa && diamond makedb --in tcdb.fa -d tcdb \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V12/tf-1.hmm && hmmpress tf-1.hmm \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V12/tf-2.hmm && hmmpress tf-2.hmm \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V12/stp.hmm && hmmpress stp.hmm

Then, to create a symlink to the folder, run the link command: ln -s /path/to/db ~/saccharis/db

On my cluster environment this looked something like the following, where project is the large storage volume (your filesystem layout might vary): ln -s /project/group-id/username/saccharis/db ~/saccharis/db

This should create a symlink, in the home directory subfolder where SACCHARIS looks for the db files, that points to the location on your main storage volume on the cluster. Assuming the files downloaded and formatted correctly and the symlink was created properly, the filesystem automatically redirects SACCHARIS to the other location when it tries to load the files.
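If you would rather script this step, the same link can be created from Python. This is only a sketch: the function name `link_db_folder` is made up for illustration and is not part of SACCHARIS. It also guards against one pitfall of the plain command: if ~/saccharis/db already exists as a real directory, `ln -s` places the new link inside that directory instead of replacing it.

```python
import os


def link_db_folder(real_db, link_path):
    """Create link_path as a symlink to real_db and return the resolved target."""
    real_db = os.path.abspath(os.path.expanduser(real_db))
    link_path = os.path.abspath(os.path.expanduser(link_path))

    if not os.path.isdir(real_db):
        raise FileNotFoundError(f"database folder not found: {real_db}")

    if os.path.islink(link_path):
        # Replace a stale link left over from an earlier attempt.
        os.remove(link_path)
    elif os.path.exists(link_path):
        # A real file or folder is in the way; move or delete it by hand first.
        raise FileExistsError(f"{link_path} exists and is not a symlink")

    # Make sure the parent folder (for example ~/saccharis) exists.
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    os.symlink(real_db, link_path, target_is_directory=True)
    return os.path.realpath(link_path)
```

For the layout described above, the call would be something like `link_db_folder("/project/group-id/username/saccharis/db", "~/saccharis/db")`, with the first path adjusted to your own storage volume.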

Note that the ln command creates hard links by default; the "-s" (or alternatively "--symbolic") flag is needed to create symbolic links. Hard links cannot span filesystems and generally cannot point to directories, and a computing cluster typically uses a distributed filesystem across multiple nodes, so I strongly recommend symbolic links here because hard links may not work correctly.
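To double-check the result, a short script can report whether the path really is a symbolic link and where it resolves. Again this is a hedged sketch with a made-up function name (`describe_db_link`), not something SACCHARIS provides. It also compares device IDs, which shows whether the link crosses filesystems, the case a hard link cannot handle.

```python
import os


def describe_db_link(link_path):
    """Report whether link_path is a symlink and where it points."""
    link_path = os.path.abspath(os.path.expanduser(link_path))
    target = os.path.realpath(link_path)
    parent = os.path.dirname(link_path)
    target_exists = os.path.isdir(target)

    crosses_filesystem = False
    if target_exists and os.path.isdir(parent):
        # Different device IDs mean the link and its target live on
        # different filesystems, which a hard link cannot span.
        crosses_filesystem = os.stat(parent).st_dev != os.stat(target).st_dev

    return {
        "is_symlink": os.path.islink(link_path),
        "target": target,
        "target_exists": target_exists,
        "crosses_filesystem": crosses_filesystem,
    }


if __name__ == "__main__":
    print(describe_db_link("~/saccharis/db"))
```

If `is_symlink` is false or `target_exists` is false, the link was not created as intended and SACCHARIS will not find the database files at that path.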