jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Option to continue making database #875

Closed Harpreet525 closed 2 months ago

Harpreet525 commented 2 months ago

Hello

I am trying to build the database from scratch, but the cluster allows only 48 hours of walltime per run. Because of this restriction I am unable to construct the database: downloading the nr database takes a long time, building nr takes even longer, and my run gets killed as soon as I cross the 48-hour limit.

Could you please suggest a solution to this?

Best Regards Harpreet

fpusan commented 2 months ago

Hi,

At the beginning of make_databases.pl there are some global variables that control which steps of database generation will be run. You can set the ones prior to DOWNLOAD_NR to 0 instead of 1, so the already-completed steps are skipped and the next run saves that time (you will have to re-run the script with the same target location, of course).
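For illustration, the edited flag block might look like this on a resumed run (every variable name other than DOWNLOAD_NR is a hypothetical placeholder; check the actual script for the real names):

my $DOWNLOAD_TAXONOMY = 0;  # hypothetical earlier step, already completed: skip
my $DOWNLOAD_SILVA    = 0;  # hypothetical earlier step, already completed: skip
my $DOWNLOAD_NR       = 1;  # resume from the nr download/build step

Then re-run make_databases.pl pointing at the same target directory as in the first attempt.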

If you already downloaded and decompressed the fasta file for the nr database, there should be an nr.fasta file in your database directory. If this is the case, you can also edit the file /path/to/SQM/environment/SqueezeMeta/lib/install_utils/make_nr_db_2020.pl and comment out the following lines before re-running make_databases.pl.

#-- Getting the raw files from NCBI. This can take long and need 100 Gb disk space
my $command="wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz -P $databasedir";
system $command;

#-- Format the database
my $command = "gunzip $databasedir/nr.gz && mv $databasedir/nr $fastadb";
my $ecode = system $command;
if($ecode!=0) { die "Error running command:     $command\n\nThis probably means that your download got interrupted, or that you ran out of disk space"; }
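A sketch of an alternative to commenting the lines out: guard each block with a file-existence test, so re-runs skip steps whose output is already in place (this reuses the $databasedir and $fastadb variables from the excerpt above):

#-- Download only if neither the archive nor the final fasta is already present
if(!-e "$databasedir/nr.gz" and !-e $fastadb) {
        my $command = "wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz -P $databasedir";
        my $ecode = system $command;
        if($ecode != 0) { die "Error running command: $command\n"; }
}

#-- Decompress and rename only if the fasta is not already there
if(!-e $fastadb) {
        my $command = "gunzip $databasedir/nr.gz && mv $databasedir/nr $fastadb";
        my $ecode = system $command;
        if($ecode != 0) { die "Error running command: $command\n"; }
}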

This will avoid re-downloading the fasta file and go straight into building the DIAMOND database. If you still run out of walltime after trying this, then I don't think I can help you and you will need to talk to your administrators. Some of the final steps may hang or take too long if your filesystem has high latency; I usually run make_databases.pl on an SSD to avoid this.

Best of luck!

Harpreet525 commented 2 months ago

Thank you, I will try this method. I was also trying to replace wget with aria2c to speed up the download with multiple connections.
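A minimal sketch of that substitution in make_nr_db_2020.pl (-x and -s set the maximum connections per server and the number of download splits, -d replaces wget's -P to set the output directory; the value 16 is just an example):

my $command = "aria2c -x 16 -s 16 ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz -d $databasedir";
system $command;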

I saw that some of the databases are from 2021. Is it possible to download and build the latest versions?

Best Regards

fpusan commented 2 months ago

No. Some of them (KEGG) have no publicly available newer versions. In other cases (eggNOG) newer versions exist, but we would need to update some of our scripts to parse them, and we haven't got there yet.

Harpreet525 commented 2 months ago

Ok. I have managed to make the database: I replaced all wget calls with aria2c, which sped up construction significantly, and used the global flags to build nr separately. But now I am confused by the coassembly mode of the tool. When I ran the SqueezeMeta.pl script it recognised 22 metagenomes for coassembly mode, but instead of keeping the fastq files separate per sample it is concatenating all 22 samples into a single pair1 and pair2. This is not what I want, as my 22 samples are very different and come from different time points. Did I do something wrong? It did correctly find the 22 samples, as shown below:

22 metagenomes found: SNOW_W_S1 ICE_W_S2 SOIL_W_S3 LAKE_WATER_W_S4 LAKE_ICE_U10_W_S5 LAKE_ICE_D30_W_S6 LAKE_SNOW_W_S7 LAKE_SLUSH_W_S8 FJORD_WATER_W_S9 SNOW_GVB_W_S10 BLANK_W_S11 SNOW_SURF_S_S12 SNOW_MID_S_S13 SNOW_BASE_S_S14 SNOW_SURF5_S_S15 SNOW_MID6_S_S16 SNOW_BASE7_S_S17 ICE_S_S18 SOIL_S_S19 ICE_F_S20 SOIL_F_S21 BLANK_S_S22
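For context, SqueezeMeta takes these sample names from the tab-separated samples file passed with -s; a minimal sketch, truncated to the first two samples (the fastq file names are illustrative):

SNOW_W_S1	SNOW_W_S1_R1.fastq.gz	pair1
SNOW_W_S1	SNOW_W_S1_R2.fastq.gz	pair2
ICE_W_S2	ICE_W_S2_R1.fastq.gz	pair1
ICE_W_S2	ICE_W_S2_R2.fastq.gz	pair2

In coassembly mode the reads from every sample are pooled into one assembly (hence the single concatenated pair1/pair2), and per-sample information comes from mapping each sample's reads back to that shared assembly afterwards.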

fpusan commented 2 months ago

If your samples are that different, then you should not run them together in a coassembly. Either separate your samples into groups and coassemble each group separately, or just run SqueezeMeta once per sample. Closing this issue since you managed to make the database (congrats, btw!). Feel free to open a different issue if you have other questions.
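A minimal sketch of the per-sample option, assuming SqueezeMeta's sequential mode (paths and file names are placeholders; in this mode each sample listed in the samples file should get its own independent assembly and annotation):

SqueezeMeta.pl -m sequential -s samples.tsv -f /path/to/fastq_dir -t 16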