Problem with setting up Diamond databases

josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.

GNU General Public License v3.0

Problem with setting up Diamond databases #13

Closed. CWYuan08 closed this issue 11 months ago.

CWYuan08 commented 1 year ago

Hi, I am stuck at this step:

diamond makedb --in nr --db nr --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --memory-limit 100

I got the error:
Error: Option is not permitted for this workflow: memory-limit

But if I run it without --memory-limit:

diamond makedb --in nr --db nr --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp

I got the error:

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable):
UniRef prefix 0
gi|xxx| prefix 0
xxx| prefix 31111
|xxx suffix 31111
.xxx suffix 918334653
:PDB= suffix 0

Loading taxonomy names... [1.24s]
Loaded taxonomy names for 2514621 taxon ids.
Loading taxonomy mapping file...
Error opening file prot.accession2taxid: No such file or directory

Any advice on this? Many thanks!

Best, CW

josuebarrera commented 1 year ago

Dear @CWYuan08,

I find it odd that the --memory-limit option is rejected; please check that you are using DIAMOND version 2.0 or higher. Regarding your second error, make sure you download prot.accession2taxid before generating the DIAMOND database:

wget ftp://ftp.ncbi.nih.gov:21/pub/taxonomy/accession2taxid/prot.accession2taxid.gz && gunzip prot.accession2taxid.gz

And then specify the path to that file in --taxonmap. Please let me know if that worked for you or if you have any further questions.

Best regards, Josué.
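
The two checks suggested above (DIAMOND version and presence of the input files) can be sketched as a small preflight script. This is only an illustration: the function names `diamond_major` and `missing_inputs` are hypothetical, and the parsing assumes that `diamond version` prints a line of the form `diamond version X.Y.Z`.

```shell
#!/usr/bin/env bash
# Preflight sketch. Function names are hypothetical; file names are the
# ones used in the makedb command from this thread.

# Extract the major version from a string such as "diamond version 2.1.8"
# (assumed output format of `diamond version`).
diamond_major() {
    local version_string="$1"
    local version="${version_string##* }"   # keep the last space-separated field
    echo "${version%%.*}"                    # keep everything before the first dot
}

# Print every path that does not exist; return non-zero if any is missing.
missing_inputs() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage (not executed here):
#   major=$(diamond_major "$(diamond version)")
#   [ "$major" -ge 2 ] || echo "DIAMOND is older than 2.0"
#   missing_inputs nr prot.accession2taxid taxdump/nodes.dmp taxdump/names.dmp
```

Running `missing_inputs` before `diamond makedb` would have reported `prot.accession2taxid` up front, instead of failing after the taxonomy names had already been loaded.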

CWYuan08 commented 1 year ago

Dear @josuebarrera, yes, this worked. I am also wondering about the genEra runtime. I am running it on our model species, which is a fish. Is it normal for it to take days to run? Thank you very much! Best, CW

josuebarrera commented 1 year ago

Dear @CWYuan08, The running time of GenEra largely depends on the number of threads you use to run the pipeline and the number of genes in the query species. A fungal genome with ~6,000 genes might take a few hours with 10 CPUs, while a plant genome with >40,000 genes might take a few days with 30 CPUs. We're currently planning to reduce computational times so GenEra can scale smoothly with large genomes. Best, Josué.
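
Since the runtime estimate above depends on the number of query genes, it can help to count the sequences in the query FASTA before submitting a job. A minimal sketch, assuming one protein per `>` header line; the function name `count_proteins` is hypothetical:

```shell
#!/usr/bin/env bash
# Count the sequences in a FASTA file by counting header lines that
# start with ">". The function name is hypothetical.
count_proteins() {
    grep -c '^>' "$1"
}

# Example usage (not executed here), with the query file from this thread:
#   count_proteins /reference_genomes/ref.pep.all.fa
```

The result can then be compared with the rough figures quoted above (~6,000 genes versus >40,000 genes) to judge how many CPUs to request.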

CWYuan08 commented 1 year ago

Dear @josuebarrera,

thank you! I submitted a genEra job on our cluster on 2nd Aug. The job is running, but the temp files have not changed since 2nd Aug. Is it still running?

Best, CW
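
One way to check whether temp files are still being written is to count the files modified within a recent time window. The sketch below is an assumption-laden illustration: the function name `recently_modified` is hypothetical, and it relies on the `-mmin` option of `find`, which GNU and BSD `find` provide but POSIX does not require. A count of zero is only a hint that the job may have stalled, not proof.

```shell
#!/usr/bin/env bash
# Count regular files under a directory that were modified less than a
# given number of minutes ago. The function name is hypothetical.
recently_modified() {
    local dir="$1" minutes="$2"
    # tr strips the padding that some wc implementations add.
    find "$dir" -type f -mmin "-$minutes" | wc -l | tr -d ' '
}

# Example usage (not executed here); replace TMP_DIR with the temporary
# directory that GenEra reports at the start of its log:
#   recently_modified TMP_DIR 60
```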

josuebarrera commented 1 year ago

Dear @CWYuan08, Ok, that doesn't sound right. Could you please send me the log file (output from stdout) of your GenEra run so I can take a look at the problem? Best, Josué.

CWYuan08 commented 1 year ago

Dear @josuebarrera,

thank you! The log file is stuck at:

genEra v1.2.0
(C) Max Planck Society for the Advancement of Science
Starting time of run: Tue 1 Aug 23:56:44 BST 2023

Your temporary files will be stored in /tmp_8154_24579

STARTING STEP 1: SEARCHING FOR HOMOLOGS WITHIN THE DATABASE USING DIAMOND

Matching the query genes against themselves

Searching for homologs against the DIAMOND database

and the temp directory looks like:

Best, CW

josuebarrera commented 1 year ago

Dear @CWYuan08, The software seems to work correctly, but it is taking an unreasonable amount of time to retrieve the homologs of your query genes in the NR. Could you send me the genEra command that you used, so I can try to replicate your problem? Best, Josué.

CWYuan08 commented 1 year ago

Dear @josuebarrera,

thank you very much. I am using the very basic command:

genEra -q /reference_genomes/ref.pep.all.fa -t 8154 -b nr -d taxdump

Best, CW

josuebarrera commented 1 year ago

Dear @CWYuan08, I'll try to replicate your issue with the command you gave me and let you know the results. But from the start, I suspect the reason for the delay is that your organism has 52,718 predicted proteins. I would suggest you run GenEra with more CPUs, if possible, using the argument -n (~40 CPUs would be ideal). Best, Josué.

josuebarrera commented 1 year ago

Dear @CWYuan08,

The analysis just finished running on our cluster. It ran from 08/09/2023 at 22:45:12 to 08/11/2023 at 15:32:50 using 40 CPUs. I don't think GenEra would have needed more than one additional day with the default of 20 CPUs, so the difference in CPU count cannot explain your problem.

Could it be that you didn't allocate the CPUs on your machine before running GenEra? I suspect the pipeline is trying to run with 20 CPUs but the queue system of your machine restricts it to just 1 CPU. I recommend that you set the number of CPUs directly with -n and also verify that your queue system actually allocates that many CPUs to the job, which might fix your problem.

Since I already ran the analysis, I can also send you the results. You can download them here.

Best, Josué.
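
The mismatch described above, where the pipeline asks for more CPUs than the job is allowed to use, can be guarded against by capping the thread count at what the job actually has. A minimal sketch: the function name `usable_threads` is hypothetical, and it assumes that `nproc` (GNU coreutils) reflects the CPUs granted by the queue system, which depends on how the cluster enforces allocations.

```shell
#!/usr/bin/env bash
# Return the smaller of the requested and the available CPU counts.
# The function name is hypothetical.
usable_threads() {
    local requested="$1" available="$2"
    if [ "$requested" -le "$available" ]; then
        echo "$requested"
    else
        echo "$available"
    fi
}

# Example usage inside a job script (not executed here):
#   threads=$(usable_threads 40 "$(nproc)")
#   echo "Running GenEra with $threads threads"
#   genEra -q /reference_genomes/ref.pep.all.fa -t 8154 -b nr -d taxdump -n "$threads"
```

If the echoed value is 1 when 40 were requested, the job script is probably not reserving the CPUs from the scheduler.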

CWYuan08 commented 1 year ago

Dear @josuebarrera, thank you very much! My job is still running now (at step 3). Thanks for sharing the results! Best, CW

josuebarrera commented 11 months ago

Dear @CWYuan08 ,

I apologize for taking more than a month to reply. We just released GenEra v1.4.0, which runs MUCH faster in step 3 of the pipeline, in case you are still waiting for your results. You can download the latest version here and give it a try!

We hope that this new implementation can help you get your results much faster. Please, let us know if you run into any more problems while running GenEra.

Cheers, Josué.