josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

DIAMOND search step segfault #6

Closed: glarue closed this issue 1 year ago

glarue commented 1 year ago

First, thanks for the software! I've run a similar (less thorough) analysis "by hand" in the past, and it's exciting to have this sort of pipeline made more accessible.

I'm trying to run GenEra following the installation setup described in the documentation, using the RefSeq human proteome vs a local version of the nr db. I'm running on a small server with 32GB RAM (could be part of the issue?), Ubuntu 22.10, DIAMOND v2.1.6, GenEra v1.1.0.

The first step runs for ~6 hrs, and produces the following error:

/home/glarue/miniconda3/envs/genEra/bin/genEra: line 701: 131901 Segmentation fault      (core dumped) diamond blastp --${SENSITIVITY} --query ${QUERY_FASTA} --db ${NR_DB} --outfmt 6 qseqid sseqid evalue bitscore staxids --evalue ${EVALUE} --max-target-seqs 0 --threads ${THREADS} --out ${TMP_PATH}/${NCBITAX}_Diamond_prefiltered_results.bout --quiet ${DIAMONDOPTS}

  ERROR: DIAMOND didn't run properly, verify that the database was built correctly
  Exiting

The _results.bout file seems to have been created successfully (~500 MB), as is an .abc file, but the _prefiltered_results.bout file is empty. My first thought was that this could be a memory issue, but the search itself seems to have completed successfully so I'm not sure that makes sense.
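To help narrow down whether memory or disk is the culprit, a few quick checks can be run before retrying; these are generic Linux commands, not part of genEra, and an OOM kill by the kernel can surface as a segfault-like crash even though the search appeared to run:

```shell
# Quick environment sanity checks before re-running the pipeline:
# available RAM, free disk on the working directory, and any sign that
# the kernel OOM killer terminated the diamond process.
free -h || true                   # available memory (Linux)
df -h .                           # free disk space on the working dir
dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail -5 || true
```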

Any help is appreciated—thanks!

josuebarrera commented 1 year ago

Dear Graham,

Thank you for your interest in using GenEra! Judging by the prompted error, I suspect that you might have run out of either RAM or storage space on your hard drive. GenEra first aligns the query proteome against itself, and this information is stored in _results.bout, which explains the small size of the file. But the DIAMOND run against the NR definitely collapsed, since _prefiltered_results.bout is empty. The human RefSeq proteome is highly redundant, containing 136,193 sequences, so given the limited computational resources, I suggest you remove all the redundant sequences (i.e., isoforms) from the query proteome. May I ask for your command so I can try to replicate your error?

Best, Josué.

glarue commented 1 year ago

Thanks for the quick response, Josué. I suspect you're right that it's a memory issue, although it should be noted that I'm running on an in-house-generated human proteome (based on the RefSeq annotations) that includes only the longest isoform of each gene (~23k sequences).

I realize that if this is a memory issue it's not really something for you to troubleshoot, as it would be more of a DIAMOND issue. I'm confused, however, about why DIAMOND hits this issue, given that I believe its default memory limit is 16 GB? I assume intermediate results that exceed that limit can be cached to disk, so even for a large DB I would imagine the search could still be run in memory-constrained environments. It's definitely possible (perhaps likely!) that I'm missing something about the way the search is performed.
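For reference, DIAMOND does expose knobs to cap its memory footprint. Below is a sketch of re-running the failing step standalone with an explicit cap; the paths mirror this thread, but the sensitivity mode, e-value, and limit values are illustrative assumptions rather than genEra's verified defaults (the command is built into an array and echoed for inspection; run "${CMD[@]}" directly to execute it):

```shell
# Standalone re-run of the failing DIAMOND step with an explicit memory cap.
# --memory-limit (in GB) and a smaller --block-size trade speed for lower
# peak RAM; the values here are illustrative, not tuned recommendations.
CMD=(diamond blastp --ultra-sensitive
     --query hsapiens_ba0.pro.fa
     --db /mnt/data/GenEra_data/nr
     --outfmt 6 qseqid sseqid evalue bitscore staxids
     --evalue 1e-5 --max-target-seqs 0 --threads 9
     --memory-limit 24 --block-size 0.5
     --out 9606_Diamond_prefiltered_results.bout --quiet)
echo "${CMD[@]}"   # inspect first; replace echo with "${CMD[@]}" to run
```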

For what it's worth, here's the command:

genEra -t 9606 -q hsapiens_ba0.pro.fa -b /mnt/data/GenEra_data/nr -d /mnt/data/GenEra_data/taxdump -i -n 9 -o genEra_out

and here is the query file I used.

josuebarrera commented 1 year ago

Dear Graham,

I ran the command that you sent me with your query file. First I tried running GenEra with the same resources you used (32 GB RAM and 9 threads); the pipeline also crashed during the DIAMOND run, as you described. Apparently there is an option to limit the amount of memory that DIAMOND uses, but it crashed regardless. I then tried running it with more resources (200 GB RAM and 20 threads) and it finished smoothly in ~31 hours. So it is definitely a problem of DIAMOND asking for more RAM. A possible solution could be to use a smaller database than the NR (e.g., RefSeq) if running on a cluster with limited memory. Since I already ran the pipeline, I can also send you the results of the analysis.

Best, Josué.

glarue commented 1 year ago

Hi @josuebarrera, thanks so much for looking into this, and especially for sending along the results.

During troubleshooting I've attempted to run the DIAMOND step directly on its own; interestingly, while it still fails, it takes much longer to do so (>24 hrs vs. <6 hrs when run via GenEra). I've opened an issue in the DIAMOND repo to clarify why it's failing, and will update here with anything relevant.

Regarding your suggestion of running against a smaller database, I was hoping you could expound a bit on the consequences of that as far as downstream analysis goes. My sense is, if I understand the method correctly, that running the query against a smaller database (e.g., RefSeq) would make the output more "coarse-grained", in that there would be fewer datapoints to use to inform the age assignment of each gene in the query set.

Is it fair to say the results would in general be more conservative, in that for a given gene, a more distant hit might be found were the search performed against a larger database (i.e., nr)? If so, is it true that the relative rank-order of gene ages within the query set should be largely consistent when running on a small or large database? Most of what I care about for this analysis is relative rather than absolute age assignments, but I want to be sure that I understand the caveats involved.

josuebarrera commented 1 year ago

Dear @glarue,

These are the main differences when running GenEra using smaller databases:

1) Some taxonomically relevant sequences from the NR may be missing from smaller databases, which may lead to underestimating the age of some genes. This may be particularly true for taxonomically restricted genes at young taxonomic levels, where some potential homologs might be present in the NR but not in smaller databases.

2) The NR is known to contain a considerable number of misclassified sequences, so using a more curated database will likely improve the correct age assignment of genes with a low taxonomic representativeness score. This is particularly true for very old genes (e.g., Eukaryota-level genes), where it is difficult to assess whether the presence of bacterial hits is attributable to common ancestry or to database contamination.

3) Running times are faster when using smaller databases, both during the DIAMOND search and during the gene age assignment step (given that the DIAMOND results table is usually smaller).

Other than that, the results are expected to be largely consistent between databases, since the general gene age trends we obtained using GenEra are consistent with the trends found by Tautz & Domazet-Lošo in 2011 using a much smaller database. Overall, I could argue that a slightly smaller but curated database will be better for accurate gene age estimation of older genes, while evolutionarily young genes might miss some potential homologs that are present in the NR but not in curated databases. I hope this lengthy answer helps you make the right choice for your research purposes. Best, Josué.
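One practical note if switching databases: genEra's DIAMOND step requests the staxids output field, so a replacement database has to be built taxonomy-aware. Below is a hedged sketch using standard `diamond makedb` options; the input file names are placeholders standing in for a curated protein FASTA and NCBI's accession2taxid/taxdump files (the command is echoed for inspection rather than executed):

```shell
# Build a taxonomy-aware DIAMOND database from a smaller, curated protein
# set so the staxids output field still resolves. File names below are
# placeholders; the flags are standard `diamond makedb` options.
CMD=(diamond makedb
     --in refseq_protein.faa.gz              # placeholder: curated FASTA
     --db refseq                             # output database prefix
     --taxonmap prot.accession2taxid.FULL.gz # NCBI accession-to-taxid map
     --taxonnodes nodes.dmp                  # from NCBI taxdump
     --taxonnames names.dmp)
echo "${CMD[@]}"   # inspect; run with "${CMD[@]}" once the files are in place
```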

glarue commented 1 year ago

@josuebarrera Thanks for the thorough response - sounds like smaller databases will fit the bill for my purposes.