ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation

VADR exits due to limited memory, but doesn't report this to the user #48

Closed taltman closed 2 years ago

taltman commented 2 years ago

When processing a Coronavirus genome, I encountered the following error. Any hints?

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.3 (Aug 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Dec 14 07:10:37 2021
# $VADRBIOEASELDIR:  /root/vadr/Bio-Easel
# $VADRBLASTDIR:     /root/vadr/ncbi-blast/bin
# $VADREASELDIR:     /root/vadr/infernal/binaries
# $VADRINFERNALDIR:  /root/vadr/infernal/binaries
# $VADRMODELDIR:     /root/vadr/vadr-models-calici
# $VADRSCRIPTSDIR:   /root/vadr/vadr
#
# sequence file:                                                                  /darth/outputs/MalbNV/transeq/canonical.fna
# output directory:                                                               /darth/outputs/MalbNV/SRR10402291
# force directory overwrite:                                                      yes [-f]
# leaving intermediate files on disk:                                             yes [--keep]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  corona [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /root/data/vadr-models-corona-1.3-3 [--mdir]
# set max allowed memory for cmalign to <n> Mb:                                   64000 [--mxsize]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.1 seconds]
# Classifying sequences (1 seq)                                                           ... done. [  151.2 seconds]
# Determining sequence coverage (NC_045512: 1 seq)                                        ... done. [   18.6 seconds]
# Aligning sequences (NC_045512: 1 seq)                                                   ... Killed

ERROR in vdr_CmalignCheckStdOutput, cmalign /darth/outputs/MalbNV/SRR10402291/SRR10402291.vadr.NC_045512.align.r1.s0.stdout exists but is empty

taltman commented 2 years ago

Here are some example inputs that trigger this issue for me (GitHub won't accept a file extension of ".fna", so I've tacked on a ".txt" as well): MalbNV.fna.txt SilNV.fna.txt

nawrockie commented 2 years ago

It looks like you've hit a memory limit of some kind, based on the 'Killed' in the output. For coronavirus, 64G of RAM is recommended. One way around this is to add the --glsearch option, which reduces the memory requirement considerably. Using -s together with --glsearch reduces it even further and results in much faster processing. I recommend you try that and see how it goes. We use -s and --glsearch for SARS-CoV-2.
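For readers who script their runs, here is a minimal sketch of how the -s and --glsearch suggestion could be added to the original invocation. The helper `build_vannotate_cmd` is hypothetical and not part of VADR. It uses only options that appear in this thread (-f, --mkey, --mdir, -s, --glsearch), and it assumes the positional arguments are the sequence file followed by the output directory, matching the order in the log header. The officially recommended option sets are on the VADR wiki and may include more options than shown here.

```python
import shlex


def build_vannotate_cmd(seq_file, out_dir, mkey, mdir, low_memory=True):
    """Assemble a v-annotate.pl argument list (hypothetical helper, not part of VADR)."""
    # -f, --mkey and --mdir are taken from the log header in this thread.
    cmd = ["v-annotate.pl", "-f", "--mkey", mkey, "--mdir", mdir]
    if low_memory:
        # The two options suggested in this thread to reduce memory use.
        cmd += ["-s", "--glsearch"]
    # Assumed positional order: sequence file, then output directory.
    cmd += [seq_file, out_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_vannotate_cmd(
        "/darth/outputs/MalbNV/transeq/canonical.fna",
        "/darth/outputs/MalbNV/SRR10402291",
        mkey="corona",
        mdir="/root/data/vadr-models-corona-1.3-3",
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

The sketch only prints the command, so it can be inspected before anything is executed.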

taltman commented 2 years ago

Argh, I forgot to set up my swap space before running the container. That explains the memory issue. Thanks for pointing that out.
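A pre-flight check along these lines might catch a missing swap setup before a long run. This is only a sketch under stated assumptions: it is Linux-only and reads /proc/meminfo; it counts RAM plus swap against the 64G figure mentioned above, which is my reading rather than an official rule; and the function names are made up for illustration. Inside a container, /proc/meminfo may report the host's memory rather than the container's limit, so treat the result as a hint.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def total_ram_and_swap_gib(text):
    """Return MemTotal + SwapTotal in GiB (missing fields count as zero)."""
    info = parse_meminfo(text)
    kib = info.get("MemTotal", 0) + info.get("SwapTotal", 0)
    return kib / (1024 ** 2)


def has_enough_memory(text, required_gib=64):
    """True if RAM plus swap reaches the assumed requirement (64 GiB by default)."""
    return total_ram_and_swap_gib(text) >= required_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            meminfo = fh.read()
    except OSError:
        print("Could not read /proc/meminfo (not a Linux system?)")
    else:
        print(f"RAM + swap: {total_ram_and_swap_gib(meminfo):.1f} GiB")
        if not has_enough_memory(meminfo):
            print("Warning: less than 64 GiB available; the run may be killed.")
```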

Renaming this issue to focus on the need for VADR to clearly communicate to the user that there is a memory limitation issue.

I think the docs could be more explicit about the expected memory requirements for running VADR in different ways. The error messages should also clearly state that there was insufficient memory for the process to continue.

nawrockie commented 2 years ago

OK, thanks for the suggestion. I will look into updating the error message.

I do still suggest trying to run with -s --glsearch if you continue to have problems with memory. Recommended options for SARS-CoV-2 are in step 4 after following this link:

https://github.com/ncbi/vadr/wiki/Coronavirus-annotation#how-to-annotate-sars-cov-2-sequences-with-vadr-v13-or-later-version-1

And how to modify those command-line options to use the 'corona' set of models that you are using is explained here:

https://github.com/ncbi/vadr/wiki/Coronavirus-annotation#-identifying-and-annotating-coronaviridae-sequences-other-than-sars-cov-2-using-a-larger-vadr-model-library

nawrockie commented 2 years ago

VADR 1.4.1 outputs a more informative error message in this situation by appending:

v-annotate.pl may have run out of available memory, especially if you see a 'Killed' message.
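As a general illustration of how a parent process can notice that a child was killed, here is a small Python sketch. It is not how VADR implements its message (VADR is written in Perl), and the function name `run_and_explain` is made up. It is POSIX-only, and a SIGKILL does not prove an out-of-memory condition, which is why the hint is worded as a possibility, much like the VADR message.

```python
import signal
import subprocess
import sys


def run_and_explain(cmd):
    """Run cmd and return (returncode, hint); hint is None unless the child was SIGKILLed."""
    proc = subprocess.run(cmd)
    hint = None
    # On POSIX, subprocess reports death by signal as a negative return code.
    # If a shell sits in between, it reports 128 + signal number instead (137).
    if proc.returncode in (-signal.SIGKILL, 128 + signal.SIGKILL):
        hint = ("process was killed (SIGKILL); "
                "it may have run out of available memory")
    return proc.returncode, hint


if __name__ == "__main__":
    # Demo: a child process that sends SIGKILL to itself.
    demo = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    code, hint = run_and_explain(demo)
    print(code, hint)
```

In the run reported in this issue, the killed process was cmalign, a child of v-annotate.pl, so a check like this would belong in whatever launches cmalign rather than in a wrapper around v-annotate.pl.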