soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

Taxtocontig Error at Prefilter Step #26

Closed ys117vt closed 3 years ago

ys117vt commented 3 years ago

prefilter temp_tax/14471945088901788939/preds /work/cascades/.../database/tax/MMETSP_zenodo_3247846_uniclust90_2018_08_seed_valid_taxids temp_tax/14471945088901788939/tmp_taxonomy/4208998901951402961/tmp_hsp1/9481182838681733712/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 24 --compressed 0 -v 3 -s 4.0

Query database size: 142988 type: Aminoacid
Target split mode. Searching through 2 splits
Estimated memory consumption: 149G
Target database size: 88022300 type: Aminoacid
Process prefiltering step 1 of 2

Index table k-mer threshold: 141 at k-mer size 7
Index table: counting k-mers
[==============================================================
temp_tax/14471945088901788939/tmp_taxonomy/4208998901951402961/tmp_hsp1/9481182838681733712/blastp.sh: line 99: 16219 Bus error $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"
Error: Prefilter died
Error: First search died
Error: taxonomy died

Is this related to not having enough memory? Thank you!

Yang
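One quick way to test the memory hypothesis is to compare the estimate printed in the log with what the node reports as available. The sketch below is only illustrative: the helper names are made up, it assumes a Linux node with a `MemAvailable` field in `/proc/meminfo`, and it treats the log's `G` as roughly GiB.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of MetaEuk or MMseqs2).

# Read log text on stdin and print the number from
# "Estimated memory consumption: 149G".
estimated_gib() {
    sed -n 's/.*Estimated memory consumption: \([0-9][0-9]*\)G.*/\1/p' | head -n 1
}

# Print MemAvailable in whole GiB from a meminfo-style file
# (defaults to /proc/meminfo, where the value is given in kB).
available_gib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed if the estimate (arg 1) fits into the available memory (arg 2).
fits_in_memory() {
    [ "$1" -le "$2" ]
}

# Log text taken from the output above.
log_line='Target split mode. Searching through 2 splits Estimated memory consumption: 149G Target database size: 88022300 type: Aminoacid'
need=$(printf '%s\n' "$log_line" | estimated_gib)
echo "estimated: ${need}G"
```

On the compute node itself, `fits_in_memory "$need" "$(available_gib)"` would then show whether the estimate exceeds what is currently free.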

ys117vt commented 3 years ago

Assuming the error comes from this step in blastp.sh:

    if notExists "$TMP_PATH/pref_$STEP.dbtype"; then
        # shellcheck disable=SC2086
        $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS" \
            || fail "Prefilter died"
    fi

Is there anything I could try to avoid this kind of error? Thanks!

Yang

ys117vt commented 3 years ago

Hey @milot-mirdita @elileka

Any thoughts on this? Thanks!

elileka commented 3 years ago

Sorry for not responding. We are a bit perplexed by the bus error. It can be many things... Are there any more details you can provide us? Can you generate much smaller datasets (especially for the reference database) and see if they run through? You could use the databases command with UniProtKB/Swiss-Prot, for example, to get a small reference db. If something smaller runs through, it may indicate some resource limitation (memory, disk, etc.).
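As a rough sketch of that suggestion, the snippet below only assembles and prints the download command so it can be checked before running. It assumes MMseqs2 is installed as `mmseqs` (MetaEuk may expose the same `databases` module, but that is an assumption here), and the output and temporary paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the command that downloads a small test reference database.
# Usage pattern assumed: mmseqs databases <name> <outputDB> <tmpDir>
build_databases_cmd() {
    local name="$1" out_db="$2" tmp_dir="$3"
    printf 'mmseqs databases %s %s %s\n' "$name" "$out_db" "$tmp_dir"
}

# Hypothetical output location and temporary folder.
cmd=$(build_databases_cmd "UniProtKB/Swiss-Prot" "refs/swissprot" "tmp_download")
echo "$cmd"
```

The printed line can be pasted into the shell once the paths look right; using a fresh, empty temporary folder avoids picking up leftovers from a failed run.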

ys117vt commented 3 years ago

Thank you @elileka. I tried the UniProtKB/Swiss-Prot reference database and it worked. Thank you again for your direction!

Yang

milot-mirdita commented 3 years ago

You probably should not actually use UniProtKB/Swiss-Prot for taxonomic annotation. Its small size is very convenient for testing, but it's highly biased towards the most studied organisms.

I am still quite confused how the bus error can happen. Are you running multiple jobs on the same machine that are competing for RAM?

ys117vt commented 3 years ago

Thanks for the reply @milot-mirdita. I actually tried taxtocontig with the UniProtKB reference database and it died again with a similar bus error. I am using an external research supercomputer to run the code, so I assumed it would have enough memory. But it looks very likely that the issue is with memory...

I am now trying to run it as a batch job on a specified node to see if I can get past the index table building step. Thanks!

elileka commented 3 years ago

Do you roughly know the taxonomic group of your contigs? (Is it, for example, a sample of some algae?) If so, perhaps we could assist with constructing a leaner reference database for the exact taxonomic annotation.

ys117vt commented 3 years ago

Hi @elileka, my samples are drinking water metagenomic samples. I used Kraken/Bracken to annotate them but would like to know more about the eukaryotes in my samples, as Kraken/Bracken didn't give me much information on drinking water amoebae.

elileka commented 3 years ago

If you have a subset of contigs you are certain are eukaryotic, you could try to annotate them against a eukaryote-only reference database (or even only amoebae or any other clade, if it makes sense). This would save the resources "wasted" on the prokaryotic part of the reference database and might make the run feasible on a more limited machine. To do so, you will need to filter your taxonomic reference database as detailed here. Any valid NCBI TAXID can be used for filtering.
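For illustration, one way such a filter is commonly expressed with MMseqs2 is the `filtertaxseqdb` module and its `--taxon-list` option. The sketch below only prints the command; the database names are hypothetical, and whether this matches the linked instructions exactly is an assumption. 2759 is the NCBI TAXID for Eukaryota.

```shell
#!/usr/bin/env bash
# Sketch: build a command that keeps only sequences under one NCBI taxon.
# This helper accepts a single numeric TAXID for simplicity.
build_filter_cmd() {
    local in_db="$1" out_db="$2" taxid="$3"
    case "$taxid" in
        ''|*[!0-9]*) echo "taxid must be numeric: $taxid" >&2; return 1 ;;
    esac
    printf 'mmseqs filtertaxseqdb %s %s --taxon-list %s\n' "$in_db" "$out_db" "$taxid"
}

# Hypothetical database names; 2759 = Eukaryota.
build_filter_cmd "refTaxDB" "refTaxDB_euk" 2759
```

A narrower clade would use its own TAXID in place of 2759; the smaller the filtered database, the less memory the index table should need.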

ys117vt commented 3 years ago

Thank you Eli! @elileka @milot-mirdita Yeah, the submitted batch job stopped with the same bus error. I will work with our computation service team to figure out a solution. What's the recommended memory/RAM for this kind of job? Maybe I need to request multiple nodes to run this. Finally, I will try reducing the reference database and give it another try. Thank you!
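The prefilter call in the log above was run with `--split-memory-limit 0`, which to my understanding lets MMseqs2 size its splits from all memory it detects on the machine. When a batch job is granted less than the whole node, passing an explicit limit may help. The sketch below derives such a value; the 80% headroom factor is an arbitrary illustrative choice, not a MetaEuk recommendation, and it is an assumption that taxtocontig forwards this flag to the prefilter.

```shell
#!/usr/bin/env bash
# Sketch: turn the memory granted to a job (in whole GiB) into a value
# for --split-memory-limit, leaving some headroom for everything else.
split_limit_from_gib() {
    local granted_gib="$1"
    local limit=$(( granted_gib * 80 / 100 ))   # 80% is an arbitrary choice
    [ "$limit" -ge 1 ] || limit=1               # never go below 1G
    printf '%sG\n' "$limit"
}

# Example: a job that was granted 128 GiB.
limit=$(split_limit_from_gib 128)
echo "--split-memory-limit $limit"
```

A lower limit means more splits and a longer run time, so this trades speed for staying inside the allocation.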

ys117vt commented 3 years ago

Hi @milot-mirdita @elileka, I was able to run with the UniProtKB reference database after getting an extended CPU/memory allocation on our remote cluster. I think it is all good now. Thank you again for your help!

Yang