liberjul / CONSTAXv2

MIT License

RDP Classifier consistently encounters java.lang.OutOfMemoryError with Eukaryotes #9

Closed szymanskishay closed 1 year ago

szymanskishay commented 1 year ago

I am attempting to assign taxonomy to Eukaryotes and repeatedly encounter what looks like a memory allocation error. I think it might be related to needing the duplicate taxon name fix, but I am unsure. Slight sidebar: I really appreciate that this tool has a built-in way to deal with the duplicate taxa problem. I have already tried editing my CTAX2 environment to raise the default for -Xmx (from 1GB to 2GB), but a source I found online seemed to suggest that it couldn't go much past 2GB, so I am hesitant to raise it further.

Relevant excerpt from the SLURM log below.

Training RDP Classifier
java -Xmx -jar /mnt/research/rdp/public/RDPTools/classifier.jar train -o /mnt/home/shemans6/Cecilia/db///. -s /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_trained.fasta -t /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_taxonomy_trained.txt > rdp_train.out 2>&1
edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file.
crambe genus 2
opercularia genus 2
phialina genus 2
gomphus genus 2
cynodon genus 2
mertensia genus 2
stokesia genus 2
globodera genus 2
alaria genus 2
chondrilla genus 2
rhytisma genus 2
siphonaria genus 2

at edu.msu.cme.rdp.classifier.train.TreeFactory.creatTaxidMap(TreeFactory.java:126)
at edu.msu.cme.rdp.classifier.train.TreeFactory.<init>(TreeFactory.java:61)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:63)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)

RDP training error, redoing with duplicate taxa
python /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/FormatRefDB.py -d /mnt/home/shemans6/Cecilia/db/sh_general_release_dynamic_all_29.11.2022_dev.fasta -t /mnt/home/shemans6/Cecilia/db// -f UNITE -p /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0 --dup
Importing subscripts from /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0


Reformatting database

UNITE format detected

Reference database FASTAs formatted in 2.699999325 seconds...

Training Taxonomy

Duplicate taxa being handled with numerical suffices

Adding Full Lineage

Database formatting complete


java -Xmx -jar /mnt/research/rdp/public/RDPTools/classifier.jar train -o /mnt/home/shemans6/Cecilia/db///. -s /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_trained.fasta -t /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_taxonomy_trained.txt > rdp_train.out 2>&1
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.msu.cme.rdp.classifier.train.RawHierarchyTree.initWordOccurrence(RawHierarchyTree.java:124)
at edu.msu.cme.rdp.classifier.train.TreeFactory.addSequencewithLineage(TreeFactory.java:273)
at edu.msu.cme.rdp.classifier.train.TreeFactory.parseSequenceFile(TreeFactory.java:152)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:65)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)
Command 'bash -c '/mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/constax_no_inputs.sh'' returned non-zero exit status 1.

When rdp_train.out gets written, it contains only "Exception in thread "main"..." and the stack trace lines under it. What is more curious is that the run still produces the RDP-related files (i.e., sh_general_release....RDP_taxonomy.txt, ..._RDP_taxonomy_headers.txt, ..._RDP_taxonomy_trained.txt, ..._RDP_trained.fasta). It does not seem to produce "training_check.txt", which causes a failure slightly further down (below).

CONSTAX2: Improved taxonomic classification of environmental DNA markers
Julian Aaron Liber, Gregory Bonito, Gian Maria Niccolò Benucci
Bioinformatics, Volume 37, Issue 21, 1 November 2021, Pages 3941–3943; doi: https://doi.org/10.1093/bioinformatics/btab347
Cannot classify without existing training files, please specify -t
Command 'bash -c '/mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/constax_no_inputs.sh'' returned non-zero exit status 1.
grep: /mnt/home/shemans6/Cecilia/db///training_check.txt: No such file or directory
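
The grep failure above suggests that classification was attempted even though training never finished. A minimal sketch of a guard that could sit between the two steps in a wrapper script is shown here. The function name `training_is_complete` and the placeholder directory are assumptions made for illustration and are not part of CONSTAX; only the file name training_check.txt comes from the log, and the sketch tests for its presence only, not its contents.

```shell
#!/bin/bash
# Hypothetical guard (not part of CONSTAX): succeed only if the training
# directory contains training_check.txt, the file the failing grep looked for.
training_is_complete() {
    local train_dir="$1"
    # Strip one trailing slash so "dir" and "dir/" are treated the same
    [ -f "${train_dir%/}/training_check.txt" ]
}

# Placeholder path, an assumption for illustration only
train_dir="${TRAIN_DIR:-./trainfiles}"
if training_is_complete "$train_dir"; then
    echo "Training files found in $train_dir"
else
    echo "Training incomplete in $train_dir; inspect rdp_train.out before classifying" >&2
fi
```

Stopping at this point would surface the OutOfMemoryError in rdp_train.out as the first error, rather than the later "Cannot classify without existing training files" message.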

I am trying some fixes out currently but would love to hear anything that could help regardless.

Additional info: I am using the UNITE general eukaryote database release "sh_general_release_dynamic_all_29.11.2022_dev.fasta", and I am running this on the MSU HPCC.
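
The echoed java command in the log shows `-Xmx` with no visible value after it. Whether that reflects the actual invocation or only how the command was printed is not clear from the log. As a hedged sketch, a wrapper could validate the memory amount before building the heap flag. The helper `build_xmx_flag` is hypothetical and is not taken from CONSTAX or RDPTools; it assumes the memory amount is given as a whole number of megabytes.

```shell
#!/bin/bash
# Hypothetical helper (not part of CONSTAX or RDPTools): turn a memory
# amount in megabytes into a JVM heap flag, refusing empty or non-numeric input.
build_xmx_flag() {
    local mem_mb="$1"
    case "$mem_mb" in
        ''|*[!0-9]*)
            echo "error: expected a whole number of MB, got '${mem_mb}'" >&2
            return 1
            ;;
    esac
    if [ "$mem_mb" -le 0 ]; then
        echo "error: memory must be greater than zero" >&2
        return 1
    fi
    echo "-Xmx${mem_mb}m"
}

build_xmx_flag 32000    # prints -Xmx32000m
```

On a 64-bit JVM the heap limit can be set well above 2GB; the roughly 2GB ceiling mentioned in some older advice applies to 32-bit JVMs.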

Gian77 commented 1 year ago

Hello @szymanskishay,

can you tell me how much RAM you are using in the SLURM job you are submitting? You can copy and paste your job script here to help debug. Thanks, G.

szymanskishay commented 1 year ago

Here is the script itself,

#!/bin/bash -login
#SBATCH --time=10:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --job-name 141eukotu

cd ${SLURM_SUBMIT_DIR}
cores=$SLURM_CPUS_PER_TASK
RAM=$SLURM_MEM_PER_NODE

echo -e "Cecilia v.1.0 - usearCh basEd ampliCon pIpeLine for Illumina dAta MIT LICENSE - Copyright © 2022 Gian M.N. Benucci, Ph.D. email: benucci[at]msu[dot]edu\n"

source ../config.yaml

echo -e "\n========== Sub-directories ==========\n"
echo "mkdir -p $project_dir/outputs/14_constax_euk/"; mkdir -p $project_dir/outputs/14_constax_euk/
echo "cd $project_dir/outputs/11_clustered_otu_asv_usearch/"; cd $project_dir/outputs/11_clustered_otu_asv_usearch/

echo -e " \n========== Train reference database ==========\n"
conda activate CTAX2

echo -e "CONSTAX version: $(constax --version)"

if [[ "$train_db" == "yes" ]]; then
    echo -e "Training the reference Database"
    first_file=$(ls $project_dir/outputs/11_clustered_otu_asv_usearch/otus*.fasta | head -1)
    cat $first_file | awk "/^>/ {n++} n>2 {exit} {print}" > $project_dir/outputs/14_constax_euk/train.fasta
    constax \
        --num_threads $cores \
        --mem $RAM \
        --db $fun_db \
        --input $project_dir/outputs/14_constax_euk/train.fasta \
        --train \
        --trainfile $trainfiles_euk/ \
        --output /$project_dir/outputs/14_constax_euk/ \
        --blast

elif [[ "$train_db" == "no" ]]; then
    echo -e "Your reference database is already trained. Skipping!"
fi

echo -e "\n Training files as below:\n $(ls -l $trainfiles_euk/)\n"

for file in otus*.fasta; do
echo -e " \n========== Assigning taxonomy for $file ==========\n"
file_name=$( echo $file | cut -f 1 -d"." )

constax \
    --num_threads $cores \
    --mem $RAM \
    --db $fun_db \
    --trainfile $trainfiles_euk/ \
    --input $file \
    --isolates $bonito_isolates \
    --isolates_query_coverage=90 \
    --isolates_percent_identity=90 \
    --high_level_db $euk_db \
    --high_level_query_coverage=60 \
    --high_level_percent_identity=60 \
    --tax /$project_dir/outputs/14_constax_euk/constax_${file_name}/ \
    --output /$project_dir/outputs/14_constax_euk/constax_${file_name}/ \
    --conf $constax_conf \
    --blast 

done

conda deactivate

echo -e "\n========= Sbatch log =========\n"
echo -e "\n Current directory: $(pwd) \n"
echo -e "\n $(sacct -u $MSUusername -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem) \n"
scontrol show job $SLURM_JOB_ID
mv $project_dir/code/slurm-$SLURM_JOB_ID* $project_dir/slurms/14.1_EukTaxonomy_OTU_constax.slurm

As a side note, "bonito_isolates" is defined in the config file as something else; I just kept the name for ease's sake. fun_db is the fungal-sequences-only release from UNITE, and euk_db is the all-eukaryotes release in this instance.
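
Since the script passes `$SLURM_MEM_PER_NODE` straight through to `--mem`, it can help to print that value in the job log when debugging memory errors. The sketch below assumes the variable holds a whole number of megabytes, which is how SLURM reports it as far as I know. The helper name `describe_memory_mb` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: format a memory amount given in MB for the job log.
# Prints "unknown" when the value is empty or not a whole number.
describe_memory_mb() {
    local mem_mb="$1"
    case "$mem_mb" in
        ''|*[!0-9]*) echo "unknown"; return 0 ;;
    esac
    # 10# forces base 10 so a leading zero is not read as octal
    echo "${mem_mb} MB (~$(( 10#$mem_mb / 1024 )) GB)"
}

# SLURM_MEM_PER_NODE is expected to be set only inside a job that requested memory
echo "Memory passed to constax --mem: $(describe_memory_mb "${SLURM_MEM_PER_NODE:-}")"
```

With `#SBATCH --mem=32G`, this would be expected to report 32768 MB, which makes it easy to compare the request against what the RDP training step needs.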

Gian77 commented 1 year ago

Please give this a try

#SBATCH --cpus-per-task=32
#SBATCH --mem=128G

szymanskishay commented 1 year ago

This appears to have worked! Thank you!

Gian77 commented 1 year ago

Hello @szymanskishay,

yes, you should use a different file for --isolates. Maybe you have some ITS sequences of the contaminants present in the lab where the library was prepared? Or you can use culture isolate sequences (that is the original purpose of the option).

I am working on a solution for your problem and will push a fix to https://github.com/Gian77/Cecilia soon.

Thanks, Gian

liberjul commented 1 year ago

Thanks for your question @szymanskishay, and thanks for addressing it @Gian77. It seems like we've resolved the issue, so I'm closing.