Closed szymanskishay closed 1 year ago
Hello @szymanskishay,
can you tell me how much RAM are you using in the SLURM
job you are submitting?
You can copy and paste your job script here so I can better debug.
Thanks,
G.
Here is the script itself:
cd ${SLURM_SUBMIT_DIR}
cores=$SLURM_CPUS_PER_TASK
RAM=$SLURM_MEM_PER_NODE
echo -e "Cecilia v.1.0 - usearCh basEd ampliCon pIpeLine for Illumina dAta MIT LICENSE - Copyright © 2022 Gian M.N. Benucci, Ph.D. email: benucci[at]msu[dot]edu\n"
source ../config.yaml
echo -e "\n========== Sub-directories ==========\n"
echo "mkdir -p $project_dir/outputs/14_constax_euk/"; mkdir -p $project_dir/outputs/14_constax_euk/
echo "cd $project_dir/outputs/11_clustered_otu_asv_usearch/"; cd $project_dir/outputs/11_clustered_otu_asv_usearch/
echo -e "\n========== Train reference database ==========\n"
conda activate CTAX2
constax --version
"if [[ "$train_db" == "yes" ]]; then echo -e "Training the reference Database" first_file=$(ls $project_dir/outputs/11_clustered_otu_asv_usearch/otus*.fasta | head -1) cat $first_file | awk "/^>/ {n++} n>2 {exit} {print}" > $project_dir/outputs/14_constax_euk/train.fasta constax \ --num_threads $cores \ --mem $RAM \ --db $fun_db \ --input $project_dir/outputs/14_constax_euk/train.fasta \ --train \ --trainfile $trainfiles_euk/ \ --output /$project_dir/outputs/14_constax_euk/ \ --blast
elif [[ "$train_db" == "no" ]]; then echo -e "You reference database is already trained. Skipping!" fi
echo -e "\n Training files as below:\n$(ls -l $trainfiles_euk/)\n"
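As a side note for readers, the `awk '/^>/ {n++} n>2 {exit} {print}'` one-liner above keeps only the first two FASTA records: it counts headers and exits as soon as a third one appears. A minimal, self-contained demo with a hypothetical three-record file:

```shell
# Make a throwaway three-record FASTA (hypothetical sequences).
printf '>seq1\nACGT\n>seq2\nTTGG\n>seq3\nCCAA\n' > demo.fasta

# Count headers (n) and stop once the third header is seen, so only
# the first two records are printed.
awk '/^>/ {n++} n>2 {exit} {print}' demo.fasta > train_demo.fasta

grep -c '^>' train_demo.fasta   # 2 records survive
```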
for file in otus*.fasta; do
    echo -e " \n========== Assigning taxonomy for $file ==========\n"
    file_name=$( echo $file | cut -f 1 -d"." )
constax \
--num_threads $cores \
--mem $RAM \
--db $fun_db \
--trainfile $trainfiles_euk/ \
--input $file \
--isolates $bonito_isolates \
--isolates_query_coverage=90 \
--isolates_percent_identity=90 \
--high_level_db $euk_db \
--high_level_query_coverage=60 \
--high_level_percent_identity=60 \
--tax /$project_dir/outputs/14_constax_euk/constax_${file_name}/ \
--output /$project_dir/outputs/14_constax_euk/constax_${file_name}/ \
--conf $constax_conf \
--blast
done
conda deactivate
echo -e "\n========= Sbatch log =========\n"
echo -e "\n Current directory: $(pwd)\n"
echo -e "\n sacct -u $MSUusername -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem \n"
scontrol show job $SLURM_JOB_ID
mv $project_dir/code/slurm-$SLURM_JOB_ID* $project_dir/slurms/14.1_EukTaxonomy_OTU_constax.slurm
As a side note, the "bonito_isolates" variable is defined in the config file to point to something else; I just kept the name for ease's sake. fun_db is the fungal-sequences-only release from UNITE, and euk_db is the all-eukaryotes release in this instance.
Please give this a try
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
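For anyone landing here later, a hedged sketch of why this fixes it: SLURM exports the requested per-node memory to the job environment, and the script already forwards it to constax (assuming, as the script above does, that `$SLURM_MEM_PER_NODE` is the value `--mem` should receive):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G

# SLURM exposes the requested memory (in MB) as SLURM_MEM_PER_NODE and
# the CPU count as SLURM_CPUS_PER_TASK; the script forwards both to
# constax via --mem and --num_threads.
cores=$SLURM_CPUS_PER_TASK
RAM=$SLURM_MEM_PER_NODE
```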
This appears to have worked! Thank you!
Hello @szymanskishay,
yes, you should use a different file for the --isolates option. Maybe you have some ITS sequences of the contaminants present in the lab where the library was prepared? Or you can use culture isolate sequences (that is the original purpose of it).
I am working on a solution for your problem and will soon come up with a new fix in https://github.com/Gian77/Cecilia.
Thanks, Gian
Thanks for your question @szymanskishay, and thanks for addressing it @Gian77. It seems like we've resolved the issue, so I'm closing.
I am attempting to assign taxonomy to Eukaryotes and repeatedly encounter an error that, from what I can tell, relates to memory allotment. I think it might be related to needing to use the duplicate taxon name fix, but I am unsure. Slight sidebar, but I really appreciate that this pipeline has a built-in way to deal with the duplicate-taxa problem. I have already tried going into my CTAX2 environment to change the default -Xmx to allow more memory (from 1GB to 2GB), but a source I found online seemed to suggest that it couldn't go much past 2GB, so I am hesitant to alter it further.
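One alternative I am considering instead of hand-editing the default inside the conda environment: the HotSpot JVM reads the standard `_JAVA_OPTIONS` environment variable, so the heap ceiling could be raised for every java process launched from the job without touching the wrapper (whether the RDP wrapper honors it, I have not verified; 32g is just an illustrative value):

```shell
# _JAVA_OPTIONS is picked up by any HotSpot JVM started from this
# shell; -Xmx sets the maximum heap. 32g here is an arbitrary example.
export _JAVA_OPTIONS="-Xmx32g"
echo "$_JAVA_OPTIONS"
```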
Relevant excerpt from the slurm below.
Training RDP Classifier
java -Xmx -jar /mnt/research/rdp/public/RDPTools/classifier.jar train -o /mnt/home/shemans6/Cecilia/db///. -s /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_trained.fasta -t /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_taxonomy_trained.txt > rdp_train.out 2>&1
edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file.
crambe genus 2
opercularia genus 2
phialina genus 2
gomphus genus 2
cynodon genus 2
mertensia genus 2
stokesia genus 2
globodera genus 2
alaria genus 2
chondrilla genus 2
rhytisma genus 2
siphonaria genus 2
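For what it's worth, duplicate (name, rank) pairs like the ones reported here can be spotted ahead of time. A quick check, assuming the asterisk-delimited layout RDP taxonomy files use (id*name*parent_id*depth*rank — an assumption about the file format on my part):

```shell
# Tiny demo taxonomy in the assumed RDP asterisk-delimited layout:
# id*name*parent_id*depth*rank. Two entries collide on (name, rank)
# once case is folded.
printf '1*Alaria*0*6*genus\n2*alaria*0*6*genus\n3*Crambe*0*6*genus\n' > tax_demo.txt

# Lowercase the name, pair it with the rank, and print pairs seen
# more than once.
awk -F'*' '{print tolower($2), $5}' tax_demo.txt | sort | uniq -d
```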
RDP training error, redoing with duplicate taxa
python /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/FormatRefDB.py -d /mnt/home/shemans6/Cecilia/db/sh_general_release_dynamic_all_29.11.2022_dev.fasta -t /mnt/home/shemans6/Cecilia/db// -f UNITE -p /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0 --dup
Importing subscripts from /mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0
Reformatting database
UNITE format detected
Reference database FASTAs formatted in 2.699999325 seconds...
Database formatting complete
java -Xmx -jar /mnt/research/rdp/public/RDPTools/classifier.jar train -o /mnt/home/shemans6/Cecilia/db///. -s /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_trained.fasta -t /mnt/home/shemans6/Cecilia/db///sh_general_release_dynamic_all_29.11.2022_dev__RDP_taxonomy_trained.txt > rdp_train.out 2>&1
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at edu.msu.cme.rdp.classifier.train.RawHierarchyTree.initWordOccurrence(RawHierarchyTree.java:124)
	at edu.msu.cme.rdp.classifier.train.TreeFactory.addSequencewithLineage(TreeFactory.java:273)
	at edu.msu.cme.rdp.classifier.train.TreeFactory.parseSequenceFile(TreeFactory.java:152)
	at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:65)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)
Command 'bash -c '/mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/constax_no_inputs.sh'' returned non-zero exit status 1.
When rdp_train.out gets written, it contains just 'Exception in thread "main"...' and the lines indented under it. What is more curious is that it still makes the RDP-relevant files (i.e., sh_general_release..._RDP_taxonomy.txt, ..._RDP_taxonomy_headers.txt, ..._RDP_taxonomy_trained.txt, ..._RDP_trained.fasta). It seems not to make "training_check.txt", which causes a failure slightly further down (below).
CONSTAX2: Improved taxonomic classification of environmental DNA markers
Julian Aaron Liber, Gregory Bonito, Gian Maria Niccolò Benucci
Bioinformatics, Volume 37, Issue 21, 1 November 2021, Pages 3941–3943; doi: https://doi.org/10.1093/bioinformatics/btab347
Cannot classify without existing training files, please specify -t
Command 'bash -c '/mnt/home/shemans6/anaconda3/envs/CTAX2/opt/constax-2.0.18-0/constax_no_inputs.sh'' returned non-zero exit status 1.
grep: /mnt/home/shemans6/Cecilia/db///training_check.txt: No such file or directory
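Since classification bails out when training_check.txt is absent, a small guard before the classification loop could catch this earlier (assumption on my part: constax writes training_check.txt into the trainfile directory only after training succeeds):

```shell
# Guard: report whether the training sentinel file (assumed to be
# written only on successful training) exists in the given directory.
check_trained() {
  if [ -f "$1/training_check.txt" ]; then
    echo "training complete"
  else
    echo "training incomplete: re-run constax --train"
  fi
}

check_trained "$trainfiles_euk"
```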
I am trying some fixes out currently but would love to hear anything that could help regardless.
Additional info: I am using the UNITE general eukaryotic database release "sh_general_release_dynamic_all_29.11.2022_dev.fasta", and I am running this on the MSU HPCC.