Closed · marymcelroy closed this issue 1 year ago
Hi @marymcelroy
The answer depends on the partitions/resources available on your HPC. The more cores you assign, the faster your job will complete.
For example, on the Zorbas HPC I would normally use the batch partition (1 node, 20 cores). If I ran a job for, e.g., 96 samples of 16S rRNA with the #SBATCH specifications and the parameters I normally use, it would take more or less a day. (You will need to run 3 separate jobs, since you have 3 genes and thus 3 different parameters files; see the directory sketch below.)
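In practice, that could mean three analysis directories, one per gene, each holding its own `mydata` folder and parameters file. This is just an illustrative layout, with hypothetical directory names (and the parameters file name may differ in your PEMA version):

```
COI_analysis/
├── mydata/           # the COI fastq files
└── parameters.tsv    # COI-specific parameters (custom ref db, etc.)
16S_analysis/
├── mydata/
└── parameters.tsv
18S_analysis/
├── mydata/
└── parameters.tsv
```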
My shell script would be something like:
```bash
module purge                    # unloads all previous loads
module load singularity/3.7.1   # loads singularity
# bind the directory holding your mydata folder and parameters file to /mnt/analysis
singularity run -B /home1/christina/the_directory_where_the_mydata_folder_and_the_parameters_are/:/mnt/analysis /home1/christina/pema_v.2.1.4.sif
module unload singularity/3.7.1 # unloads singularity
```
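For the #SBATCH part that would sit at the top of such a script, a minimal sketch could look like the following; the partition name, memory, and time limit are assumptions you would adapt to your own cluster:

```bash
#!/bin/bash
#SBATCH --job-name=pema_16S    # one job per gene, e.g. pema_16S, pema_18S, pema_COI
#SBATCH --partition=batch      # use whatever partition your HPC offers; "batch" is Zorbas-specific
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20     # match this to the number of cores/threads you ask PEMA to use
#SBATCH --mem=64G              # assumed value; adjust to your nodes
#SBATCH --time=3-00:00:00      # assumed 3-day limit
```

You would then submit each script with `sbatch`, once per gene.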
Hi @marymcelroy.
@cpavloud is right. Just a few more comments from my side:
You could also build the image as a Singularity `sandbox`, meaning once you build your sandbox you can try to run the commands of step 2 in there and build your new ref db under the `/tools/RDPTools/TRAIN/` path on the image. If you can do this, then you won't have to train the classifier again and again every time you run your analysis; instead, you will be able to set the `custom_ref_db` parameter to `No` and give the name of your db through the `name_of_custom_db` parameter.
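As a rough sketch of the sandbox route using standard Singularity commands (the paths and names here are just examples):

```bash
# build a writable sandbox directory from the PEMA image
singularity build --sandbox pema_sandbox/ pema_v.2.1.4.sif

# open a writable shell inside it and run the training commands of step 2 there
# (depending on your setup, sudo or --fakeroot may be needed for write access)
singularity shell --writable pema_sandbox/

# inside the shell, the trained db should end up under /tools/RDPTools/TRAIN/
```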
If this note is making things more confusing, please ignore it! In any case, if I had to guess, I would say you wouldn't need more than 3 days for any of your analyses, no matter what parameters you ask for. A step that can really take some time if you have a great number of OTUs/ASVs is `getNCBITaxId`; I suggest setting it to `No`, at least until you have your first results.
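In the parameters file, that would translate to something like the lines below; the parameter names are as discussed above, while the db name and the exact file layout are just an assumed sketch:

```
custom_ref_db       No
name_of_custom_db   my_coi_ref_db
getNCBITaxId        No
```

Once your first results look sane, you could flip `getNCBITaxId` to `Yes` if you need the NCBI taxon ids.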
Good luck! :100:
Thank you both very much for the suggestions!
@marymcelroy I am now closing this issue; in case you'd like to give us any feedback, please feel free to open a new discussion.
Thanks again for your interest in PEMA.
Hi Haris! I'd like to use PEMA on some metabarcoding data for my graduate work. I successfully installed the Singularity image on my university's HPC environment, but I was hoping for some advice about how to estimate the HPC resources I would need to run PEMA in my job script (we use Slurm). Specifically, do you have any guidance for the #SBATCH specifications and values I should use?
For context, I have 96 samples that were PE sequenced for COI, 18S, and 16S amplicons (euk eDNA metabarcoding) on an Illumina MiSeq, so I have 576 fastq files as my raw sequencing data. I would like to use a custom ref db for COI, so I will follow your instructions about training the RDP classifier (I know this will likely affect computational load). Thank you!