Closed · marymcelroy closed this issue 1 year ago
Hi @marymcelroy
The answer depends on the partitions/resources available on your HPC. The more cores you assign, the faster your job will complete.
For example, on the Zorbas HPC I would normally use the batch partition (1 node, 20 cores). If I ran a job for, e.g., 96 samples of 16S rRNA with the #SBATCH specifications and the parameters I normally use, it would take more or less a day. (You will need to run 3 separate jobs, since you have 3 genes and thus 3 different parameters files; see the directory sketch below.)
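In practice, that could mean three analysis directories, one per gene, each holding its own `mydata` folder and parameters file. This is just an illustrative layout, with hypothetical directory names (and the parameters file name may differ in your PEMA version):

```
COI_analysis/
├── mydata/           # the COI fastq files
└── parameters.tsv    # COI-specific parameters (custom ref db, etc.)
16S_analysis/
├── mydata/
└── parameters.tsv
18S_analysis/
├── mydata/
└── parameters.tsv
```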
My shell script would be something like:
```bash
module purge                    # unloads all previous loads
module load singularity/3.7.1   # loads singularity
# bind the directory holding your mydata folder and parameters file to /mnt/analysis
singularity run -B /home1/christina/the_directory_where_the_mydata_folder_and_the_parameters_are/:/mnt/analysis /home1/christina/pema_v.2.1.4.sif
module unload singularity/3.7.1 # unloads singularity
```
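For the #SBATCH part that would sit at the top of such a script, a minimal sketch could look like the following; the partition name, memory, and time limit are assumptions you would adapt to your own cluster:

```bash
#!/bin/bash
#SBATCH --job-name=pema_16S    # one job per gene, e.g. pema_16S, pema_18S, pema_COI
#SBATCH --partition=batch      # use whatever partition your HPC offers; "batch" is Zorbas-specific
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20     # match this to the number of cores/threads you ask PEMA to use
#SBATCH --mem=64G              # assumed value; adjust to your nodes
#SBATCH --time=3-00:00:00      # assumed 3-day limit
```

You would then submit each script with `sbatch`, once per gene.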
Hi @marymcelroy.
@cpavloud is right. Just a few more comments from my side:
You could also build the image as a Singularity `sandbox`, meaning once you build your sandbox you can try to run the commands of step 2 in there and build your new ref db under the `/tools/RDPTools/TRAIN/` path on the image. If you can do this, then you won't have to train the classifier again and again every time you run your analysis; instead, you will be able to set the `custom_ref_db` parameter to `No` and give the name of your db through the `name_of_custom_db` parameter.
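As a rough sketch of the sandbox route using standard Singularity commands (the paths and names here are just examples):

```bash
# build a writable sandbox directory from the PEMA image
singularity build --sandbox pema_sandbox/ pema_v.2.1.4.sif

# open a writable shell inside it and run the training commands of step 2 there
# (depending on your setup, sudo or --fakeroot may be needed for write access)
singularity shell --writable pema_sandbox/

# inside the shell, the trained db should end up under /tools/RDPTools/TRAIN/
```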
If this note is making things more confusing, please ignore it! In any case, if I had to guess, I would say you wouldn't need more than 3 days for any of your analyses, no matter what parameters you ask for. A step that can really take some time if you have a great number of OTUs/ASVs is `getNCBITaxId`; I suggest setting it to `No`, at least until you have your first results.
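In the parameters file, that would translate to something like the lines below; the parameter names are as discussed above, while the db name and the exact file layout are just an assumed sketch:

```
custom_ref_db       No
name_of_custom_db   my_coi_ref_db
getNCBITaxId        No
```

Once your first results look sane, you could flip `getNCBITaxId` to `Yes` if you need the NCBI taxon ids.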
Good luck! :100:
Thank you both very much for the suggestions!
@marymcelroy I am now closing this issue; in case you'd like to give us any feedback, please feel free to open a new discussion.
Thanks again for your interest in PEMA.
Hi Haris! I'd like to use PEMA on some metabarcoding data for my graduate work. I successfully installed the Singularity image on my university's HPC environment, but I was hoping for some advice about how to estimate the HPC resources I would need to run PEMA in my job script (we use Slurm). Specifically, do you have any guidance for the #SBATCH specifications and values I should use?
For context, I have 96 samples that were PE sequenced for COI, 18S, and 16S amplicons (euk eDNA metabarcoding) on an Illumina MiSeq, so I have 576 fastq files as my raw sequencing data. I would like to use a custom ref db for COI, so I will follow your instructions about training the RDP classifier (I know this will likely affect computational load). Thank you!