hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

Storage usage of PEMA #65

Open savvas-paragkamian opened 11 months ago

savvas-paragkamian commented 11 months ago

This is more of a question about how PEMA uses storage for each run. For my project I have 140 samples with PE sequences, resulting in 14 GB of data.

14G ./my data
196G /pema215_otu

Is it possible to reduce the storage needed for a PEMA run, or is all of the output required?

For example, I have two all_samples.fasta files (one in mainOutput and one in the PEMA folder) and one final_all_samples.fasta. Are all of them necessary?
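Before deleting either copy, it may be worth confirming that the two all_samples.fasta files really are identical. A minimal sketch, assuming nothing about PEMA itself: `file_digest` and `are_identical` are hypothetical helper names, and the paths passed in are whatever your run produced.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large FASTA files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def are_identical(path_a, path_b):
    """Return True if both files have the same size and SHA-256 digest."""
    path_a, path_b = Path(path_a), Path(path_b)
    if path_a.stat().st_size != path_b.stat().st_size:
        return False  # different sizes, so hashing is unnecessary
    return file_digest(path_a) == file_digest(path_b)
```

Calling `are_identical` on the two copies returns `True` only when their contents match byte for byte, so a `False` result means the files differ and neither should be treated as a redundant duplicate.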

Also, some intermediate folders, such as linearizedSequences and mergedSequences, each take up about as much space as the mydata folder.
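To see which folders dominate, a per-folder size report can help. The sketch below is a generic helper, not part of PEMA; `dir_size` and `subfolder_sizes` are hypothetical names. It sums apparent file sizes, so the numbers can differ somewhat from what `du` reports for on-disk block usage.

```python
import os
from pathlib import Path


def dir_size(path):
    """Sum the apparent sizes (in bytes) of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def subfolder_sizes(output_dir):
    """Return (name, bytes) for each immediate subfolder, largest first."""
    sizes = [
        (entry.name, dir_size(entry))
        for entry in Path(output_dir).iterdir()
        if entry.is_dir()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Running `subfolder_sizes` on the run's output folder lists the largest subfolders first, which makes it easy to spot the intermediate steps worth cleaning up.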

I am raising this because, in large-scale projects, this storage footprint can exceed the disk quota.

hariszaf commented 11 months ago

Hi @savvas-paragkamian. Thanks for the points.

The all_samples.fasta should be removed from the top output folder.

In general, a feature could be added so that files no longer needed after a given step are removed on the fly.

At the moment, PEMA returns everything so that the user can validate the filtering parameters and their effect.

However, it might be a good option to remove intermediate files optionally for such cases.
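Until such an option exists, a post-run cleanup could look roughly like the sketch below. This is not a PEMA feature: `remove_intermediates` is a hypothetical helper, the folder names are the ones mentioned in this thread, and it is an assumption that they sit directly under the output folder and are no longer needed once the run has finished. It defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path

# Folder names taken from this thread. That they are safe to delete after
# a completed run is an assumption, not something PEMA guarantees.
INTERMEDIATE_FOLDERS = ("linearizedSequences", "mergedSequences")


def remove_intermediates(output_dir, folders=INTERMEDIATE_FOLDERS, dry_run=True):
    """Delete the named subfolders of output_dir and return the paths affected.

    With dry_run=True (the default) nothing is deleted; the function only
    reports which folders would be removed.
    """
    affected = []
    for name in folders:
        target = Path(output_dir) / name
        if target.is_dir():
            affected.append(target)
            if not dry_run:
                shutil.rmtree(target)
    return affected
```

A first call with the default `dry_run=True` shows what would be removed; calling it again with `dry_run=False` performs the deletion. Keeping the folder list explicit means the outputs needed to validate the filtering parameters stay untouched unless they are named.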