Data provenance considerations

hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes


Data provenance considerations #28

Closed by cpavloud 2 years ago

cpavloud commented 2 years ago

We should add to the parameters file the version of the SWARM algorithm that is implemented in PEMA, as well as the versions of CROP, RAxML-ng, PaPaRa and EPA-ng, and the version of cutadapt that is used for primer removal in the case of ITS. For the MIDORI database, we need to specify the GenBank release that it was based on. I think that for all the other tools, the versioning information is already there.

Also, we should mention somewhere in the parameters file that RDPClassifier is being used for the COI gene, along with the version of RDPClassifier. Similarly, we should mention that CREST is being used for the 16S, 18S and ITS markers.

Also, we should add the thresholds/default values used by the classifiers for the taxonomic identification of the sequences. We could then include this information in the otu_seq_comp_appr term when submitting data to GBIF/OBIS using the DwC-A format.

Then, after every analysis, the user will have full provenance (regarding tools and parameters implemented) stored in the copy of the parameters file inside the output folder.

hariszaf commented 2 years ago

We could build a .sh script, called at the end of the Dockerfile, that runs commands such as swarm --version and saves their output in a file that would be part of the image. This way, whenever a tool is updated, the file will be updated automatically.
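A minimal sketch of such a script, for illustration only. The function name record_versions, the output file tool_versions.tsv and the tool list are hypothetical, and the sketch assumes each listed tool accepts a --version flag, which would need to be checked tool by tool.

```shell
#!/usr/bin/env bash
# Sketch: write one "tool<TAB>version" line per tool into a TSV file.
# record_versions, the file name and the tool list are assumptions.

record_versions() {
    local out_file="$1"
    shift
    printf 'tool\tversion\n' > "$out_file"
    local tool version
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            # Some tools print their version on stderr, so capture both
            # streams and keep only the first non-empty line.
            version=$("$tool" --version 2>&1 < /dev/null | awk 'NF { print; exit }')
        else
            version="not found"
        fi
        printf '%s\t%s\n' "$tool" "${version:-unknown}" >> "$out_file"
    done
}

# Example call; tools missing from PATH are recorded as "not found".
record_versions tool_versions.tsv swarm cutadapt
```

Because the file is regenerated every time the image is built, it cannot drift from the software that is actually installed.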

hariszaf commented 2 years ago

Since pema:v.2.1.4, the user can find two files when running a PEMA container, pema_environment.tsv and pema_R_packages.tsv, which record the version information mentioned above. In addition, there is also the encapsulated_software.md file under the help_files directory with a snapshot of the software at pema:v.2.1.4.

cpavloud commented 2 years ago

Reopening the issue!

Could we have something like a "run id"? So that the user can separate different runs using the same data but with different parameters?

Or something like a date, time and information on where PEMA is running, i.e. on which cluster?

I am thinking that it cannot be done automatically. But could it be added by the user perhaps? As an extra, non-meaningful, non-software related, parameter?

hariszaf commented 2 years ago

The easiest thing to do would be to add a UUID. This way, the copy of the parameters.tsv file that PEMA produces and saves in the output directory could contain something like this:

run_id c87d8885-95e8-4cc2-8f33-48dda0cc4467

To do so, you just use the following bash command: uuid=$(uuidgen)
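As a rough sketch of how that could look in practice, the function below appends a run_id line to a copy of the parameters file. The name append_run_id and the demo file are hypothetical, and the fallback to /proc/sys/kernel/random/uuid is an assumption for Linux images where uuidgen is not installed.

```shell
#!/usr/bin/env bash
# Sketch: tag a parameters file copy with a unique run identifier.
# append_run_id and the demo file below are hypothetical.

append_run_id() {
    local params_copy="$1"
    local run_id
    if command -v uuidgen > /dev/null 2>&1; then
        run_id=$(uuidgen)
    elif [ -r /proc/sys/kernel/random/uuid ]; then
        # Linux-only fallback for images that do not ship uuidgen.
        run_id=$(cat /proc/sys/kernel/random/uuid)
    else
        echo "no UUID source available" >&2
        return 1
    fi
    # Append the identifier as a tab-separated key/value line.
    printf 'run_id\t%s\n' "$run_id" >> "$params_copy"
    # Also print it, so the caller can reuse it (e.g. in a folder name).
    printf '%s\n' "$run_id"
}

# Demo on a throwaway file standing in for the parameters.tsv copy.
demo_params=$(mktemp)
append_run_id "$demo_params"
```

Printing the identifier as well as storing it lets the same value be reused, for example to name the output folder of that run.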

In addition, you could also have something based on the date and time, for example:

10:22--01-12-2022

Again, to do this would be a single bash command: date +'%I:%M--%m-%d-%Y'
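One thing to keep in mind is that %I is the 12-hour clock, so without %p a morning run and an afternoon run would get the same stamp. A hedged variant is sketched below: it uses a 24-hour UTC timestamp and also records the machine name, which would cover the "which cluster" question. The function name append_run_context and the keys run_started and run_host are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: record when and where a run started in the parameters file copy.
# append_run_context and the key names are assumptions for illustration.

append_run_context() {
    local params_copy="$1"
    local started host
    # ISO 8601-style UTC timestamp with a 24-hour clock; sorts as text.
    started=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
    # uname -n prints the node name and is available on POSIX systems.
    host=$(uname -n)
    printf 'run_started\t%s\n' "$started" >> "$params_copy"
    printf 'run_host\t%s\n' "$host" >> "$params_copy"
}

# Demo on a throwaway file standing in for the parameters.tsv copy.
demo_context=$(mktemp)
append_run_context "$demo_context"
cat "$demo_context"
```

Inside a container, uname -n usually returns the container's hostname rather than the cluster name, so a user-supplied value may still be needed for that part.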

Tell me if you like something like that and I'll try to build an image as soon as possible.

cpavloud commented 2 years ago

That's great!

hariszaf commented 2 years ago

This has been moved to issue #35.