iobis / Project-team-Genetic-Data

Developing guidelines for adding sequence data to OBIS

Standardized bioinformatic pipeline #6

Open SSuominen1 opened 3 years ago

SSuominen1 commented 3 years ago

What is the best way to record the bioinformatic tools/pipelines that were used?

I understood there are some developments for this in Ocean Best Practices; we should look into that.

Through the PacMAN project, OBIS will also be developing a pipeline, or researching how output from existing pipelines can be formatted for DwC-A. Is there a need for this from other users?

cpavloud commented 2 years ago

Could we use the term "identificationRemarks" to specify the pipeline used (along with all its relevant - user selected - parameters, separated by vertical bar space ( | )) and the "identificationReferences" term for the reference/citation/url of the pipeline?
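To make the proposal concrete, here is a minimal sketch of how such an identificationRemarks value could be assembled. The helper name `format_identification_remarks`, the pipeline name, and the parameter names are all hypothetical and chosen only for illustration; they are not real tool flags and nothing here is prescribed by Darwin Core.

```python
def format_identification_remarks(pipeline, version, params):
    """Join pipeline name, version and user-selected parameters with ' | '."""
    parts = [f"{pipeline} {version}"]
    # One "name=value" entry per user-selected parameter, in insertion order.
    parts += [f"{key}={value}" for key, value in params.items()]
    return " | ".join(parts)


# Illustrative values only: the pipeline and parameter names are assumptions.
remarks = format_identification_remarks(
    "examplepipeline", "1.0", {"trunc_len": 120, "min_quality": 20}
)
print(remarks)
```

The resulting string would go into identificationRemarks, while the citation or URL of the pipeline would go into identificationReferences.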

dschigel commented 2 years ago

Looks like our DNA guide recommends identificationReferences, see https://docs.gbif.org/publishing-dna-derived-data/1.0/en/#mapping-metabarcoding-edna-and-barcoding-data @thomasstjerne please take a look: I think the issue is that we have remarks and references, but no clear place to paste the pipeline name. One may claim that the reference includes the name and number, but perhaps this is not good enough for @cpavloud?

pieterprovoost commented 2 years ago

Just thinking out loud here, but for many pipelines a run with a specific set of parameters will be defined by a custom configuration file or makefile. Perhaps the recommendation should be that this file is committed to source control (GitHub or other) and included as one of the identificationReferences. I think that would benefit reproducibility.
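As a rough sketch of that idea: a link to the configuration file pinned to a specific commit could be appended to identificationReferences. The repository owner, repository name, commit hash, file path and citation text below are placeholders, and the helper names are hypothetical; the only assumptions about real systems are GitHub's `blob/<commit>/<path>` permalink layout and the " | " list separator.

```python
def config_permalink(owner, repo, commit_sha, path):
    """Build a GitHub permalink to a file at a specific commit."""
    return f"https://github.com/{owner}/{repo}/blob/{commit_sha}/{path}"


def add_reference(existing, new_ref):
    """Append a reference to a ' | '-separated identificationReferences value."""
    refs = [r.strip() for r in existing.split("|") if r.strip()]
    if new_ref not in refs:  # avoid listing the same reference twice
        refs.append(new_ref)
    return " | ".join(refs)


# Placeholder values: not a real repository, commit or citation.
link = config_permalink("example-org", "example-analysis", "0123abc", "config/run.yaml")
references = add_reference("Example pipeline citation (placeholder)", link)
print(references)
```

Pinning the link to a commit rather than a branch means the reference keeps pointing at the exact parameters used, even if the file changes later.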

cpavloud commented 2 years ago

@dschigel My issue is that a) when a pipeline is used (e.g. QIIME2), providing just the name is not enough. The parameters that the user selected for each step of the bioinformatic analysis should be documented, so that the analysis is replicable. b) when different individual tools are used (one for each step of the analysis, e.g. sickle for the quality filtering, pandaseq for the merging, UCHIME for the chimera removal etc.), the identificationReferences should contain more than the name, and (again) the parameters that the user selected for each tool should be documented.

@pieterprovoost yes, this is a good idea and it can be used for certain pipelines. Also, maybe the sop term can be used for full documentation of the analysis instead of identificationReferences? In this case (again), the user/data provider should have deposited the sop in a repository (GitHub or other).

thomasstjerne commented 2 years ago

@cpavloud in the DNA derived data extension there are dedicated fields for (at least some) individual pipeline steps. For example the field chimera_check is supposed to have a value like uchime;v4.1;default parameters.
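For illustration, a value of that shape can be split into its parts and reassembled. This is only a sketch, assuming the three-part `software;version;parameters` layout of the example above; the `StepRecord` type and the function names are hypothetical, not part of the extension or of MIxS.

```python
from typing import NamedTuple


class StepRecord(NamedTuple):
    software: str
    version: str
    parameters: str


def parse_step(value):
    """Split a 'software;version;parameters' value into its three parts."""
    # maxsplit=2 keeps any further ';' inside the parameters part.
    software, version, parameters = (part.strip() for part in value.split(";", 2))
    return StepRecord(software, version, parameters)


def format_step(record):
    """Reassemble a StepRecord into the ';'-separated form."""
    return ";".join(record)


record = parse_step("uchime;v4.1;default parameters")
print(record.software, record.version, record.parameters)
```

A value with fewer than three parts raises a `ValueError` here, which is one way a publisher could catch incomplete entries before upload.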

These fields originate from the MIxS standard, and I think it would be fair to ask whether e.g. the seq_quality_check field is appropriate for information about quality filtering, and also whether there is a field intended for the merging.

But I think that it would always be desirable to have a link in the sop field to a structured pipeline description.