nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2

Standardize input file(s) #78

Closed by crosenth 1 year ago

crosenth commented 1 year ago

The current process uses a manifest with "sampleid,batch" columns, with Malt creating an additional fastq-file.txt file that lists the fastq paths. The pipeline has a starting channel shape of "sampleid,batch,direction,fastq,index1,index2". I still need to check in with Sam on what is best for Cirro. If Malt and Cirro can generate the required channel shape directly, that would simplify the pipeline.

nhoffman commented 1 year ago

Chris, I think the design that Sam described placed each sample on a single row like this:

"sampleid,...,batch,R1,R2,I1,I2"

where we'd include columns for additional labels and metadata (need to look at what exactly the lab is providing). So no need to specify direction, and we'd include a path for each of R1 and R2.

We should also think about how we are specifying primer and adapter sequences - should it be here or in the config?

crosenth commented 1 year ago

Great, that works, and any additional columns can be passed through.

sminot commented 1 year ago

I don't think I've heard the name "Malt" before, but I'll assume that it's the LIMS-associated software which is preparing the data generated by the sequencer to be run through dada2-nf.

One approach which might be the least disruptive would be to give the user the option to provide the full samplesheet using the format which is being used currently (sampleid,batch,direction,fastq,index1,index2). If that samplesheet is provided, then the manifest reconstruction step is skipped. However, all of your current automation could still stay in place as-is.

While I did mention sampleid,...,batch,R1,R2,I1,I2 as being my default samplesheet format, it's trivial to produce sampleid,batch,direction,fastq,index1,index2 as well. The only requirement for Cirro is to remove the behavior using fastq-file.txt for inputs, which doesn't play well with object storage.
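As a rough illustration of the conversion described above, the following Python sketch expands one wide row (sampleid,...,batch,R1,R2,I1,I2) into two long rows (sampleid,batch,direction,fastq,index1,index2). It is not the pipeline's code. The function name `wide_to_long`, the direction labels "R1" and "R2", and the choice to repeat both index paths on each row are assumptions made for the example.

```python
import csv
import io

# Columns the wide samplesheet must contain; any others are ignored here.
REQUIRED = ("sampleid", "batch", "R1", "R2", "I1", "I2")


def wide_to_long(rows):
    """Yield one long-format dict per read file (two per sample).

    Assumption: each long row carries both index paths, and the
    direction label is simply the name of the source column.
    """
    for row in rows:
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            raise ValueError(f"samplesheet is missing columns: {missing}")
        for direction in ("R1", "R2"):
            yield {
                "sampleid": row["sampleid"],
                "batch": row["batch"],
                "direction": direction,
                "fastq": row[direction],
                "index1": row["I1"],
                "index2": row["I2"],
            }


# Hypothetical one-sample sheet with an extra "label" column.
wide_csv = """sampleid,label,batch,R1,R2,I1,I2
s1,control,b1,s1_R1.fastq.gz,s1_R2.fastq.gz,s1_I1.fastq.gz,s1_I2.fastq.gz
"""
long_rows = list(wide_to_long(csv.DictReader(io.StringIO(wide_csv))))
```

The extra `label` column is dropped from the long rows in this sketch. Carrying it through would only require copying the remaining keys.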

Re: primer and adapter sequences – my inclination would be to specify them in the global scope using params. If alternatively you see a need for using different primers for different samples, then it would make sense to specify them in the samplesheet.
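One way to reconcile the two options above is a global default with an optional per-sample override. The Python sketch below only illustrates that lookup order. The parameter names `fwd_primer` and `rev_primer`, the placeholder sequences, and the idea of combining both mechanisms are assumptions, not something settled in this thread.

```python
# Hypothetical global defaults, standing in for pipeline-wide params.
# The sequences are placeholders, not real primers.
GLOBAL_PARAMS = {"fwd_primer": "ACGTACGT", "rev_primer": "TGCATGCA"}


def primers_for(row, params=GLOBAL_PARAMS):
    """Return the primers for one samplesheet row.

    A non-empty value in the row wins; otherwise the global
    default from params is used.
    """
    return {
        key: (row.get(key) or "").strip() or default
        for key, default in params.items()
    }


# A sample with no primer columns falls back to the global values.
plain = primers_for({"sampleid": "s1", "batch": "b1"})

# A sample that sets only the forward primer overrides just that one.
custom = primers_for({"sampleid": "s2", "batch": "b1", "fwd_primer": "GGGGCCCC"})
```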

crosenth commented 1 year ago

Let's go with your default samplesheet format sampleid,...,batch,R1,R2,I1,I2 with no fastq-file.txt as Noah stated.

Yes, Malt is our in-house LIMS application, and we will program that from our end.

sminot commented 1 year ago

Just wanted to make sure I'm clear -- do you like the "make no changes to the existing inputs" approach? Or do you want to go ahead and make breaking changes that will then require changes upstream in Malt?

crosenth commented 1 year ago

Sam, what does the ,..., mean for the samplesheet format?

sminot commented 1 year ago

I copied that from Noah's comment -- I assumed he was referring to additional columns with unspecified content

nhoffman commented 1 year ago

That's right: the "..." just indicates additional columns that will be ignored but included in the sample sheet copied to the output directory.

Since most of the time the fastqs are all in the same directory, I'm considering a convention of

sampleid,...,batch,datadir,R1,R2,I1,I2

If datadir has a value, it will be used as the parent directory for whatever path (typically just a filename) is provided for each of the files. Any thoughts on this? Cirro can always modify if necessary.
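The datadir convention could be resolved roughly as in the Python sketch below. This illustrates the rule as stated, not the pipeline's implementation. The function name `resolve_paths` and the example bucket path are hypothetical. `posixpath` is used so that forward slashes are kept for object-storage style paths.

```python
import posixpath

# Columns that hold file paths in the proposed samplesheet.
FILE_COLUMNS = ("R1", "R2", "I1", "I2")


def resolve_paths(row):
    """Return a copy of row with datadir applied to each file column.

    If datadir is empty or absent, the row is returned unchanged.
    Empty file columns are left empty.
    """
    datadir = (row.get("datadir") or "").strip()
    resolved = dict(row)
    if datadir:
        for col in FILE_COLUMNS:
            if resolved.get(col):
                resolved[col] = posixpath.join(datadir, resolved[col])
    return resolved


# Hypothetical row: filenames only, with a shared parent directory.
row = {
    "sampleid": "s1",
    "batch": "b1",
    "datadir": "s3://bucket/run1",
    "R1": "s1_R1.fastq.gz",
    "R2": "s1_R2.fastq.gz",
    "I1": "",
    "I2": "",
}
resolved = resolve_paths(row)

# Without datadir, paths pass through untouched.
unchanged = resolve_paths({"sampleid": "s2", "datadir": "", "R1": "/data/s2_R1.fastq.gz"})
```

Note that `posixpath.join` discards datadir when the file value is itself an absolute path, so a row can mix bare filenames and absolute paths in this sketch.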

sminot commented 1 year ago

That sounds fine to me, @nhoffman

Before I dive in, I'd like to make sure that we're all on the same page for the other points which have come up:

nhoffman commented 1 year ago

@sminot -

sminot commented 1 year ago

That sounds great. I'm making my changes on https://github.com/nhoffman/dada2-nf/tree/cloud_compatibility. I'll make a PR once the tests run to completion in Cirro.