cirro integration

nhoffman commented 1 year ago

@sminot - as @crosenth and I discussed this, it seems that all of the pieces are in place to support the following workflow without the need for additional code.

Consider this workflow:

1. Sequencing core uploads fastqs to S3.
2. Lab user identifies the sequencing run and uploads sample information in the form of an Excel file containing (at minimum) the columns "sampleid" (filename prefix for fastq files) and "batch". Other fields are included in a pipeline output that is used in downstream analyses.
3. Cirro generates a file containing the full S3 path of all uploaded fastqs (in fastq_list.txt).
4. Cirro prompts the user for pipeline params (or substitutes defaults).
5. Cirro generates a JSON-format params file specifying the S3 path for the uploaded Excel sheet, the fastq_list, and pipeline params.
6. Cirro launches the pipeline with the params file as an input.
7. Pipeline associates fastqs with metadata (https://github.com/nhoffman/dada2-nf/blob/master/bin/manifest.py).
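
For illustration, the steps that write fastq_list.txt and the JSON params file could look roughly like the sketch below. This is not Cirro's code or the pipeline's actual parameter schema: the function name `build_params` and the keys `sample_information`, `fastq_list`, and `example_option` are hypothetical placeholders.

```python
import json


def build_params(fastq_paths, sample_info_path, pipeline_params=None,
                 fastq_list_file="fastq_list.txt", params_file="params.json"):
    """Write the fastq list and a JSON params file; return the params dict.

    All key names below are hypothetical, not the pipeline's real schema.
    """
    # One full S3 path per line.
    with open(fastq_list_file, "w") as f:
        f.write("\n".join(sorted(fastq_paths)) + "\n")

    # Hypothetical default, replaced by any user-supplied value.
    params = {"example_option": 1}
    params.update(pipeline_params or {})

    # Point the pipeline at the uploaded sample sheet and the fastq list.
    params["sample_information"] = sample_info_path
    params["fastq_list"] = fastq_list_file

    with open(params_file, "w") as f:
        json.dump(params, f, indent=2)
    return params
```

The pipeline would then be launched with the resulting params file as its only input.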

There is also an option to provide the pipeline with a unified manifest + fastq list in this format: https://github.com/nhoffman/dada2-nf/blob/master/test/manifest.csv - but this raises the question of where (and when) this file would be generated... I think that we've come full circle to our initial conversations about Cirro wanting to combine the fastq files with associated metadata independently from the pipeline. I think I understand now that this assumes the fastq files and manifest will be uploaded at the same time, which doesn't align with the expected lab workflow. So perhaps the original model for providing inputs (fastq_list.txt + manifest.xlsx) is the most convenient given the expected workflow. If so, I think we're ready to go.

If you would like to test the above, see fastqs and sample sheet below (in bvdiversity):

data/miseq-plate-90/run-files/*/Data/Intensities/BaseCalls/m90*.fastq.gz
data/miseq-plate-90/sample-information/sample-information-m90.xlsx

sminot commented 1 year ago

Oh this is fantastic, and it totally gives me enough to go on.

Addressing your question about the timing of upload for the FASTQs vs. metadata, I think that we should probably be assuming by default that the FASTQs are being uploaded from a sequencing run with no metadata associated whatsoever. We can then have the user upload a manifest (with columns for sampleid and batch required, all other columns being applied as metadata). The user can then select the batch of FASTQs, select the uploaded manifest, and click "Run".

Thinking ahead to the metadata which is managed inside Cirro itself (via the GUI, or by uploading a samplesheet), my inclination would be to have any additional metadata columns in the user-uploaded manifest file take precedence (in the case of any overlapping keys). That's probably not a circumstance that we will encounter, but it's good to think ahead. Let me know if you have any objection to that behavior.
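
A minimal sketch of that precedence rule, assuming per-sample metadata is held as plain dictionaries. The function name `merge_metadata` and the example columns `site` and `age` are made up for illustration and are not part of Cirro's API.

```python
def merge_metadata(cirro_metadata, manifest_row):
    """Combine per-sample metadata; manifest values win on overlapping keys."""
    merged = dict(cirro_metadata)   # start from the metadata managed in Cirro
    merged.update(manifest_row)     # manifest columns overwrite any shared keys
    return merged


# Hypothetical example values.
cirro = {"sampleid": "s1", "site": "clinic-A", "age": "34"}
manifest = {"sampleid": "s1", "batch": "m90", "site": "clinic-B"}
print(merge_metadata(cirro, manifest))
```

Here `site` comes from the manifest, while `age` is carried over from Cirro because the manifest does not define it.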

Assuming you don't have any objection, I'll move ahead with the example files you provided and get this wrapped together!

nhoffman commented 1 year ago

No objection - let's give it a shot!

sminot commented 1 year ago

Here's a point of clarification that I forgot about: what sample name gets assigned based on the FASTQ file name.

The default behavior in Cirro is to just use the name provided with the library. So sample_name_S23_L001_R1_001.fastq.gz would be sample_name.

However, I noticed in our earlier conversations that you had recommended using everything before the first underscore, so sample_name_S23_L001_R1_001.fastq.gz -> sample.

I could try to make it support either one of those, but it would probably be easiest to just pick one. Do you have a strong preference as to which behavior would be easiest for our intended users, @nhoffman?

nhoffman commented 1 year ago

Either method would work as long as the convention to include only alphanumeric characters or hyphens in the sample name is followed (which has been the case for all samples provided so far). But I'm not sure how you would infer the name without defining a delimiter in the absence of a manifest? Does the remainder of the name (in your example, _S23_L001_R1_001.fastq.gz) always match a pattern that can be used to remove the non-sample-name part?
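
To make the two options concrete, here is a sketch comparing them. It assumes file names follow the usual Illumina convention (`<sample>_S<n>_L<lane>_<R|I><n>_001.fastq.gz`); the regular expression and both function names are illustrative and are not taken from Cirro or dada2-nf.

```python
import re

# Assumed suffix pattern: _S<n>_L<3-digit lane>_<R|I><1|2>_001.fastq.gz
ILLUMINA_SUFFIX = re.compile(r"_S\d+_L\d{3}_[RI][12]_001\.fastq\.gz$")


def name_by_suffix(file_name):
    """Strip the Illumina suffix; return None if the name does not match."""
    stripped, n_subs = ILLUMINA_SUFFIX.subn("", file_name)
    return stripped if n_subs else None


def name_by_first_underscore(file_name):
    """Keep everything before the first underscore."""
    return file_name.split("_")[0]


fn = "sample_name_S23_L001_R1_001.fastq.gz"
print(name_by_suffix(fn))            # sample_name
print(name_by_first_underscore(fn))  # sample
```

The two functions agree whenever the sample name itself contains no underscore, which is why the naming convention matters more than the choice of method.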

sminot commented 1 year ago

Ah, this is very useful. So including the underscore in the sampleid field would cause downstream issues? In that case, the only issue I can think of would be if there are any other non-alphanumeric characters in the file names.

Do you think we should raise an error if there are non-alphanumeric characters before the first underscore? Or just keep it as-is and trust the user not to submit samples to Genomics for sequencing which are named with hyphens or periods?
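
If we did go the error route, the check could be as small as the sketch below. The function name `check_sample_prefix` and the `allow_hyphen` switch are hypothetical, and whether hyphens should be accepted is exactly the open question.

```python
import re


def check_sample_prefix(file_name, allow_hyphen=False):
    """Return the text before the first underscore, or raise ValueError
    if it contains characters outside the allowed set."""
    prefix = file_name.split("_")[0]
    allowed = r"[A-Za-z0-9-]+" if allow_hyphen else r"[A-Za-z0-9]+"
    if not re.fullmatch(allowed, prefix):
        raise ValueError(
            f"unexpected characters in sample name {prefix!r} "
            f"(from file {file_name!r})"
        )
    return prefix
```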

nhoffman commented 1 year ago

I shouldn't have specified alphanumeric. Pretty sure we're just assuming

sample_name = file_name.split('_')[0]

... and then using sample_name as a key into the sample info.
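
Spelled out a bit more, that lookup might look like the following. This is a sketch of the idea only, not the logic in manifest.py: the function name `associate` and the shape of `sample_info` (a dict keyed by sampleid) are assumptions.

```python
from collections import defaultdict


def associate(fastq_paths, sample_info):
    """Group FASTQ paths by sample name and attach the matching sample info.

    sample_info maps sampleid -> dict of manifest columns (assumed shape).
    Returns (matched, unmatched_sample_names).
    """
    by_sample = defaultdict(list)
    for path in fastq_paths:
        file_name = path.rsplit("/", 1)[-1]    # drop the S3 prefix
        sample_name = file_name.split("_")[0]  # text before the first underscore
        by_sample[sample_name].append(path)

    matched, unmatched = {}, []
    for name, paths in by_sample.items():
        if name in sample_info:
            matched[name] = {"fastqs": sorted(paths), **sample_info[name]}
        else:
            unmatched.append(name)
    return matched, sorted(unmatched)
```

Returning the unmatched names separately makes it easy to report fastqs that have no row in the sample sheet.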

sminot commented 1 year ago

Everything is running nicely in Cirro using the approach we outlined. I just noticed that not all of the outputs were being populated.

I made a PR which I think should fix it: https://github.com/nhoffman/dada2-nf/pull/83

sminot commented 1 year ago

It appears to have completed successfully!

nhoffman / dada2-nf

cirro integration #82