Closed edmundmiller closed 3 weeks ago
CC: @tfenne @natproach
@nh13 @sam-white04 Some thoughts on the samplesheet. I don't think using params is very "nf-core", and it causes fqtk to have its own special path that isn't like the others. nf-core has moved away from that with the DSL2 pipelines. I'm just the messenger.
So to clarify, I'm against:
nextflow run nf-core/demultiplex --fastq_files out_L001_I1_001.fastq.gz out_L001_I2_001.fastq.gz out_L001_R1_001.fastq.gz out_L001_R2_001.fastq.gz --read_structure 8B 8B 150T 150T --demultiplexer fqtk
And instead for the usual --input samplesheet.csv. This allows users to run multiple samples and has fewer footguns.
I see two different options for a samplesheet, and a third if you just want to use the native fqtk samplesheets.
- All in one line
sample,fastq,read_structure
test,out_L001_I1_001.fastq.gz out_L001_I2_001.fastq.gz out_L001_R1_001.fastq.gz out_L001_R2_001.fastq.gz,8B 8B 150T 150T
somesample,somesample/out_L001_I1_001.fastq.gz somesample/out_L001_I2_001.fastq.gz somesample/out_L001_R1_001.fastq.gz somesample/out_L001_R2_001.fastq.gz,8B 8B 150T 150T
- A line per fastq (I think this will have the least user errors)
sample,fastq,read_structure
test_1,out_l001_i1_001.fastq.gz,8b
test_2,out_l001_i2_001.fastq.gz,8b
test_3,out_l001_r1_001.fastq.gz,150t
test_4,out_l001_r2_001.fastq.gz,150t
somesample_1,somesample/out_l001_i1_001.fastq.gz,8b
somesample_2,somesample/out_l001_i2_001.fastq.gz,8b
somesample_3,somesample/out_l001_r1_001.fastq.gz,150t
somesample_4,somesample/out_l001_r2_001.fastq.gz,150t
- Use the native fqtk samplesheet handling (I think I saw it has that functionality?). But I think some of the issues will be around getting the relative file paths right and whatnot. It might be painful. This could also pass the fastq directory or something, just spitballing.
sample,fqtk_samplesheet
test,test_samplesheet.csv
somesample,somesample/samplesheet.csv
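To make the "all in one line" layout concrete, here is a minimal Python sketch of how such a samplesheet could be parsed and sanity-checked. It is not pipeline code: the function name `parse_one_line_samplesheet` is hypothetical, and it assumes the `fastq` and `read_structure` cells are space-separated lists that must have the same length.

```python
import csv
import io

# Example samplesheet in the "all in one line" layout from the comment above.
SAMPLESHEET = """\
sample,fastq,read_structure
test,out_L001_I1_001.fastq.gz out_L001_I2_001.fastq.gz out_L001_R1_001.fastq.gz out_L001_R2_001.fastq.gz,8B 8B 150T 150T
"""


def parse_one_line_samplesheet(text):
    """Return {sample: [(fastq, read_structure), ...]}.

    Hypothetical helper: assumes space-separated lists in the `fastq`
    and `read_structure` columns, paired up by position.
    """
    runs = {}
    for row in csv.DictReader(io.StringIO(text)):
        fastqs = row["fastq"].split()
        structures = row["read_structure"].split()
        # The main footgun of this layout: the two lists drifting out of sync.
        if len(fastqs) != len(structures):
            raise ValueError(
                f"{row['sample']}: {len(fastqs)} FASTQs but "
                f"{len(structures)} read structures"
            )
        runs[row["sample"]] = list(zip(fastqs, structures))
    return runs


runs = parse_one_line_samplesheet(SAMPLESHEET)
for fastq, structure in runs["test"]:
    print(fastq, structure)
```

The length check is the part worth keeping in any real implementation, since a missing or extra read structure would otherwise be silently paired with the wrong file.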
@Emiller88 I am happy to re-work the current workflow for fqtk/main.nf so that read structures and file names are provided via a sample sheet. However, I am a bit confused about what you mean by the native fqtk sample sheet. The tool fqtk does take a sample sheet, but this sample sheet has two columns, 'sample-id' and 'barcode'. The demultiplex pipeline also takes a sample sheet input. So I see two options:
1) Add a column to the sample sheet already input to demultiplex, so it may look like this for fqtk:
flowcell,samplesheet,read_structure_manifest,lane,run_dir
DDMMYY_SERIAL_NUMBER_FC,/path/to/samplesheet.tsv,/path/to/read_structure_manifest.csv,,path/to/fastq.tar.gz
2) Adapt the existing samplesheet from what it is currently (sample id, barcode) to the example shown below. Columns 3+4 could be parsed out of this file, and columns 1+2 could remain untouched:
sample_id barcode fastq_file_name read_structure
s1 AAGCCCAATAAACCAC out_L001_I1_001.fastq.gz 8B
s2 TCTGACTGGCCGAATA out_L001_I2_001.fastq.gz 8B
s3 GGGATATAGGCAACGA out_L001_R1_001.fastq.gz 150T
s4 CATGTGCGGCGACCCT out_L001_R2_001.fastq.gz 150T
s5 TGCGACAGTGACGCTT
s6 TCGCCGTTGCCTAAAC
s7 CTATTTGAAGGAGTCT
s8 AGCAGCCGCAGTAAGG
s9 CACAATACCTCGTCCG
s10 TGTTACCAGACCAAAC
s11 AAGACGTCCTCTTCAA
@Emiller88 & @nh13 please let me know what you think (I hope my examples above are clear)
-Sam
I kinda like the samplesheet serving a dual purpose. Would it be possible for it to be a CSV for consistency? I guess we can parse it into the TSV in the pipeline and wrangle the data in whatever way.
@Emiller88 just to confirm, you like option 2, correct? And I am not sure if it can be a csv. Let me get back to you on that.
I think we're conflating a few things.
I think the nf-core/demultiplex documentation is extremely confusing, since this pipeline uses "sample sheet" twice: once for a file that specifies each flowcell/experiment/lane, and then again for the actual sample sheet that is referenced in that first CSV/TSV file.
@sam-white04 your option (2) above tries to have both sample information and run information in one CSV/TSV. Trying to use (1) to store information about (2) is in my opinion confusing. (1) has N rows (one per sample), whereas (2) could be a TSV/CSV with one row per sequencing run (or lane, or set of FASTQs to demux). That's super confusing and error prone.
I think the top-level "sample sheet" (I'd rename it "demux meta" or "demux setup" or something like that) that contains the per-run/flowcell information can have two columns: a. a column with space-separated FASTQ paths, and b. a column with space-separated read structures.
I think therefore I am agreeing with @Emiller88's all-one-line option.
I'd be OK with a line per FASTQ, but the first column should not be sample but instead flowcell, and have the same value for each related FASTQ.
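As a rough illustration of why a shared flowcell value matters in the line-per-FASTQ layout, the following Python sketch groups such rows by their first column and collapses them back into the all-in-one-line form. The column name `flowcell` follows the suggestion above; the function names and the example flowcell values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Line-per-FASTQ layout, with `flowcell` as the shared grouping key.
PER_FASTQ_SHEET = """\
flowcell,fastq,read_structure
test,out_L001_I1_001.fastq.gz,8B
test,out_L001_I2_001.fastq.gz,8B
test,out_L001_R1_001.fastq.gz,150T
test,out_L001_R2_001.fastq.gz,150T
somesample,somesample/out_L001_I1_001.fastq.gz,8B
somesample,somesample/out_L001_R1_001.fastq.gz,150T
"""


def group_by_flowcell(text):
    """Collect FASTQs and read structures per flowcell, preserving row order."""
    grouped = defaultdict(lambda: {"fastqs": [], "read_structures": []})
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["flowcell"]]["fastqs"].append(row["fastq"])
        grouped[row["flowcell"]]["read_structures"].append(row["read_structure"])
    return dict(grouped)


def to_one_line_rows(grouped):
    """Collapse each group into (flowcell, 'fq1 fq2 ...', 'rs1 rs2 ...')."""
    return [
        (flowcell, " ".join(g["fastqs"]), " ".join(g["read_structures"]))
        for flowcell, g in grouped.items()
    ]


grouped = group_by_flowcell(PER_FASTQ_SHEET)
for line in to_one_line_rows(grouped):
    print(",".join(line))
```

With per-row names such as test_1 and test_2 in the first column there would be no key to group on, which is the practical argument for using the flowcell there.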
Here's what I am thinking.
sample_sheet.csv (aka "the per-flowcell actual samplesheet"; fqtk needs TSV, but that's easily converted from the CSV)
sample_id,barcode
s1,AAGCCCAATAAACCAC
s2,TCTGACTGGCCGAATA
s3,GGGATATAGGCAACGA
s4,CATGTGCGGCGACCCT
s5,TGCGACAGTGACGCTT
s6,TCGCCGTTGCCTAAAC
s7,CTATTTGAAGGAGTCT
s8,AGCAGCCGCAGTAAGG
s9,CACAATACCTCGTCCG
s10,TGTTACCAGACCAAAC
s11,AAGACGTCCTCTTCAA
per_flowcell_manifest.csv (new metadata file)
fastq,read_structure
out_L001_I1_001.fastq.gz,8B
out_L001_I2_001.fastq.gz,8B
out_L001_R1_001.fastq.gz,150T
out_L001_R2_001.fastq.gz,150T
demultiplex_manifest.csv (aka "the full samplesheet", passed into the pipeline)
flowcell,samplesheet,manifest,run_dir,lane
fc1,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc1.per_flowcell_manifest.csv,/path/to/fc1_dir.tar.gz,
fc2,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc2.per_flowcell_manifest.csv,/path/to/fc2_dir.tar.gz,
fc3,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc3.per_flowcell_manifest.csv,/path/to/fc3_dir.tar.gz,
fc4,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc4.per_flowcell_manifest.csv,/path/to/fc4_dir.tar.gz,
The per_flowcell_manifest.csv would specify the name of each FASTQ that is untarred, per flowcell. I suppose we could also provide a regex there too (e.g. .*_L001_I1.*.fastq.gz instead of out_L001_I1_001.fastq.gz).
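A small Python sketch of the two mechanical steps this proposal implies: converting the per-flowcell sample sheet from CSV to TSV, and resolving each manifest entry (treated as a regex) against the files found after untarring. Both function names are hypothetical, and the rule that each pattern must match exactly one file is an assumption, not something agreed in this thread.

```python
import csv
import io
import re


def csv_to_tsv(csv_text):
    """Rewrite CSV text as TSV; only the delimiter changes."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(csv_text)))
    return out.getvalue()


def resolve_manifest(manifest_text, available_files):
    """Return [(fastq, read_structure), ...] in manifest order.

    Each `fastq` cell is treated as a regex that must match the whole
    name of exactly one available file (an assumed rule). A plain file
    name also works, since it matches itself.
    """
    resolved = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        pattern = re.compile(row["fastq"])
        matches = [name for name in available_files if pattern.fullmatch(name)]
        if len(matches) != 1:
            raise ValueError(
                f"pattern {row['fastq']!r} matched {len(matches)} files"
            )
        resolved.append((matches[0], row["read_structure"]))
    return resolved


MANIFEST = """\
fastq,read_structure
.*_L001_I1.*.fastq.gz,8B
out_L001_R1_001.fastq.gz,150T
"""
FILES = [
    "out_L001_I1_001.fastq.gz",
    "out_L001_I2_001.fastq.gz",
    "out_L001_R1_001.fastq.gz",
    "out_L001_R2_001.fastq.gz",
]

resolved = resolve_manifest(MANIFEST, FILES)
tsv = csv_to_tsv("sample_id,barcode\ns1,AAGCCCAATAAACCAC\n")
print(resolved)
print(tsv, end="")
```

Requiring exactly one match per pattern keeps an overly broad regex from silently picking up the wrong FASTQ; a real implementation might prefer a different policy.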
Made #98 as a follow up
I think this was addressed quite a while ago and thus could be closed :)
Description of feature
fqtk is a Rust rewrite of the fgbio DemuxFastqs tool.