nf-core / demultiplex

Demultiplexing pipeline for sequencing data
https://nf-co.re/demultiplex
MIT License

For sample demultiplexing from FASTQs, add fqtk. #87

Closed edmundmiller closed 3 weeks ago

edmundmiller commented 1 year ago

Description of feature

fqtk is a Rust rewrite of fgbio's DemuxFastqs tool.

nh13 commented 1 year ago

CC: @tfenne @natproach

edmundmiller commented 1 year ago

@nh13 @sam-white04 Some thoughts on the samplesheet. I don't think using params will be very "nf-core", and it causes fqtk to have its own special path that isn't like the others. nf-core has moved away from that with the DSL2 pipelines. I'm just the messenger.

So to clarify, I'm against:

nextflow run nf-core/demultiplex --fastq_files out_L001_I1_001.fastq.gz out_L001_I2_001.fastq.gz out_L001_R1_001.fastq.gz out_L001_R2_001.fastq.gz --read_structure 8B 8B 150T 150T --demultiplexer fqtk

And instead I'd prefer the usual --input samplesheet.csv. This allows users to run multiple samples and has fewer footguns.

I see two different options for a samplesheet and a third if you just want to use the native fqtk samplesheets.

  1. All in one line
    sample,fastq,read_structure
    test,out_L001_I1_001.fastq.gz out_L001_I2_001.fastq.gz out_L001_R1_001.fastq.gz out_L001_R2_001.fastq.gz,8B 8B 150T 150T
    somesample,somesample/out_L001_I1_001.fastq.gz somesample/out_L001_I2_001.fastq.gz somesample/out_L001_R1_001.fastq.gz somesample/out_L001_R2_001.fastq.gz,8B 8B 150T 150T
  2. A line per fastq (I think this will have the fewest user errors)
    sample,fastq,read_structure
    test_1,out_L001_I1_001.fastq.gz,8B
    test_2,out_L001_I2_001.fastq.gz,8B
    test_3,out_L001_R1_001.fastq.gz,150T
    test_4,out_L001_R2_001.fastq.gz,150T
    somesample_1,somesample/out_L001_I1_001.fastq.gz,8B
    somesample_2,somesample/out_L001_I2_001.fastq.gz,8B
    somesample_3,somesample/out_L001_R1_001.fastq.gz,150T
    somesample_4,somesample/out_L001_R2_001.fastq.gz,150T
  3. Use the native fqtk samplesheet handling (I think I saw it has that functionality?). But I think some of the issues will be around getting the relative file paths right and whatnot. It might be painful. This could also pass the FASTQ directory or something, just spitballing.
    sample,fqtk_samplesheet
    test,test_samplesheet.csv
    somesample,somesample/samplesheet.csv

sam-white04 commented 1 year ago

@Emiller88 I am happy to rework the current workflow for fqtk/main.nf so that read structures and file names are provided via a sample sheet. However, I am a bit confused about what you mean by the native fqtk sample sheet. fqtk does take a sample sheet, but that sample sheet has only two columns, 'sample-id' and 'barcode'. The demultiplex pipeline also takes a sample sheet as input. So I see two options:

1) Add a column to the sample sheet already passed to demultiplex, so that it may look like this for fqtk:

flowcell,samplesheet,read_structure_manifest,lane,run_dir
DDMMYY_SERIAL_NUMBER_FC,/path/to/samplesheet.tsv,/path/to/read_structure_manifest.csv,,path/to/fastq.tar.gz

2) Adapt the existing samplesheet from what it is currently (sample id, barcode) to the example shown below. Columns 3+4 could be parsed out of this file, and columns 1+2 could remain untouched:

sample_id   barcode        fastq_file_name    read_structure
s1  AAGCCCAATAAACCAC        out_L001_I1_001.fastq.gz        8B
s2  TCTGACTGGCCGAATA        out_L001_I2_001.fastq.gz        8B
s3  GGGATATAGGCAACGA        out_L001_R1_001.fastq.gz        150T
s4  CATGTGCGGCGACCCT        out_L001_R2_001.fastq.gz        150T
s5  TGCGACAGTGACGCTT
s6  TCGCCGTTGCCTAAAC
s7  CTATTTGAAGGAGTCT
s8  AGCAGCCGCAGTAAGG
s9  CACAATACCTCGTCCG
s10 TGTTACCAGACCAAAC
s11 AAGACGTCCTCTTCAA
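
As a rough illustration of what "parsing columns 3+4 out of this file" could involve, here is a minimal Python sketch. It assumes the sheet is tab-separated with the header names shown above; the function name `split_combined_sheet` is hypothetical and is not part of the pipeline or of fqtk.

```python
import csv
import io


def split_combined_sheet(tsv_text):
    """Split a combined sheet into (sample_id, barcode) pairs and
    (fastq_file_name, read_structure) pairs.

    Assumes tab-separated input with the four headers from the example.
    Rows that only carry sample_id and barcode are valid; their missing
    columns are read as None and skipped for the FASTQ list.
    """
    samples, fastqs = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        samples.append((row["sample_id"], row["barcode"]))
        if row.get("fastq_file_name"):
            fastqs.append((row["fastq_file_name"], row["read_structure"]))
    return samples, fastqs


example = (
    "sample_id\tbarcode\tfastq_file_name\tread_structure\n"
    "s1\tAAGCCCAATAAACCAC\tout_L001_I1_001.fastq.gz\t8B\n"
    "s5\tTGCGACAGTGACGCTT\n"
)
samples, fastqs = split_combined_sheet(example)
```

Note that the sketch has to treat the two halves of each row as unrelated data: the FASTQ on the s1 row has no special connection to sample s1.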

@Emiller88 & @nh13 please let me know what you think (I hope my examples above are clear)

-Sam

edmundmiller commented 1 year ago

I kinda like the samplesheet serving a dual purpose. Would it be possible for it to be a CSV for consistency? I guess we can parse it into the TSV in the pipeline and wrangle the data in whatever way.

sam-white04 commented 1 year ago

@Emiller88 just to confirm, you like option 2, correct? And I am not sure if it can be a csv. Let me get back to you on that.

nh13 commented 1 year ago

I think we're conflating a few things.

  1. A sample sheet should list the metadata per sample. For example, the sample name, identifier, sample barcode and so on. For demux tools, it specifies the minimal information needed to identify samples (with a barcode) and name the outputs (sample id).
  2. The read structure and set of FASTQs is per-sequencing run (or per lane for those that have lanes). I suppose one could prepare samples with different read structures and pool them for sequencing, but that's not supported by most demux'ing software, so let's ignore it. I think we do not want to call this a "sample sheet", since it's not about samples, it's about the demultiplexing setup.

I think the nf-core/demultiplex documentation is extremely confusing, since this pipeline uses "sample sheet" twice: once for a file that specifies each flowcell/experiment/lane, and then again for the actual sample sheet that is referenced in the first CSV/TSV file.

@sam-white04 your option (2) above tries to have both sample information and run information in one CSV/TSV. Trying to use (1) to store information about (2) is, in my opinion, confusing. (1) has N rows (one per sample), whereas (2) could be a TSV/CSV with one row per sequencing run (or lane, or set of FASTQs to demux). That's super confusing and error-prone.

I think the top-level "sample sheet" (I'd rename it "demux meta" or "demux setup" or something like that), which contains the per-run/flowcell information, can have two columns:

  a. a column with space-separated FASTQ paths
  b. a column with space-separated read structures

I think therefore I am agreeing with @Emiller88's all-one-line option.

I'd be OK with a line per FASTQ, but the first column should not be sample but instead flowcell, and have the same value for each related FASTQ.
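
To make the line-per-FASTQ variant concrete, here is a minimal Python sketch that collapses such rows into one record per flowcell, which is effectively the all-in-one-line form. The column names `flowcell`, `fastq` and `read_structure` and the function name `group_by_flowcell` are assumptions for illustration, not an agreed schema.

```python
import csv
import io
from collections import defaultdict


def group_by_flowcell(csv_text):
    """Collapse one-row-per-FASTQ records into one record per flowcell.

    Row order is preserved, so the i-th read structure still belongs
    to the i-th FASTQ of that flowcell.
    """
    grouped = defaultdict(lambda: {"fastqs": [], "read_structures": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = grouped[row["flowcell"]]
        entry["fastqs"].append(row["fastq"])
        entry["read_structures"].append(row["read_structure"])
    return dict(grouped)


example = """\
flowcell,fastq,read_structure
fc1,out_L001_I1_001.fastq.gz,8B
fc1,out_L001_I2_001.fastq.gz,8B
fc1,out_L001_R1_001.fastq.gz,150T
fc1,out_L001_R2_001.fastq.gz,150T
"""
runs = group_by_flowcell(example)
```

Keying on the flowcell rather than on a sample name keeps the FASTQs of one run together without implying that each FASTQ is its own sample.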

nh13 commented 1 year ago

Here's what I am thinking.

sample_sheet.csv (aka "the per-flowcell actual samplesheet", fqtk needs TSV, but that's easily converted from the CSV)

sample_id,barcode
s1,AAGCCCAATAAACCAC
s2,TCTGACTGGCCGAATA
s3,GGGATATAGGCAACGA
s4,CATGTGCGGCGACCCT
s5,TGCGACAGTGACGCTT
s6,TCGCCGTTGCCTAAAC
s7,CTATTTGAAGGAGTCT
s8,AGCAGCCGCAGTAAGG
s9,CACAATACCTCGTCCG
s10,TGTTACCAGACCAAAC
s11,AAGACGTCCTCTTCAA
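
The CSV-to-TSV conversion mentioned above can be sketched in a few lines of Python using only the standard library. The function name `csv_to_tsv` is hypothetical, and the sketch assumes the header names are already the ones fqtk expects, since it copies them through unchanged.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Re-emit CSV text as tab-separated text, leaving all values as-is."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(csv_text)))
    return out.getvalue()


tsv = csv_to_tsv("sample_id,barcode\ns1,AAGCCCAATAAACCAC\ns2,TCTGACTGGCCGAATA\n")
```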

per_flowcell_manifest.csv (new metadata file)

fastq,read_structure
out_L001_I1_001.fastq.gz,8B
out_L001_I2_001.fastq.gz,8B
out_L001_R1_001.fastq.gz,150T
out_L001_R2_001.fastq.gz,150T
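
A manifest like this maps fairly directly onto a demux invocation. The Python sketch below builds an argument list from it; the helper name `fqtk_demux_args` is hypothetical, and the flag names (`--inputs`, `--read-structures`, `--sample-metadata`, `--output`) are taken from my recollection of the fqtk README, so please check them against `fqtk demux --help` before relying on them.

```python
import csv
import io


def fqtk_demux_args(manifest_csv, sample_metadata, output_dir):
    """Build an fqtk demux argument list from a per-flowcell manifest.

    The flag names are an assumption (see the note above); FASTQs and
    read structures are passed in matching manifest order.
    """
    rows = list(csv.DictReader(io.StringIO(manifest_csv)))
    fastqs = [row["fastq"] for row in rows]
    structures = [row["read_structure"] for row in rows]
    return [
        "fqtk", "demux",
        "--inputs", *fastqs,
        "--read-structures", *structures,
        "--sample-metadata", sample_metadata,
        "--output", output_dir,
    ]


manifest = """\
fastq,read_structure
out_L001_I1_001.fastq.gz,8B
out_L001_R1_001.fastq.gz,150T
"""
args = fqtk_demux_args(manifest, "sample_sheet.tsv", "demuxed")
```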

demultiplex_manifest.csv (aka "the full samplesheet", passed into the pipeline)

flowcell,samplesheet,manifest,run_dir,lane
fc1,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc1.per_flowcell_manifest.csv,/path/to/fc1_dir.tar.gz,
fc2,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc2.per_flowcell_manifest.csv,/path/to/fc2_dir.tar.gz,
fc3,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc3.per_flowcell_manifest.csv,/path/to/fc3_dir.tar.gz,
fc4,/path/to/fc1.sample_sample_sheet.csv,/path/to/fc4.per_flowcell_manifest.csv,/path/to/fc4_dir.tar.gz,

The per_flowcell_manifest.csv would specify the name of each FASTQ that is untarred, per flowcell. I suppose we could also provide a regex there too (e.g. .*_L001_I1.*.fastq.gz instead of out_L001_I1_001.fastq.gz).
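
If the regex idea were adopted, resolving a manifest entry against the untarred file names could look roughly like the Python sketch below. The function name `resolve_fastq` and the rule that a pattern must match exactly one file are assumptions of this sketch, not something decided in this thread.

```python
import re


def resolve_fastq(pattern, filenames):
    """Return the single file name that fully matches the pattern.

    Raises ValueError when zero or several files match, so an
    ambiguous manifest entry fails early instead of silently
    picking one file.
    """
    matches = [name for name in filenames if re.fullmatch(pattern, name)]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} matched {len(matches)} files, expected 1")
    return matches[0]


untarred = [
    "out_L001_I1_001.fastq.gz",
    "out_L001_I2_001.fastq.gz",
    "out_L001_R1_001.fastq.gz",
    "out_L001_R2_001.fastq.gz",
]
index1 = resolve_fastq(r".*_L001_I1.*.fastq.gz", untarred)
```

The unescaped dots in the example pattern match any character, not just a literal ".", which is harmless here but worth keeping in mind if patterns get stricter.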

edmundmiller commented 1 year ago

Made #98 as a follow up

apeltzer commented 3 weeks ago

I think this was addressed quite a while ago and thus could be closed :)