Switch to a CSV/TSV based input

nf-core / exoseq

Please consider using/contributing to https://github.com/nf-core/sarek

http://nf-co.re

MIT License

Switch to a CSV/TSV based input #18

Open marchoeppner opened 6 years ago

marchoeppner commented 6 years ago

To pull in relevant metadata, I suggest using a CSV/TSV file as the default input format rather than a folder full of FastQ files.

Suggested format would be:

IndivID;SampleID;libraryID;rgID;rgPU;platform;platform_model;Center;Date;R1;R2

Peter;Germline;G00077-L2;HGJJMBBXX.3.G00077-L2;HGJJMBBXX.3.TCCTGAGC+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R2_001.fastq.gz

Peter;Tumor;G00078-L2;HGJJMBBXX.3.G00078-L2;HGJJMBBXX.3.GGACTCCT+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R2_001.fastq.gz
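A minimal sketch of how a sheet in this format could be read and turned into read-group information, assuming the semicolon delimiter and column names shown above. The function names, the mapping of columns onto SAM `@RG` tags (for instance building `SM` from `IndivID` and `SampleID`), and the shortened `/data/...` paths in the example are assumptions for illustration, not part of the proposal.

```python
import csv
import io

# Column names taken from the suggested header line.
FIELDS = ["IndivID", "SampleID", "libraryID", "rgID", "rgPU", "platform",
          "platform_model", "Center", "Date", "R1", "R2"]


def read_samplesheet(handle):
    """Parse a semicolon-delimited samplesheet into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=";")
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    return [dict(row) for row in reader]


def read_group_line(row):
    """Build a SAM @RG header line from one row (the tag mapping is an assumption)."""
    tags = [
        ("ID", row["rgID"]),
        ("SM", f"{row['IndivID']}_{row['SampleID']}"),
        ("LB", row["libraryID"]),
        ("PU", row["rgPU"]),
        ("PL", row["platform"]),
        ("PM", row["platform_model"]),
        ("CN", row["Center"]),
        ("DT", row["Date"]),
    ]
    return "\t".join(["@RG"] + [f"{key}:{value}" for key, value in tags])


# Example row from the issue, with hypothetical shortened file paths.
EXAMPLE = "\n".join([
    ";".join(FIELDS),
    "Peter;Germline;G00077-L2;HGJJMBBXX.3.G00077-L2;"
    "HGJJMBBXX.3.TCCTGAGC+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;"
    "/data/G00077-L2_S20_L003_R1_001.fastq.gz;"
    "/data/G00077-L2_S20_L003_R2_001.fastq.gz",
])

if __name__ == "__main__":
    for sample in read_samplesheet(io.StringIO(EXAMPLE)):
        print(read_group_line(sample))
```

Validating the header up front means a sheet with a missing column fails immediately with a clear message rather than partway through a run.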

ewels commented 6 years ago

Or a nextflow params file? https://github.com/nextflow-io/nextflow/issues/208

CSV/TSV is nice and may be necessary here, but I'm also keen for nf-core pipelines to work with minimal input if possible, e.g. still working for someone who turns up with "I have a bunch of FastQ files and know nothing about them." If the pipeline fails because the user doesn't know the platform_model, then that's not ideal.

Of course, that's not to say we can't have both; that would be ideal: work with minimal requirements, but also support nice, verbose, well-organised meta files.

marchoeppner commented 6 years ago

For these cases, we actually use this (pardon the crumminess of the code):

https://git.ikmb.uni-kiel.de/bfx-core/NF-diagnostics-exome/blob/master/bin/samplesheet_from_folder.rb

It builds a valid input CSV from a folder full of FastQs, with actual values where they can be extracted from the FastQ files and placeholders / best guesses for the other fields. This way you could at least nudge people towards better record keeping ;)
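The general idea can be sketched as follows. This is not the linked Ruby script: it is an illustration that assumes Illumina-style file names (`<library>_S<n>_L<lane>_R1_001.fastq.gz`) and Casava 1.8+ read headers, and it composes `rgID` and `rgPU` as `flowcell.lane.library` and `flowcell.lane.index` to match the example rows above. All function names and the `unknown` placeholder value are hypothetical.

```python
import gzip
import re
from pathlib import Path

R1_PATTERN = re.compile(r"^(?P<library>.+)_S\d+_L\d{3}_R1_001\.fastq\.gz$")
PLACEHOLDER = "unknown"  # written for fields that cannot be extracted


def parse_illumina_header(line):
    """Extract flowcell, lane and index from a Casava 1.8+ style read header."""
    name, _, comment = line.strip().lstrip("@").partition(" ")
    parts = name.split(":")
    if len(parts) < 7:
        return None  # not a header layout this sketch understands
    index = comment.split(":")[-1] if comment else ""
    return {"flowcell": parts[2], "lane": parts[3], "index": index or PLACEHOLDER}


def first_header(path):
    """Return the first line of a gzipped FastQ file."""
    with gzip.open(path, "rt") as handle:
        return handle.readline()


def rows_from_folder(folder):
    """Build one samplesheet row per R1/R2 pair found in the folder."""
    rows = []
    for r1 in sorted(Path(folder).glob("*_R1_001.fastq.gz")):
        match = R1_PATTERN.match(r1.name)
        r2 = r1.with_name(r1.name.replace("_R1_001.fastq.gz", "_R2_001.fastq.gz"))
        if match is None or not r2.exists():
            continue  # skip unpaired or unexpectedly named files
        library = match.group("library")
        info = parse_illumina_header(first_header(r1))
        if info:
            prefix = f"{info['flowcell']}.{info['lane']}"
            rg_id, rg_pu = f"{prefix}.{library}", f"{prefix}.{info['index']}"
        else:
            rg_id, rg_pu = library, PLACEHOLDER
        rows.append({
            "IndivID": PLACEHOLDER, "SampleID": PLACEHOLDER,
            "libraryID": library, "rgID": rg_id, "rgPU": rg_pu,
            "platform": "Illumina",  # best guess given the naming scheme
            "platform_model": PLACEHOLDER, "Center": PLACEHOLDER,
            "Date": PLACEHOLDER, "R1": str(r1), "R2": str(r2),
        })
    return rows
```

Fields such as `IndivID` or `Center` cannot be recovered from the files themselves, so they are left as placeholders for the user to fill in.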

But two mutually exclusive input channels might also work.

maxulysse commented 6 years ago

We have a similar idea that we use for germline samples: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L738-L766

ewels commented 6 years ago

Nice! I guess we could embed such a script into the workflow so that it works with a glob of FastQs or a CSV file..? That would be ideal.
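One way to accept either kind of input through a single parameter is to decide by file extension. The sketch below is only an illustration in Python rather than Nextflow; the function name `resolve_input` and the rule "`.csv`/`.tsv` means samplesheet, anything else is treated as a glob" are assumptions.

```python
import glob
from pathlib import Path

SHEET_SUFFIXES = {".csv", ".tsv"}


def resolve_input(spec):
    """Classify an input spec as a samplesheet or a FastQ glob."""
    if Path(spec).suffix.lower() in SHEET_SUFFIXES:
        return {"mode": "samplesheet", "paths": [spec]}
    matches = sorted(glob.glob(spec))
    if not matches:
        raise FileNotFoundError(f"no files match {spec!r}")
    return {"mode": "fastq_glob", "paths": matches}
```

A glob result could then be passed through a folder-to-samplesheet step, so both routes end up producing the same row structure downstream.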

marchoeppner commented 6 years ago

My vote goes to the "Sarek" approach; it should be fairly straightforward to just steal the code ;)

apeltzer commented 6 years ago

Same here