Input CSV structure - Githubissues

nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data

https://nf-co.re/taxprofiler

MIT License


Input CSV structure #15

Closed jfy133 closed 2 years ago

jfy133 commented 2 years ago

Description of feature

We will need to support both FASTQ (paired- or single-end) and FASTA input files, as the latter seems common.

I propose something like this:

sample_id,run_id,format,r1,r2
ERS0000001,ERR000008,fasta,/path/to/data1.fa,NA
ERS0000001,ERR000009,fastq,/path/to/data2_1.fq.gz,/path/to/data2_2.fq.gz
ERS0000001,ERR000009,fastq,/path/to/data3_1.fq.gz,NA

i.e.

UPDATE:

sample	run_accession	instrument_platform	fastq_1	fastq_2	fasta
ERS0000001	ERR000008	OXFORD_NANOPORE	NA	NA	/path/to/data1.fa
ERS0000001	ERR000009	ILLUMINA	/path/to/data2_1.fq.gz	/path/to/data2_2.fq.gz	NA
ERS0000001	ERR000009	ILLUMINA	/path/to/data3_1.fq.gz	NA	NA

Check instrument_platform against: https://www.ebi.ac.uk/ena/portal/api/controlledVocab?field=instrument_platform

Midnighter commented 2 years ago

I would use the output of https://github.com/nf-core/fetchngs as the input format, so it keeps generally the same columns as your proposal, just under different headers. I would drop the format column. If needed, the format can be inferred from the filenames.

jfy133 commented 2 years ago

OK, I agree; that's actually what I based this on. Do you have an example of a fetchngs sheet?

jfy133 commented 2 years ago

I ran it once and the only samplesheet I got was filled with millions of columns which I didn't like

jfy133 commented 2 years ago

Never mind, I saw this:

 --nf_core_pipeline  [string]  Name of supported nf-core pipeline e.g. 'rnaseq'. A samplesheet for direct use with the pipeline will be created with the appropriate columns.

so we can customise it I guess

Midnighter commented 2 years ago

Yeah, it adds a lot of columns but we can pick the ones we need. I do think it's nice, though, if the pipeline keeps all input columns. This makes it easier for users to add any kind of meta information that they would like. The minimal information, in my opinion, is:

sample,fastq_1,fastq_2
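To illustrate the idea of requiring only a minimal set of columns while keeping everything else, here is a rough Python sketch. It is not taxprofiler or fetchngs code. The function name `read_samplesheet` and the extra `collection_site` column are hypothetical, and the required headers are simply the three listed above.

```python
import csv
import io

# Assumption: these three headers are the minimal set named above.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(handle):
    """Return one dict per row, keeping every column found in the sheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"samplesheet is missing required columns: {missing}")
    # Extra columns pass through untouched so users can attach their own metadata.
    return [dict(row) for row in reader]


# Hypothetical sheet with one user-defined metadata column.
example = io.StringIO(
    "sample,fastq_1,fastq_2,collection_site\n"
    "ERS0000001,/path/to/data2_1.fq.gz,/path/to/data2_2.fq.gz,lake\n"
)
rows = read_samplesheet(example)
```

Each row keeps the user's `collection_site` value next to the required fields, so downstream steps can carry it along without knowing about it in advance.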

maxulysse commented 2 years ago

@jfy133 I think this is csv and not tsv

Midnighter commented 2 years ago

CSV seems to be the standard in nf-core pipelines. In Python it's quite easy to allow both, but I think that's harder in Nextflow.

maxulysse commented 2 years ago

not at all, you have the splitCsv operator: https://www.nextflow.io/docs/latest/operator.html#splitcsv

Midnighter commented 2 years ago

Yes, but it cannot "sniff" if it's CSV or TSV by itself, so you either need to hard code it, look at the file extension, or let the user determine it.
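For comparison, this is roughly what the "sniffing" looks like on the Python side, using the standard library's `csv.Sniffer`. It is only a sketch: the helper name `detect_delimiter` is made up, it looks at the header line alone, and falling back to a comma when detection fails is an assumption.

```python
import csv


def detect_delimiter(header_line, candidates=",\t"):
    """Guess whether a samplesheet header is comma- or tab-separated."""
    try:
        # Restrict the sniffer to the two delimiters under discussion.
        return csv.Sniffer().sniff(header_line, delimiters=candidates).delimiter
    except csv.Error:
        # Assumption: default to comma when nothing can be detected,
        # for example with a single-column sheet.
        return ","
```

The returned character can then be passed as `delimiter=` to `csv.DictReader`. In Nextflow the equivalent choice would have to be made up front, for instance via the `sep` option of `splitCsv`.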

maxulysse commented 2 years ago

Oh I see what you mean, then yes you're right. And as you said, csv is the standard in DSL2 nf-core pipelines.

jfy133 commented 2 years ago

Sorry yup - eager is TSV :sweat_smile:

maxulysse commented 2 years ago

Sarek was TSV too, we're now csv

jfy133 commented 2 years ago

Don't abandon me!

Back on topic:

Accept: fastq, fq, fasta, fna, fa, plus each of these with .gz
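As a quick sketch of how that extension list could be applied (Python, purely illustrative; `classify_reads_file` is a hypothetical helper, and matching is case-sensitive by assumption):

```python
# Extensions from the list above; each may also carry a trailing ".gz".
FASTQ_EXTENSIONS = (".fastq", ".fq")
FASTA_EXTENSIONS = (".fasta", ".fna", ".fa")


def classify_reads_file(path):
    """Return "fastq", "fasta", or None based only on the filename suffix."""
    name = path
    if name.endswith(".gz"):
        # Strip the compression suffix before checking the format suffix.
        name = name[: -len(".gz")]
    if name.endswith(FASTQ_EXTENSIONS):
        return "fastq"
    if name.endswith(FASTA_EXTENSIONS):
        return "fasta"
    return None
```

This only inspects the filename, so a mislabelled file would still pass; checking the file contents is a separate question.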

jfy133 commented 2 years ago

@maxibor and I decided to go for an explicit fasta column, as this means fastq_1 and fastq_2 can be taken directly from fetchngs

jfy133 commented 2 years ago

should change platform to specific machine, as we need 2/4 colour chemistry info

Midnighter commented 2 years ago

should change platform to specific machine, as we need 2/4 colour chemistry info

Can you provide some more context, please, why this is needed?

jfy133 commented 2 years ago

https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/PolyG_Trimming_fDG.htm

jfy133 commented 2 years ago

@maxibor did you add a check that you can't supply FASTA and FASTQ on the same line?
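A minimal sketch of what such a check might look like, written in Python for illustration only and not taken from the pipeline. The column names follow the updated table in this issue; the helper names `has_value` and `check_row`, and treating "NA" or an empty cell as "not supplied", are assumptions.

```python
# Assumption: "NA" or an empty string marks an unused cell, as in the example table.
EMPTY_VALUES = {"", "NA"}


def has_value(row, column):
    """True when the cell holds something other than an empty marker."""
    return (row.get(column) or "").strip() not in EMPTY_VALUES


def check_row(row):
    """Raise ValueError when a single row supplies both FASTQ and FASTA."""
    has_fastq = has_value(row, "fastq_1") or has_value(row, "fastq_2")
    has_fasta = has_value(row, "fasta")
    if has_fastq and has_fasta:
        raise ValueError(
            f"{row.get('sample')}: supply either FASTQ or FASTA on a line, not both"
        )
```

Calling `check_row` on each parsed samplesheet row would stop at the first offending line, and the error message names the sample so the user can find it.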

jfy133 commented 2 years ago

I think this is set for now; we can reopen if more issues crop up