nf-core / scrnaseq

A single-cell RNAseq pipeline for 10X genomics data
https://nf-co.re/scrnaseq
MIT License

Avoid error on unknown headers in input.csv #302

Closed zeehio closed 3 months ago

zeehio commented 4 months ago

Description of feature

The nf-co.re/rnaseq pipeline accepts and ignores any extra column in input.csv that is not required by the pipeline. This is useful because I can reuse the input.csv or include additional information I want to use in downstream analyses, without having to generate a specific input.csv just for running the pipeline.

This scrnaseq pipeline is much more strict, giving an error when any unknown column is found.

I would prefer the scrnaseq pipeline to follow the rnaseq behaviour, in line with the robustness principle: "be conservative in what you send, be liberal in what you accept".
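
To make the two behaviours concrete, here is a minimal Python sketch of a samplesheet reader that can either ignore or reject unknown columns. It is not the code of either pipeline: the function name read_samplesheet, the REQUIRED_COLUMNS tuple, the strict flag, and the "tissue" column are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical set of columns the pipeline actually needs.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(handle, strict=False):
    """Return one dict per row, keeping only the required columns.

    With strict=True, unknown columns raise an error (the strict behaviour
    described above); with strict=False they are silently ignored.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    unknown = [c for c in header if c not in REQUIRED_COLUMNS]
    if strict and unknown:
        raise ValueError(f"Unknown columns: {unknown}")
    return [{c: row[c] for c in REQUIRED_COLUMNS} for row in reader]


# Example sheet with one extra column ("tissue") the pipeline does not need.
SHEET = (
    '"sample","fastq_1","fastq_2","tissue"\n'
    '"id1","/path/to/r1.fastq.gz","/path/to/r2.fastq.gz","liver"\n'
)
rows = read_samplesheet(io.StringIO(SHEET))
```

In the lenient mode the extra column is dropped before the rows reach the pipeline, so the same input.csv can carry metadata for downstream analyses.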

Is there any specific reason why you are not as liberal in accepting unknown columns in the input.csv file?

Thanks!

grst commented 3 months ago

Hi,

This issue should be fixed in the development version. You can give it a try with nextflow run ... -r dev. If it doesn't work, please let me know!

zeehio commented 2 months ago

Hi @grst, I have now been able to test the dev pipeline. Thanks for the update. Unfortunately, I am still facing validation issues:

I am using a single-end dataset, where each sample has a fastq_1 file but no fastq_2.

The input.csv file is similar to:

"sample","fastq_1","fastq_2","strandedness",...
"id1","/path/to/fastq/sample1.fastq.gz","","auto",...

Please note how the fastq_2 column contains empty values.

I'm getting an error validating the 'input' again:

ERROR ~ ERROR: Validation of 'input' file failed!

 -- Check '.nextflow.log' file for details
The following errors have been detected:

* -- Entry 1: Missing required value: fastq_2
* -- Entry 2: Missing required value: fastq_2

Having an empty fastq_2 seems correct to me when I check the code on the master branch. There, if fastq_2 is empty, the single_end variable is set to "1". You can see this below (specifically line 184, in the "not fastq_2" check):

https://github.com/nf-core/scrnaseq/blob/90cb6a48155248286c85395c53a201c3a31b2258/bin/check_samplesheet.py#L181-L187
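
For readers who do not follow the link, the logic described above can be sketched roughly as follows. This is a paraphrase of the behaviour reported in this comment, not a copy of check_samplesheet.py; the function name classify_row and its error message are hypothetical.

```python
def classify_row(sample, fastq_1, fastq_2):
    """Return (single_end, fastq_1, fastq_2), with single_end as "0" or "1".

    An empty fastq_2 is accepted and marks the row as single-end.
    """
    if not sample or not fastq_1:
        raise ValueError("Each row needs a sample name and a fastq_1 path")
    # Empty string (or None) for fastq_2 means there is no second read file.
    single_end = "1" if not fastq_2 else "0"
    return single_end, fastq_1, fastq_2


# The row from the example input.csv above, with an empty fastq_2.
result = classify_row("id1", "/path/to/fastq/sample1.fastq.gz", "")
```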

However, on the dev branch, the input schema used for validation requires the fastq_2 value to exist and be non-empty:

https://github.com/nf-core/scrnaseq/blob/10434418cfa345f07910ebf43d0a3db4b71f5be2/assets/schema_input.json#L16-L30

I'd like the scrnaseq pipeline to accept an input file whose fastq_2 column is filled with "" (empty strings), since that is what the nf-core/fetchngs pipeline generates when downloading datasets.

Thanks, and sorry for the delayed reply.
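
As an illustration of the requested rule, a validator could treat fastq_2 as valid when it is either empty or a FASTQ path. The sketch below is an assumption-laden example: the function name fastq_2_is_valid and the file-name pattern are hypothetical and are not taken from the scrnaseq schema.

```python
import re

# Assumed file-name pattern; the real schema may use a different one.
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def fastq_2_is_valid(value):
    """Accept either an empty string or a path ending in .fq.gz or .fastq.gz."""
    return value == "" or FASTQ_PATTERN.match(value) is not None


# An empty value, as in the example input.csv, passes the check.
empty_ok = fastq_2_is_valid("")
```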

zeehio commented 2 months ago

Just for further ideas, it may be good to check out the rnaseq pipeline:

https://github.com/nf-core/rnaseq/blob/b89fac32650aacc86fcda9ee77e00612a1d77066/assets/schema_input.json#L16-L46

grst commented 2 months ago

The check is done on purpose. All protocols supported by this pipeline use paired-end data, where R1 contains the UMI/barcode and R2 the actual sequence.
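
To illustrate why both reads are needed, the sketch below splits an R1 sequence into a cell barcode and a UMI. The lengths (16 bp barcode followed by a 12 bp UMI) are an assumption matching one common 10x Chromium layout; other protocols and chemistry versions differ, and the function name split_r1 is hypothetical.

```python
def split_r1(read_1, barcode_len=16, umi_len=12):
    """Split an R1 sequence into (cell_barcode, umi).

    The default lengths are an assumption, not a property of every protocol.
    """
    if len(read_1) < barcode_len + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI")
    barcode = read_1[:barcode_len]
    umi = read_1[barcode_len:barcode_len + umi_len]
    return barcode, umi


# Made-up R1 sequence: 16 bp barcode followed by a 12 bp UMI.
barcode, umi = split_r1("AAACCCAAGAAACACT" + "GGGTTTCCCAAA")
```

With such a layout, R1 carries no transcript sequence, so a sample without R2 has nothing to align.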

What kind of single-cell data are you dealing with?