nf-core / demultiplex

Demultiplexing pipeline for sequencing data
https://nf-co.re/demultiplex
MIT License
41 stars 36 forks source link

Add Illumina Samplesheet Validator #222

Closed grst closed 4 weeks ago

grst commented 1 month ago

Description of feature

I suggest to add a process that checks Illumina Samplesheets for consistency before starting bclconvert/bcl2fastq. This ensures that errors are caught early with useful error messages rather than demultiplexing failing at some later point.

I found samshee which is relativly new, but looks quite nice. It's available from PyPI, so creating a container via wave shouldn't be an issue.

Specifications:

CC @apeltzer

nschcolnicov commented 1 month ago

When testing the new module I saw that the validator would fail for some of the samplesheets we have in our test cases. For example we have the following samplesheet in the test profile: `[Header]

[Reads] 110

[Settings]

[Data] Sample_ID,Sample_Name,Description,Sample_Project Sample1,Sample1,, `

This raises an error in the tool because samshee a parse_settings function (from https://github.com/lit-regensburg/samshee/blob/main/src/samshee/sectionedsheet.py) which expects key-value pairs in the Settings section. By default, sections that are named "header" or "reads", or are suffixed "settings" are assumed to be settings sections. So, Reads section is being parsed as a Settings section, but it contains a single integer value rather than key-value pairs, leading to the validation error.

@grst Not sure if this is something we should address by raising an issue in the samshee tool or we could ignore this since it only happens for our mock samplesheets, while the test profiles that contain actual samplesheets like test_full, or test_two_lanes pass the validation without any issues.

grst commented 1 month ago

Some of our internal samplesheets look like this, they apparently follow an older schema: image

Does the error still occur when the mock samplesheet reads

[Reads],,,
110,,,

because technically it should be valid csv format with the appropriate number of commata.

If this doesn't help, it would be great to open a ticket in samshee an clarify if this is an issue with samshee or just an older samplesheet schema that's not supposed to be supported.

Since the validation is optional, I would anyway go ahead an implement it in the pipeline and adapt the mock samplesheets if necessary.

nschcolnicov commented 1 month ago

@grst I submitted the issue: https://github.com/lit-regensburg/samshee/issues/1 The validation is done by default in the implementation from my branch, so I had to add the parameter to skip it in all of our tests except the "test_two_lanes" one. I will wait to see what the developer of samshee says, and depending on that he says, I will update the samplesheets for the rest of the test profiles or leave them as they are. Alternatively I could modify the way the samshee tool is run, and have it be turned off by default. Regardless, I will address this on a separate PR, because the current one is already getting too messy. I also want to assess the demultiplexing test-datasets, to see if they are properly organized and if they are all in use.

apeltzer commented 4 weeks ago

Implemented in #234