jhuapl-bio / taxtriage

TaxTriage is a Nextflow workflow designed to agnostically identify and classify microbial organisms within short- or long-read metagenomic NGS data. This flexible tool was developed with various use-cases of mNGS in mind.
MIT License

Check samplesheet fix #9

Closed gunarsosis closed 1 year ago

gunarsosis commented 1 year ago

The check_samplesheet.py script may reject file names that contain more than one "." before the .fastq.gz extension.

Within the script, the _validate_pair function checks the user's samplesheet and confirms that both paired .fastq.gz files have matching extensions, using pathlib's Path and its .suffixes property. .suffixes splits the file name on "." and treats every part after the first as a suffix. If the sample name itself contains additional "." delimiters, the suffix lists of the paired reads can differ and the check fails.

For example, downsample..fastq.gz will be rejected by the pipeline: the script treats ['.', '.fastq', '.gz'] as its suffixes, and when it compares the suffix lists of the paired files they do not match because of differences in the file names. This raises the error "FASTQ pairs must have the same file extensions." and aborts the pipeline.
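A minimal sketch of the behaviour described above, using only pathlib. The file names are hypothetical and chosen for illustration; they are not taken from the pipeline or its test data.

```python
from pathlib import Path

# Hypothetical paired-read names whose sample name contains an extra "."
r1 = Path("sample.v1_R1.fastq.gz")
r2 = Path("sample.v1_R2.fastq.gz")

# Every part after the first "." is reported as a suffix,
# so the read label (R1/R2) ends up inside the first suffix.
print(r1.suffixes)  # ['.v1_R1', '.fastq', '.gz']
print(r2.suffixes)  # ['.v1_R2', '.fastq', '.gz']

# Comparing the full suffix lists fails even though both files are .fastq.gz
full_match = r1.suffixes == r2.suffixes
print(full_match)  # False
```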

Solution: In check_samplesheet.py, change the comparison on line 123 to the following: Path(row[self._first_col]).suffixes[-2:] == Path(row[self._second_col]).suffixes[-2:]

This selects and compares only the last two suffixes, which should always be the same: ".fastq" and ".gz". It allows much greater flexibility in sample naming. The fix was tested successfully on multiple samplesheets.
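The effect of the [-2:] slice can be sketched in isolation as follows. The helper name same_extensions and the file names are hypothetical; the real check lives in _validate_pair and reads the paths from the samplesheet row.

```python
from pathlib import Path


def same_extensions(first: str, second: str) -> bool:
    """Hypothetical helper: compare only the last two suffixes of each path."""
    return Path(first).suffixes[-2:] == Path(second).suffixes[-2:]


# Extra "." in the sample name no longer matters
ok_dotted = same_extensions("sample.v1_R1.fastq.gz", "sample.v1_R2.fastq.gz")

# Plain names still pass
ok_plain = same_extensions("sample_R1.fastq.gz", "sample_R2.fastq.gz")

# Genuinely different extensions are still rejected
mismatch = same_extensions("sample_R1.fastq.gz", "sample_R2.fq.gz")

print(ok_dotted, ok_plain, mismatch)  # True True False
```

One caveat of this approach: the slice assumes a two-part extension such as .fastq.gz, so an uncompressed file with a dotted sample name (for example sample.v1_R1.fastq) would still pull part of the sample name into the comparison.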

CHANGELOG.md has been updated to explain the change.