replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

Catch identical filenames #192

Open MarieLataretu opened 2 years ago

MarieLataretu commented 2 years ago

I suggest changing simpleName to baseName here: https://github.com/replikation/poreCov/blob/9ba98fe38d666508fa7dd0bd16d4accc5fe36a4b/poreCov.nf#L183 (and potentially elsewhere) to avoid problems with file names that contain more than one dot. simpleName cuts the name at the first dot, so files that differ only after that dot end up with the same sample name.

Alternatively, or in addition, a sanity check for identical file names would be good.


Context: https://www.nextflow.io/docs/latest/script.html#check-file-attributes

replikation commented 2 years ago

Maybe there is a way to just remove the ".fastq.gz" or ".fastq" suffix? With baseName only the last extension is dropped, so the .fastq remains in the sample names of gzipped files.

replikation commented 2 years ago

https://stackoverflow.com/questions/17676562/get-file-extension-for-special-cases-like-tar-gz

hoelzer commented 2 years ago

But then we should also cover .fq, .fq.gz, and so on. On the other hand, it's not the worst outcome if the sample names still carry the .fq extension, as long as the pipeline still runs through ;) That would only happen if we miss some weird file ending.
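
A minimal sketch of such a suffix removal in plain Groovy, assuming only the four endings mentioned in this thread (.fastq, .fq, and their .gz variants) need to be covered. The helper name sampleName is hypothetical and not part of poreCov; any other ending is left untouched, so an unexpected file type keeps its full name rather than breaking the run.

```groovy
// Hypothetical helper, not part of poreCov: derive a sample name by removing
// one known FASTQ suffix while leaving all other dots in the name untouched.
def sampleName(String fileName) {
    // Matches ".fastq" or ".fq", optionally followed by ".gz", only at the
    // end of the name; (?i) makes the match case-insensitive.
    return fileName.replaceFirst(/(?i)\.(fastq|fq)(\.gz)?$/, '')
}

assert sampleName('run1.barcode01.fastq.gz') == 'run1.barcode01'
assert sampleName('run1.barcode01.fq') == 'run1.barcode01'
assert sampleName('notes.txt') == 'notes.txt'   // unknown ending: unchanged
```

In the workflow this could stand in for the simpleName call, for example by mapping each input file to tuple(sampleName(file.name), file). That wiring is an assumption and has not been tested against poreCov.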

MarieLataretu commented 2 years ago

> because with basename the .fastq remains in the sample names

True, I hadn't thought about that.


Here is a code snippet for the sanity check:

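Channel
    .from('Hello','Hola','Ciao')
    .tap {all} // to conserve the original channel
    .collect() // gather all names into a single list
    .map{ it -> [it.size(), it.unique().size()]} // total count vs. count of distinct names
    .subscribe onNext: { 
        // both counts are equal only if no name occurs twice
        assert it[0] == it[1] : 'Identical sample names found'
    }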
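
The size comparison above only tells us that a duplicate exists. As a hedged variant in plain Groovy, the collected list could also be used to report which names collide. The helper duplicateNames is hypothetical and only illustrates the idea outside of a Nextflow channel.

```groovy
// Hypothetical helper: return every name that occurs more than once,
// in order of first appearance.
def duplicateNames(List<String> names) {
    return names.countBy { it }                        // map: name -> occurrences
                .findAll { name, count -> count > 1 }  // keep repeated names only
                .keySet()
                .toList()
}

def names = ['Hello', 'Hola', 'Ciao', 'Hola']
def dups = duplicateNames(names)
assert dups == ['Hola']
```

Inside the pipeline the same list could feed an error message after .collect(), so the user sees the offending sample names instead of a bare assertion failure.
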
replikation commented 1 year ago

ping @DataSpott