Closed SannaAb closed 10 months ago
Probably related to the file pattern in the publishing options.
Hrmm, not sure.
publishDir = [
path: { "${params.outdir}/krakenuniq/${meta.db_name}/" },
mode: params.publish_dir_mode,
pattern: '*.{txt,fastq.gz}'
]
should pick up anything with .fastq.gz
...
Maybe the # in the file name breaks the glob?
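A quick way to check that question outside the pipeline is a small Python sketch. It only approximates the glob semantics (brace expansion plus shell-style wildcards), on the assumption that the publishDir pattern behaves the same way for plain file names. The helper names `expand_braces` and `glob_matches` are made up for illustration.

```python
from fnmatch import fnmatchcase
import re


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of brace-free patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_matches(pattern, filename):
    """Approximate glob match: True if any expanded pattern matches the name."""
    return any(fnmatchcase(filename, p) for p in expand_braces(pattern))


# Pattern copied from the publishDir block quoted above.
publish_pattern = "*.{txt,fastq.gz}"

# File names copied from the work directory listings in this thread.
for name in [
    "ERR3201952.classified.fastq.gz",
    "ERX5474937_ERR5766181.classified#.fastq.gz",
    "ERX5474937_ERR5766181.unclassified#.fastq.gz",
]:
    print(name, glob_matches(publish_pattern, name))
```

Under these assumptions the # is an ordinary character for this pattern, so the publishDir glob on its own would not explain the missing files.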
I confirm the bug:
## The results directory
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l results/krakenuniq/db6/ER
ERR3201952.classified.fastq.gz ERX5474930_ERR5766174_1.krakenuniq.report.txt ERX5474936_ERR5766180_1.classified.fastq.gz
ERR3201952.krakenuniq.classified.txt ERX5474930_ERR5766174_1.unclassified.fastq.gz ERX5474936_ERR5766180_1.krakenuniq.classified.txt
ERR3201952.krakenuniq.report.txt ERX5474932_ERR5766176_B.krakenuniq.classified.txt ERX5474936_ERR5766180_1.krakenuniq.report.txt
ERR3201952.unclassified.fastq.gz ERX5474932_ERR5766176_B.krakenuniq.report.txt ERX5474936_ERR5766180_1.unclassified.fastq.gz
ERX5474930_ERR5766174_1.classified.fastq.gz ERX5474932_ERR5766176.krakenuniq.classified.txt ERX5474937_ERR5766181.krakenuniq.classified.txt
ERX5474930_ERR5766174_1.krakenuniq.classified.txt ERX5474932_ERR5766176.krakenuniq.report.txt ERX5474937_ERR5766181.krakenuniq.report.txt
## Paired end KrakenUniq process work dir
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l work/65/95e1322524fd14ac49f21cebeb80ff/
.command.begin ERX5474932_ERR5766176_B_2.fastq.gz ERX5474937_ERR5766181_2.fastq.gz
.command.err ERX5474932_ERR5766176_B.classified#.fastq.gz ERX5474937_ERR5766181.classified#.fastq.gz
.command.log ERX5474932_ERR5766176_B.krakenuniq.classified.txt ERX5474937_ERR5766181.krakenuniq.classified.txt
.command.out ERX5474932_ERR5766176_B.krakenuniq.report.txt ERX5474937_ERR5766181.krakenuniq.report.txt
.command.run ERX5474932_ERR5766176_B.unclassified#.fastq.gz ERX5474937_ERR5766181.unclassified#.fastq.gz
.command.sh ERX5474932_ERR5766176.classified#.fastq.gz .exitcode
.command.trace ERX5474932_ERR5766176.krakenuniq.classified.txt testdb-krakenuniq/
ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176.krakenuniq.report.txt versions.yml
ERX5474932_ERR5766176_2.fastq.gz ERX5474932_ERR5766176.unclassified#.fastq.gz
ERX5474932_ERR5766176_B_1.fastq.gz ERX5474937_ERR5766181_1.fastq.gz
## Single end KrakenUniq process work dir
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l work/42/db346910d57723460033ef279a998d/
.command.begin ERR3201952.krakenuniq.classified.txt ERX5474936_ERR5766180_1.fastq.gz
.command.err ERR3201952.krakenuniq.report.txt ERX5474936_ERR5766180_1.krakenuniq.classified.txt
.command.log ERR3201952.unclassified.fastq.gz ERX5474936_ERR5766180_1.krakenuniq.report.txt
.command.out ERX5474930_ERR5766174_1.classified.fastq.gz ERX5474936_ERR5766180_1.unclassified.fastq.gz
.command.run ERX5474930_ERR5766174_1.fa.gz .exitcode
.command.sh ERX5474930_ERR5766174_1.krakenuniq.classified.txt testdb-krakenuniq/
.command.trace ERX5474930_ERR5766174_1.krakenuniq.report.txt versions.yml
ERR3201952.classified.fastq.gz ERX5474930_ERR5766174_1.unclassified.fastq.gz
ERR3201952.fastq.gz ERX5474936_ERR5766180_1.classified.fastq.gz
I guess I need to test the glob a bit, because theoretically it should be fine...
The output patterns don't expect a # in the filename. They look like they might have overlapping matches, too.
It's kinda weird to see # in the filename. Shouldn't it be replaced with _1 and _2?
output:
tuple val(meta), path('*.classified{.,_}*') , optional:true, emit: classified_reads_fastq
tuple val(meta), path('*.unclassified{.,_}*') , optional:true, emit: unclassified_reads_fastq
tuple val(meta), path('*classified.txt') , optional:true, emit: classified_assignment
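To make both points concrete, here is a rough Python sketch that runs the three output patterns above against file names seen in the work directories. It is an approximation of the glob matching (brace expansion plus shell-style wildcards), not Nextflow's own implementation, and the helper names `glob_matches` and `matching_outputs` are hypothetical.

```python
from fnmatch import fnmatchcase
import re


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of brace-free patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_matches(pattern, filename):
    """Approximate glob match: True if any expanded pattern matches the name."""
    return any(fnmatchcase(filename, p) for p in expand_braces(pattern))


# Output patterns copied from the module's output block, keyed by emit name.
output_patterns = {
    "classified_reads_fastq": "*.classified{.,_}*",
    "unclassified_reads_fastq": "*.unclassified{.,_}*",
    "classified_assignment": "*classified.txt",
}


def matching_outputs(filename):
    """Return the emit names whose pattern matches the given file name."""
    return [
        emit for emit, pattern in output_patterns.items()
        if glob_matches(pattern, filename)
    ]


for name in [
    "ERR3201952.classified.fastq.gz",              # single end
    "ERX5474937_ERR5766181.classified#.fastq.gz",  # paired end
    "ERR3201952.krakenuniq.classified.txt",
]:
    print(name, matching_outputs(name))
```

In this sketch the paired-end name with # matches none of the patterns, because `classified` has to be followed by `.` or `_`, while the `.krakenuniq.classified.txt` file matches both the first and the third pattern.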
Ach I always forget the publishDir is guided by the channels 🙄.
I'll try patching the module to see if that fixes it with the glob pattern...
Yeah it's nasty but default output from the tool. That said that module is nasty anyway, as I'm sure you remember from writing it 😅
No idea what you're talking about... 🙈
OK looking through the module code, it's you @Midnighter who added the # apparently :sweat_smile:
Do you happen to remember why you added that? I guess it's something to do with having paired end data but I can't find anything in the KrakenUniq docs about it. Otherwise I suggest I just remove that bit?
Me?! 😱 Sorry, will need to pull up the module and take a look.
Yes you 😬 git blame never lies.. (Well most of the time anyway)
Just for the record, I did make the last change to that line, but I did not introduce the # notation https://github.com/nf-core/modules/pull/2553/files
Ah, the tests are stub runs that's why this was never caught before.
So krakenuniq's help does not actually explain how output for paired-end reads should be specified.
Usage: krakenuniq --report-file FILENAME [options] <filename(s)>
Options:
--db NAME Name for Kraken DB (default: none)
--threads NUM Number of threads (default: 1)
--hll-precision INT Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12)
--exact Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!)
--quick Quick operation (use first hit or hits)
--min-hits NUM In quick op., number of hits req'd for classification
NOTE: this is ignored if --quick is not specified
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--output FILENAME Print output to filename (default: stdout); "off" will
suppress normal output
--only-classified-output
Print no Kraken output for unclassified sequences
--preload Loads the entire DB into memory before classification
--preload-size SIZE Loads DB into memory in chunks of SIZE, e.g. 500M or 7G (if RAM is small), overrides --preload flag
--paired The two filenames provided are paired-end reads
--check-names Ensure each pair of reads have names that agree
with each other; ignored if --paired is not specified
--help Print this message
--version Print version information
Experimental:
--uid-mapping Map using UID database
The file format (fasta/fastq) and compression (gzip/bzip2) do not need to be specified anymore.
The format is detected automatically.
and there exists an issue on the topic open since March 2020. So that's not very promising... I'm not sure if we can actually fix this problem. We may have to just ignore/forbid saving reads with paired-end data.
@jfy133 have we learnt anything new in the meantime regarding krakenuniq test databases that would let us run the module tests properly rather than using stub runs?
I don't know if it's that important to be honest, for me we just need to see e.g. if you ran manually, is there a difference between the output fastqs if you run with single Vs paired data i.e. do we need different glob patterns for the two cases (my impression is maybe not?). Whether it's right or not is an issue of the tool that isn't up for us to fix, I would rather let the user decide if they want to use it or not.
No still not :(
I did some more testing. The saved output is indeed in FASTA format and I'm fairly confident that the pairs have been merged for the output. Take one of the test outputs as an example. ERX5474937_ERR5766181.unclassified#.fastq.gz contains a sequence named ERR5766181.1652. This name occurs in both ERX5474937_ERR5766181_1.fastq.gz and ERX5474937_ERR5766181_2.fastq.gz. In each of those FASTQ files, the sequencing read is 150 characters long. However, in the FASTA ERX5474937_ERR5766181.unclassified#.fastq.gz, the sequence is 303 characters long.
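For anyone who wants to repeat that comparison, a minimal Python sketch along these lines should do. It assumes gzipped input, 4-line FASTQ records, and that the read name is the first whitespace-separated token of the header. `sequence_length` is a made-up helper, and it reads the whole file into memory, so it is only meant for small test data.

```python
import gzip
import os


def sequence_length(path, read_id):
    """Length of the sequence named read_id in a gzipped FASTA/FASTQ file, or None."""
    with gzip.open(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    if not lines:
        return None
    if lines[0].startswith("@"):
        # FASTQ: assume strict 4-line records (header, sequence, +, qualities).
        for i in range(0, len(lines) - 1, 4):
            if lines[i][1:].split()[:1] == [read_id]:
                return len(lines[i + 1])
        return None
    # FASTA: the sequence may be wrapped over several lines.
    length = None
    for line in lines:
        if line.startswith(">"):
            if length is not None:
                return length  # reached the next record
            if line[1:].split()[:1] == [read_id]:
                length = 0
        elif length is not None:
            length += len(line)
    return length


if __name__ == "__main__":
    # File and read names are the ones mentioned above; missing files are skipped.
    read_id = "ERR5766181.1652"
    for path in [
        "ERX5474937_ERR5766181_1.fastq.gz",
        "ERX5474937_ERR5766181_2.fastq.gz",
        "ERX5474937_ERR5766181.unclassified#.fastq.gz",
    ]:
        if os.path.exists(path):
            print(path, sequence_length(path, read_id))
```

A merged record would show up as a length close to the sum of the two mate lengths, as with the 303 versus 150 + 150 reported above.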
I'm about to propose some changes to the module that will hopefully clarify the situation.
My bad, linking the issue closed this, when it should be closed only after the pipeline is fixed.
This might end up whack-a-mole 😅
What have I done 😱 🙈
Description of the bug
Issue:
When running --krakenuniq_save_reads and --krakenuniq_save_readclassifications I don't get the fastq files into the output folder when using paired end data (it works for single end). Going into the work directory the fastq files are there but they end with .classified#.fastq.gz and .unclassified#.fastq.gz.
Any ideas what could be wrong?
Thanks for the help!