Krakenuniq save_reads does not give the fastq files when using PE data

SannaAb commented 10 months ago

Description of the bug

Issue:

When running --krakenuniq_save_reads and --krakenuniq_save_readclassifications I don´t get the fastq files into the output folder when using paried end data (it works for single end). Going into the work directory the fastq files are there but they end with .classified#.fastq.gz and .unclassified#.fastq.gz .

Any ideas what could be wrong?

Thanks for the help!

Command used and terminal output

nextflow run nf-core/taxprofiler --perform_shortread_qc --shortread_qc_tool 'fastp' --perform_shortread_hostremoval --shortread_hostremoval_index $hostindex --hostremoval_reference $hostfasta --save_preprocessed_reads --perform_shortread_complexityfilter --shortread_complexityfilter_tool 'bbduk' -profile singularity --input $SampleSheet --databases $DatabaseSheet --outdir $outDir -r 1.1.2 -c $MandaloreConfig --run_kraken2 --kraken2_save_readclassifications --kraken2_save_reads --run_krakenuniq --krakenuniq_save_reads --krakenuniq_save_readclassifications --run_krona --save_hostremoval_unmapped -resume

Relevant files

No response

System information

No response

Midnighter commented 10 months ago

Probably related to the file pattern in the publishing options.

jfy133 commented 10 months ago

Hrmm, not sure.

        publishDir = [
            path: { "${params.outdir}/krakenuniq/${meta.db_name}/" },
            mode: params.publish_dir_mode,
            pattern: '*.{txt,fastq.gz}'
        ]

should pick up anything with .fastq.gz...

Maybe the # in the file name braeks the glob?

jfy133 commented 10 months ago

I confirm the bug:

## The results directory
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l results/krakenuniq/db6/ER
ERR3201952.classified.fastq.gz                     ERX5474930_ERR5766174_1.krakenuniq.report.txt      ERX5474936_ERR5766180_1.classified.fastq.gz
ERR3201952.krakenuniq.classified.txt               ERX5474930_ERR5766174_1.unclassified.fastq.gz      ERX5474936_ERR5766180_1.krakenuniq.classified.txt
ERR3201952.krakenuniq.report.txt                   ERX5474932_ERR5766176_B.krakenuniq.classified.txt  ERX5474936_ERR5766180_1.krakenuniq.report.txt
ERR3201952.unclassified.fastq.gz                   ERX5474932_ERR5766176_B.krakenuniq.report.txt      ERX5474936_ERR5766180_1.unclassified.fastq.gz
ERX5474930_ERR5766174_1.classified.fastq.gz        ERX5474932_ERR5766176.krakenuniq.classified.txt    ERX5474937_ERR5766181.krakenuniq.classified.txt
ERX5474930_ERR5766174_1.krakenuniq.classified.txt  ERX5474932_ERR5766176.krakenuniq.report.txt        ERX5474937_ERR5766181.krakenuniq.report.txt
## Paired end KrakenUniq process work dir
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l work/65/95e1322524fd14ac49f21cebeb80ff/
.command.begin                                     ERX5474932_ERR5766176_B_2.fastq.gz                 ERX5474937_ERR5766181_2.fastq.gz
.command.err                                       ERX5474932_ERR5766176_B.classified#.fastq.gz       ERX5474937_ERR5766181.classified#.fastq.gz
.command.log                                       ERX5474932_ERR5766176_B.krakenuniq.classified.txt  ERX5474937_ERR5766181.krakenuniq.classified.txt
.command.out                                       ERX5474932_ERR5766176_B.krakenuniq.report.txt      ERX5474937_ERR5766181.krakenuniq.report.txt
.command.run                                       ERX5474932_ERR5766176_B.unclassified#.fastq.gz     ERX5474937_ERR5766181.unclassified#.fastq.gz
.command.sh                                        ERX5474932_ERR5766176.classified#.fastq.gz         .exitcode
.command.trace                                     ERX5474932_ERR5766176.krakenuniq.classified.txt    testdb-krakenuniq/
ERX5474932_ERR5766176_1.fastq.gz                   ERX5474932_ERR5766176.krakenuniq.report.txt        versions.yml
ERX5474932_ERR5766176_2.fastq.gz                   ERX5474932_ERR5766176.unclassified#.fastq.gz       
ERX5474932_ERR5766176_B_1.fastq.gz                 ERX5474937_ERR5766181_1.fastq.gz               
## Single end KrakenUniq process work dir
(nf-core) james@bionb103:~/git/nf-core/taxprofiler/testing (krakenuniq_pairedend_bug)$ ls -l work/42/db346910d57723460033ef279a998d/
.command.begin                                     ERR3201952.krakenuniq.classified.txt               ERX5474936_ERR5766180_1.fastq.gz
.command.err                                       ERR3201952.krakenuniq.report.txt                   ERX5474936_ERR5766180_1.krakenuniq.classified.txt
.command.log                                       ERR3201952.unclassified.fastq.gz                   ERX5474936_ERR5766180_1.krakenuniq.report.txt
.command.out                                       ERX5474930_ERR5766174_1.classified.fastq.gz        ERX5474936_ERR5766180_1.unclassified.fastq.gz
.command.run                                       ERX5474930_ERR5766174_1.fa.gz                      .exitcode
.command.sh                                        ERX5474930_ERR5766174_1.krakenuniq.classified.txt  testdb-krakenuniq/
.command.trace                                     ERX5474930_ERR5766174_1.krakenuniq.report.txt      versions.yml
ERR3201952.classified.fastq.gz                     ERX5474930_ERR5766174_1.unclassified.fastq.gz      
ERR3201952.fastq.gz                                ERX5474936_ERR5766180_1.classified.fastq.gz

I guess need to test teh glob a bit, because theoretically it should be fine...

Midnighter commented 10 months ago

The output patterns don't expect a # in the filename. They look like they might have overlapping matches, too.

It's kinda weird to see # in the filename. Shouldn't it be replaced with _1 and _2?

    output:
    tuple val(meta), path('*.classified{.,_}*')     , optional:true, emit: classified_reads_fastq
    tuple val(meta), path('*.unclassified{.,_}*')   , optional:true, emit: unclassified_reads_fastq
    tuple val(meta), path('*classified.txt')        , optional:true, emit: classified_assignment

jfy133 commented 10 months ago

The output patterns don't expect a # in the filename. They look like they might have overlapping matches, too.

It's kinda weird to see # in the filename. Shouldn't it be replaced with _1 and _2?
    output:
    tuple val(meta), path('*.classified{.,_}*')     , optional:true, emit: classified_reads_fastq
    tuple val(meta), path('*.unclassified{.,_}*')   , optional:true, emit: unclassified_reads_fastq
    tuple val(meta), path('*classified.txt')        , optional:true, emit: classified_assignment

Ach I always forget the publishDir is guided by the channels 🙄.

I'll try patching the module to see if that fixes it with the glob pattern...

Yeah it's nasty but default output from the tool. That said that module is nasty anyway, as I'm sure you remember from writing it 😅

Midnighter commented 10 months ago

No idea what you're talking about... 🙈

jfy133 commented 10 months ago

OK looking through the module code, it's you @Midnighter who added the # apparently :sweat_smile:

https://github.com/nf-core/taxprofiler/blame/3d4eda2dbb9d05a8f8b7f3cde4ed6e479189d8fa/modules/nf-core/krakenuniq/preloadedkrakenuniq/main.nf#L33-L34

Do you happen to remember why you added that? I guess it's something to do with having paired end data but I can't find anything in the KrakenUniq docs about it. Otherwise I suggestI just remove that bit?

Midnighter commented 10 months ago

Me?! 😱 Sorry, will need to pull up the module and take a look.

jfy133 commented 10 months ago

Yes you 😬 git blame never lies.. (Well most of the time anyway)

Midnighter commented 10 months ago

Just for the record, I did make the last change to that line, but I did not introduce the # notation https://github.com/nf-core/modules/pull/2553/files

Midnighter commented 10 months ago

Ah, the tests are stub runs that's why this was never caught before.

Midnighter commented 10 months ago

So krakenuniq's help does actually not properly explain how output for paired-end reads should be specified.

Usage: krakenuniq --report-file FILENAME [options] <filename(s)>

Options:
  --db NAME               Name for Kraken DB (default: none)
  --threads NUM           Number of threads (default: 1)
  --hll-precision INT     Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12)
  --exact                 Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!)
  --quick                 Quick operation (use first hit or hits)
  --min-hits NUM          In quick op., number of hits req'd for classification
                          NOTE: this is ignored if --quick is not specified
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "off" will
                          suppress normal output
  --only-classified-output
                          Print no Kraken output for unclassified sequences
  --preload               Loads the entire DB into memory before classification
  --preload-size SIZE     Loads DB into memory in chunks of SIZE, e.g. 500M or 7G (if RAM is small), overrides --preload flag
  --paired                The two filenames provided are paired-end reads
  --check-names           Ensure each pair of reads have names that agree
                          with each other; ignored if --paired is not specified
  --help                  Print this message
  --version               Print version information

Experimental:
  --uid-mapping           Map using UID database

The file format (fasta/fastq) and compression (gzip/bzip2) do not need to be specified anymore.
The format is detected automatically.

and there exists an issue on the topic open since March 2020. So that's not very promising... I'm not sure if we can actually fix this problem. We may have to just ignore/forbid saving reads with paired-end data.

Midnighter commented 10 months ago

@jfy133 have we learnt anything new in the meantime regarding krakenuniq test databases that would let us run the module tests properly rather than using stub runs?

jfy133 commented 10 months ago

I don't know if it's that important to be honest, for me we just need to see e.g. if you ran manually, is there a difference between the output fastqs if you run with single Vs paired data i.e. do we need different glob patterns for the two cases (my impression is maybe not?). Whether it's right or not is an issue of the tool that isn't up for us to fix, I would rather let the user decide if they want to use it or not.
No still not :(

Midnighter commented 10 months ago

I did some more testing. The saved output is indeed in FASTA format and I'm fairly confident that the pairs have been merged for the output. Take one of the test outputs as an example. ERX5474937_ERR5766181.unclassified#.fastq.gz contains a sequence named ERR5766181.1652. This name occurs in both ERX5474937_ERR5766181_1.fastq.gz and ERX5474937_ERR5766181_2.fastq.gz. In each of those FASTQ files, the sequencing read is 150 characters long. However, in the FASTA ERX5474937_ERR5766181.unclassified#.fastq.gz, the sequence is 303 characters long.

I'm about to propose some changes to the module that will hopefully clarify the situation.

Midnighter commented 10 months ago

My bad, linking the issue closed this, when it should be closed only after the pipeline is fixed.

jfy133 commented 10 months ago

This might end up wackamole 😅

Midnighter commented 10 months ago

What have I done 😱 🙈

nf-core / taxprofiler