`KRONA_KTIMPORTTEXT` input file name collision if the same 'db_name' is assigned to kraken2 and bracken

MajoroMask commented 1 year ago

Description of the bug

I ran the test profile with a custom reference in which kraken2 and bracken have the same 'db_name'. The --database file I'm using looks like this (note that I added '--quick' to 'db_params' for kraken2 so I can tell the two runs apart):

tool,db_name,db_params,db_path
kraken2,virus,--quick,/path/to/kraken2_ref.tar.gz
bracken,virus,,/path/to/bracken_ref.tar.gz

Error message:

error [nextflow.exception.ProcessUnrecoverableException]: Process `NFCORE_TAXPROFILER:TAXPROFILER:VISUALIZATION_KRONA:KRONA_KTIMPORTTEXT` input file name collision -- There are multiple input files for each of the following file names: 2611_se_virus.txt, 2613_se_virus.txt, ERR3201952_se_virus.txt, 2612_se_virus.txt

I think the problem is in subworkflows/local/visualization_krona.nf.

In my case, the channel ch_krona_text looks like the output below. It turns out that the profile channel passes kraken2 results from both the main workflow and bracken downstream.

[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/25/7d16fe70ec53a0e6fff669e62791b6/2611_se.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/1f/d343968ec58dab56b043d42994d4d0/2611_se.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/36/a8dedc2c287b6e787388a0c17a006d/2613_se.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/66/ff82e120d7cd456bae7da434c40f46/2613_se.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/d3/ccd245830b9bc3f13ab7f72c64e044/ERR3201952_se.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/3f/a43616631f78b5266120db0fd50b20/ERR3201952_se.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/d3/6faaeb791e7403e715cfe442e3593a/2612_se.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/ad/12f357c35f0cea2ba729b5f33c6c54/2612_se.txt]

By contrast, the file names in the profile channel do distinguish between kraken2 called by the main workflow and kraken2 called by bracken. In this case the profile channel looks like this:

[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/ef/f7a190be9d82623ad18847a6f68c6a/2611_se_virus.bracken.kraken2.report.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/ab/d665b526749ea423fb07e513a993fc/2611_se_virus.kraken2.kraken2.report.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/5e/76e4cf1bfd768b37924548eb7f8f7a/2611_se_virus.bracken.tsv]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/dd/3e0260e7b065c15daa7dd645d736b0/2613_se_virus.bracken.kraken2.report.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/18/c743ea1a576bc9879af4639fe6066d/2613_se_virus.kraken2.kraken2.report.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/e7/6ec5e58661a2d87437c6b1d2cebcef/2613_se_virus.bracken.tsv]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/3e/79a17de0578812673270a2aff33233/ERR3201952_se_virus.bracken.kraken2.report.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/07/fe74b9e272ccabf100fdfcad9f0e79/ERR3201952_se_virus.kraken2.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/d5/04d9c1b143371229040930e7727229/2612_se_virus.bracken.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/49/7379a2e9f4a64bc61066c8e2eb511a/2612_se_virus.kraken2.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/b1/ef3eb1115428686809aeeb0adc68e5/2612_se_virus.bracken.tsv]

If the same db_name is assigned to kraken2 and bracken (which I think is reasonable, since I built these two references from the same genome sequences and the same NCBI taxdump), the channel ch_krona_text_for_import will contain input files with exactly the same file name (in this case I get two files for each sample, such as two '2611_se.txt'), which then causes the error.

    /*
        Convert Krona text files into html Krona visualizations
    */
    ch_krona_text_for_import = ch_cleaned_krona_text
        .map{[[id: it[0]['db_name'], tool: it[0]['tool']], it[1]]}
        .groupTuple()  // groupTuple() by `db_name` and `tool`

    KRONA_KTIMPORTTEXT( ch_krona_text_for_import )
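
To make the failure mode concrete, here is a minimal sketch in plain Python (not Nextflow) that mimics the `map` plus `groupTuple()` step above and then looks for duplicate base names within each group. It is only an illustration under assumptions: the work-directory paths are made up, the metadata is trimmed to the fields that matter, and the helper names `group_by_db_and_tool` and `colliding_names` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical stand-ins for (meta, file) items in ch_cleaned_krona_text.
# Metadata values mirror the channel dump above; the paths are invented.
items = [
    ({"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": "--quick"},
     "work/aa/2611_se_virus.txt"),
    ({"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": ""},
     "work/bb/2611_se_virus.txt"),
    ({"id": "2613_se", "tool": "kraken2", "db_name": "virus", "db_params": "--quick"},
     "work/cc/2613_se_virus.txt"),
    ({"id": "2613_se", "tool": "kraken2", "db_name": "virus", "db_params": ""},
     "work/dd/2613_se_virus.txt"),
]


def group_by_db_and_tool(items):
    """Group files by (db_name, tool), dropping every other meta field."""
    groups = defaultdict(list)
    for meta, path in items:
        groups[(meta["db_name"], meta["tool"])].append(path)
    return dict(groups)


def colliding_names(paths):
    """Return the base names that appear more than once in one group."""
    seen, dupes = set(), set()
    for path in paths:
        name = PurePosixPath(path).name
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return sorted(dupes)


groups = group_by_db_and_tool(items)
collisions = {key: colliding_names(paths) for key, paths in groups.items()}
print(collisions)
```

Because db_params is dropped from the grouping key, the '--quick' run and the bracken-triggered run land in the same group, and their identical base names are what the process then rejects when staging its inputs.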

MajoroMask commented 1 year ago

Should I just use a different 'db_name' for each tool? I'm not sure whether that is reasonable, but for now it seems to be a proper workaround.

jfy133 commented 1 year ago

Yes, that is the proper workaround. We will need to decide whether we should enforce it and, if so, fail at the input check.
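
For illustration only, an input check of that kind could look roughly like the Python sketch below. It is not the pipeline's actual validation code: the function name `shared_db_names` is hypothetical, and the sheet contents are copied from the example at the top of this issue.

```python
import csv
import io
from collections import defaultdict

# Database sheet from the report above, inlined so the sketch is self-contained.
SHEET = """tool,db_name,db_params,db_path
kraken2,virus,--quick,/path/to/kraken2_ref.tar.gz
bracken,virus,,/path/to/bracken_ref.tar.gz
"""


def shared_db_names(sheet_text):
    """Map each db_name used by more than one row to the tools that use it."""
    tools_by_name = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        tools_by_name[row["db_name"]].append(row["tool"])
    return {name: tools for name, tools in tools_by_name.items() if len(tools) > 1}


duplicates = shared_db_names(SHEET)
if duplicates:
    # A strict input check would stop the run here instead of printing.
    print(f"db_name shared across tools: {duplicates}")
```

Whether to fail or only warn on a non-empty result is exactly the policy question raised in this comment.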

jfy133 commented 1 year ago

OK, there is a vote to keep allowing the same database name.

We will then need to work out how to remove the bracken kraken2 output from the ch_krona_text input channel, as correctly identified by @MajoroMask!

jfy133 commented 1 year ago

So the issue derives from here, so I guess we need to consider whether this would actually be a problem or not.

I think the general solution will be to make a fake profiler name... and use 'else|or' statements in downstream places; currently those files are used in the profile standardisation and visualisation workflows.
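
As a rough sketch of that idea, again in plain Python rather than Nextflow, the snippet below relabels the kraken2 run launched for bracken so that its grouping key no longer matches the main kraken2 run, and shows the kind of 'or' check downstream steps would then need. The label `kraken2-bracken` and all function names are hypothetical; the thread does not settle on a name.

```python
# Hypothetical placeholder label; the thread does not specify one.
BRACKEN_KRAKEN2 = "kraken2-bracken"


def relabel_for_bracken(meta):
    """Return a copy of meta marking a kraken2 run that was launched for bracken."""
    return {**meta, "tool": BRACKEN_KRAKEN2}


def is_kraken2_format(meta):
    """Downstream 'or' check: either label denotes a kraken2-style report."""
    return meta["tool"] == "kraken2" or meta["tool"] == BRACKEN_KRAKEN2


def grouping_key(meta):
    """Same key as the Krona grouping step: (db_name, tool)."""
    return (meta["db_name"], meta["tool"])


main_run = {"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": "--quick"}
bracken_run = relabel_for_bracken(
    {"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": ""}
)

# The two runs now fall into different groups, so their files no longer collide.
keys = {grouping_key(main_run), grouping_key(bracken_run)}
print(keys)
```

The trade-off is that every consumer of the tool field, such as the standardisation and visualisation steps mentioned above, has to recognise the extra label.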

jfy133 commented 1 year ago

Looking at the comment, indeed we made a faulty assumption about the database names :sweat_smile:

nf-core / taxprofiler