nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data
https://nf-co.re/taxprofiler
MIT License

`KRONA_KTIMPORTTEXT` input file name collision if same 'db_name' is assigned to kraken2 and bracken #395

Closed MajoroMask closed 10 months ago

MajoroMask commented 11 months ago

Description of the bug

I ran the test profile with a custom reference in which kraken2 and bracken have the same 'db_name'. The --database sheet I'm using looks like this (note that I added '--quick' to 'db_params' for kraken2 so that I can tell the two apart):

tool,db_name,db_params,db_path
kraken2,virus,--quick,/path/to/kraken2_ref.tar.gz
bracken,virus,,/path/to/bracken_ref.tar.gz

Error message:

error [nextflow.exception.ProcessUnrecoverableException]: Process `NFCORE_TAXPROFILER:TAXPROFILER:VISUALIZATION_KRONA:KRONA_KTIMPORTTEXT` input file name collision -- There are multiple input files for each of the following file names: 2611_se_virus.txt, 2613_se_virus.txt, ERR3201952_se_virus.txt, 2612_se_virus.txt

I think the problem is in subworkflows/local/visualization_krona.nf.

In my case, the channel ch_krona_text looks like this. It turns out that the profile channel passes the kraken2 results from both the main workflow and bracken downstream.

[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/25/7d16fe70ec53a0e6fff669e62791b6/2611_se.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/1f/d343968ec58dab56b043d42994d4d0/2611_se.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/36/a8dedc2c287b6e787388a0c17a006d/2613_se.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/66/ff82e120d7cd456bae7da434c40f46/2613_se.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/d3/ccd245830b9bc3f13ab7f72c64e044/ERR3201952_se.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/3f/a43616631f78b5266120db0fd50b20/ERR3201952_se.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/d3/6faaeb791e7403e715cfe442e3593a/2612_se.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/ad/12f357c35f0cea2ba729b5f33c6c54/2612_se.txt]

In contrast, the file names in the profile channel do distinguish kraken2 called by the main workflow from kraken2 called by bracken. In this case the profile channel looks like this:

[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/ef/f7a190be9d82623ad18847a6f68c6a/2611_se_virus.bracken.kraken2.report.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/ab/d665b526749ea423fb07e513a993fc/2611_se_virus.kraken2.kraken2.report.txt]
[[sample:2611, run_accession:ERR5766174, instrument_platform:ILLUMINA, id:2611_se, single_end:true, is_fasta:true, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/5e/76e4cf1bfd768b37924548eb7f8f7a/2611_se_virus.bracken.tsv]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/dd/3e0260e7b065c15daa7dd645d736b0/2613_se_virus.bracken.kraken2.report.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/18/c743ea1a576bc9879af4639fe6066d/2613_se_virus.kraken2.kraken2.report.txt]
[[sample:2613, instrument_platform:ILLUMINA, is_multirun:false, id:2613_se, single_end:true, is_fasta:false, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/e7/6ec5e58661a2d87437c6b1d2cebcef/2613_se_virus.bracken.tsv]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/3e/79a17de0578812673270a2aff33233/ERR3201952_se_virus.bracken.kraken2.report.txt]
[[sample:ERR3201952, instrument_platform:OXFORD_NANOPORE, is_multirun:false, id:ERR3201952_se, single_end:1, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/07/fe74b9e272ccabf100fdfcad9f0e79/ERR3201952_se_virus.kraken2.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:], /data1/suna/proj/t/work/d5/04d9c1b143371229040930e7727229/2612_se_virus.bracken.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:kraken2, db_name:virus, db_params:--quick], /data1/suna/proj/t/work/49/7379a2e9f4a64bc61066c8e2eb511a/2612_se_virus.kraken2.kraken2.report.txt]
[[sample:2612, instrument_platform:ILLUMINA, is_multirun:true, id:2612_se, single_end:true, is_fasta:false, tool:bracken, db_name:virus, db_params:], /data1/suna/proj/t/work/b1/ef3eb1115428686809aeeb0adc68e5/2612_se_virus.bracken.tsv]

If the same db_name is assigned to kraken2 and bracken (which I think is reasonable, since I built these two references from the same genome sequences and the same NCBI taxdump), channel ch_krona_text_for_import will contain input files with exactly the same file name (in this case two files per sample, e.g. two '2611_se.txt'), which causes the error.

    /*
        Convert Krona text files into html Krona visualizations
    */
    ch_krona_text_for_import = ch_cleaned_krona_text
        .map{[[id: it[0]['db_name'], tool: it[0]['tool']], it[1]]}
        .groupTuple()  // groupTuple() by `db_name` and `tool`

    KRONA_KTIMPORTTEXT( ch_krona_text_for_import )
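To make the collision concrete, here is a minimal Python sketch of the grouping step above. It is a plain-Python model, not Nextflow code: `group_by_db_and_tool` and `find_collisions` are hypothetical helper names, the work directories are made up, and the file names are assumed to match those listed in the error message.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath

# Illustrative (meta, path) pairs; the work directories are made up.
records = [
    ({"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": "--quick"},
     "work/aa/2611_se_virus.txt"),
    ({"id": "2611_se", "tool": "kraken2", "db_name": "virus", "db_params": ""},
     "work/bb/2611_se_virus.txt"),
    ({"id": "2613_se", "tool": "kraken2", "db_name": "virus", "db_params": "--quick"},
     "work/cc/2613_se_virus.txt"),
    ({"id": "2613_se", "tool": "kraken2", "db_name": "virus", "db_params": ""},
     "work/dd/2613_se_virus.txt"),
]


def group_by_db_and_tool(records):
    """Model .map{...}.groupTuple(): the key is (db_name, tool) only."""
    groups = defaultdict(list)
    for meta, path in records:
        groups[(meta["db_name"], meta["tool"])].append(path)
    return dict(groups)


def find_collisions(paths):
    """Return the base names that occur more than once within one group."""
    counts = Counter(PurePosixPath(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


groups = group_by_db_and_tool(records)
collisions = {key: find_collisions(paths) for key, paths in groups.items()}
print(collisions)
```

Because `db_params` is not part of the grouping key, the '--quick' run and the bracken-triggered run land in the same group, and each sample contributes two files with an identical base name.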

Command used and terminal output

No response

Relevant files

No response

System information

No response

MajoroMask commented 11 months ago

Should I just use a different 'db_name' for each tool? I'm not sure that's reasonable, but for now it seems to be a proper workaround.

jfy133 commented 11 months ago

Yes, it is the proper workaround. We will need to think about whether we should enforce that and, if so, fail at the input check.
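If such a check were enforced, it could look roughly like the Python sketch below. This is only an illustration of the idea, not the pipeline's actual input validation: `tools_sharing_db_name` is a hypothetical helper, and the sheet is the one from the bug report.

```python
import csv
import io
from collections import defaultdict


def tools_sharing_db_name(sheet_text):
    """Map each db_name used by more than one tool to the sorted list of those tools."""
    tools_by_name = defaultdict(set)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        tools_by_name[row["db_name"]].add(row["tool"])
    return {name: sorted(tools) for name, tools in tools_by_name.items() if len(tools) > 1}


# Database sheet copied from the bug report.
sheet = """tool,db_name,db_params,db_path
kraken2,virus,--quick,/path/to/kraken2_ref.tar.gz
bracken,virus,,/path/to/bracken_ref.tar.gz
"""

shared = tools_sharing_db_name(sheet)
if shared:
    print(f"db_name shared across tools: {shared}")
```

A stricter variant would raise an error instead of printing, so that the run stops before any profiling starts.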

jfy133 commented 10 months ago

OK there is a vote to keep allowing the same database name.

We will then need to work out how to remove the bracken kraken2 output from the ch_krona_text input channel, as correctly identified by @MajoroMask!

jfy133 commented 10 months ago

So the issue derives from here, so I guess we need to reflect on whether this would actually be a problem or not.

I think the general solution will be to make a fake profiler name... and use 'else|or' statements in downstream places. Currently those files are used in the profile standardisation and visualisation workflows.
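A rough Python sketch of that idea follows. Everything in it is an assumption made for illustration: the label 'kraken2-bracken', the `ran_for_bracken` flag, and the choice of which downstream step accepts which label are not taken from the pipeline code.

```python
KRAKEN2_FOR_BRACKEN = "kraken2-bracken"  # hypothetical placeholder tool name


def relabel(meta, ran_for_bracken):
    """Copy meta, swapping in the placeholder name for kraken2 runs made for bracken."""
    if meta["tool"] == "kraken2" and ran_for_bracken:
        return {**meta, "tool": KRAKEN2_FOR_BRACKEN}
    return dict(meta)


def wanted_for_krona(meta):
    """Visualisation (assumed): keep only kraken2 reports from the main workflow."""
    return meta["tool"] == "kraken2"


def wanted_for_standardisation(meta):
    """Standardisation (assumed): the 'or' case, accept either kind of kraken2 report."""
    return meta["tool"] in ("kraken2", KRAKEN2_FOR_BRACKEN)


base = {"id": "2611_se", "tool": "kraken2", "db_name": "virus"}
main_run = relabel(base, ran_for_bracken=False)
bracken_run = relabel(base, ran_for_bracken=True)

krona_inputs = [m for m in (main_run, bracken_run) if wanted_for_krona(m)]
print([m["tool"] for m in krona_inputs])
```

With a distinct tool value, a grouping key built from `db_name` and `tool` no longer merges the two kraken2 runs, even when both databases share the name 'virus'.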

jfy133 commented 10 months ago

Looking at the comment, indeed we made a faulty assumption about the database names :sweat_smile: