nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data
https://nf-co.re/taxprofiler
MIT License
116 stars 33 forks source link

process `VISUALIZATION_KRONA` mis-combined channel by wrong index #394

Closed MajoroMask closed 10 months ago

MajoroMask commented 11 months ago

Description of the bug

In subworkflows/local/visualization_krona.nf

    /*
        Combine Kaiju profiles with their databases
    */
    ch_input_for_kaiju2krona = ch_input_classifications.kaiju
        .map{ [it[0]['db_name'], it[0], it[1]] }
        .combine( databases.map{ [it[0]['db_name'], it[1]] }, by: 0 )
        .multiMap{
            it ->
                profiles: [it[1], it[2]]
                db: it[3]
        }

As far as I am concerned, it should be:

    /*
        Combine Kaiju profiles with their databases
    */
    ch_input_for_kaiju2krona = ch_input_classifications.kaiju
-        .map{ [it[0]['db_name'], it[0], it[1]] }
-        .combine( databases.map{ [it[0]['db_name'], it[1]] }, by: 0 )
+        .map{ [it[0]['tool'], it[0], it[1]] }
+        .combine( databases.map{ [it[0]['tool'], it[1]] }, by: 0 )
        .multiMap{
            it ->
                profiles: [it[1], it[2]]
                db: it[3]
        }

I ran into this bug when I'm testing my own reference with the same db_name of kaiju and diamond. The --databases I ueed goes like below. And to trigger this bug, --run_diamond --run_kaiju --run_krona is needed in this case.

tool,db_name,db_params,db_path
kaiju,virus_prot,,/path/to/kaiju_ref.tar.gz
diamond,virus_prot,,/path/to/diamond_ref.dmnd

Hope this report would be of help to this project. Looking forward for replies.

Command used and terminal output

No response

Relevant files

No response

System information

No response

jfy133 commented 11 months ago

Presumably this is also relatd ot this: https://github.com/nf-core/taxprofiler/issues/395

Where the database name is the same? Either we enforce unique databases or maybe we should change the combine so it binds on both tool and database.

MajoroMask commented 11 months ago

Presumably this is also relatd ot this: #395

Where the database name is the same? Either we enforce unique databases or maybe we should change the combine so it binds on both tool and database.

↓ The --database I'm using in this case ↓ . Column db_name for kaiju and DIAMOND is the same "virus_prot".

tool,db_name,db_params,db_path
kaiju,virus_prot,,/path/to/kaiju_ref.tar.gz
diamond,virus_prot,,/path/to/diamond_ref.dmnd

I gave them the same name simply because I built index for DIAMOND and kaiju from the same .faa sequence reference. So I thought their results may have some kind of comparability. (On second thought, my assumption here is actually against the designed purpose of taxpasta)

I haven't see this part of code thoroughly and I thought meta.db_name is meant to be not unique. So if it is used as a index/key, maybe enforce the uniqueness of meta.db_name is a good idea. So yes, #395 is actually the same issue from this point of view.

Thanks for replying and please forgive my bad writing English. This project helps me a lot so I wish to do my part for it.

jfy133 commented 11 months ago

No, I think your assumption is reasonable!! For example, you might want to compare multiple databases with multiple tools, and you want an easy way to compare them, e.g.

tool,db_name,db_params,db_path
kaiju,virus_prot_big,,/path/to/kaiju_ref.tar.gz
diamond,virus_prot_big,,/path/to/diamond_ref.dmnd
kaiju,virus_prot_small,,/path/to/kaiju_ref.tar.gz
diamond,virus_prot_small,,/path/to/diamond_ref.dmnd

We can discuss on slack how to fix this and release it in a 1.1.2 :)

No worries, we really appreciate you looking so much into the code to understand it, it helps us triage! And your English is completely (probably better than mine, and I'm English 😅)