nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data
https://nf-co.re/taxprofiler
MIT License

MetaPhlAn4 full index provides duplicated 'NCBI_tax_id' to taxpasta as input #396

Closed MajoroMask closed 11 months ago

MajoroMask commented 11 months ago

Description of the bug

I'm running the standard test profile with a local MetaPhlAn4 database that I built following the official instructions.

The error message given by taxpasta is rather long (see this gist), and I think the end of it gives me a hint:

ValueError: Index has duplicate keys: CategoricalIndex([165179], categories=[0, 
2, 468, 469, ..., 2003188, 2082587, 2292893, 2887326], ordered=False, 
dtype='category', name='taxonomy_id')

So I'm guessing the input (2612_se_metaphlan4-db.metaphlan_profile.txt in particular) contains multiple rows with the same NCBI tax ID:

cat 2612_se_metaphlan4-db.metaphlan_profile.txt | grep '165179'
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A        2|976|200643|171549|171552|838|165179       14.84366
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C        2|976|200643|171549|171552|838|165179       2.80509
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B        2|976|200643|171549|171552|838|165179       1.44047
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F        2|976|200643|171549|171552|838|165179       0.48529
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626     2|976|200643|171549|171552|838|165179|      14.84366
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C|t__SGB1644     2|976|200643|171549|171552|838|165179|      2.80509 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_TF12_30,k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_AM23_5
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B|t__SGB1613     2|976|200643|171549|171552|838|165179|      1.44047
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F|t__SGB1614     2|976|200643|171549|171552|838|165179|      0.48529
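
The same check can be run over the whole profile instead of one ID at a time. The following is a minimal sketch, not part of taxpasta or taxprofiler: the function name `find_duplicate_leaf_taxids` is hypothetical, and it assumes the profile is tab-separated with the clade name in the first column and the `|`-delimited tax ID lineage in the second. It treats the last entry of the lineage as the row's own tax ID and skips rows where that entry is empty (as in the `t__SGB` rows above). The sample rows are abbreviated versions of the lines shown above.

```python
from collections import defaultdict


def find_duplicate_leaf_taxids(lines):
    """Return {tax ID: [clade names]} for leaf tax IDs carried by more than one row."""
    clades_by_taxid = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        clade_name, clade_taxids = fields[0], fields[1]
        # The row's own tax ID is assumed to be the last lineage entry.
        leaf = clade_taxids.split("|")[-1]
        if not leaf:
            continue  # lineage ends in '|', so this row has no tax ID of its own
        clades_by_taxid[leaf].append(clade_name.split("|")[-1])
    return {taxid: names for taxid, names in clades_by_taxid.items() if len(names) > 1}


# Abbreviated sample rows (lineages shortened for readability).
SAMPLE = [
    "#clade_name\tclade_taxid\trelative_abundance",
    "k__Bacteria\t2\t100.0",
    "k__Bacteria|g__Prevotella|s__Prevotella_copri_clade_A\t2|838|165179\t14.84366",
    "k__Bacteria|g__Prevotella|s__Prevotella_copri_clade_C\t2|838|165179\t2.80509",
    "k__Bacteria|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626\t2|838|165179|\t14.84366",
]

print(find_duplicate_leaf_taxids(SAMPLE))
```

To run it on a real profile, pass an open file handle (for example `open("2612_se_metaphlan4-db.metaphlan_profile.txt")`) instead of `SAMPLE`. Any ID it reports is shared by several clades.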

I checked the taxpasta and taxprofiler documentation, but I'm still not sure whether this is a bug or whether I set some argument incorrectly. Could you help me check this? Any reply would be helpful.

Command used and terminal output

nextflow run ./main.nf -profile test,docker --outdir test/test --databases test_metaphlan4.csv --max_memory '64.GB'

Relevant files

No response

System information

No response

jfy133 commented 11 months ago

Thanks @MajoroMask! Could you please post this as a taxpasta issue?

It seems that in this case the taxonomy used is not sufficient. You should indeed have only a single unique taxid for each organism, so it sounds like a problem with the taxonomy, but maybe the error message should be clearer.