khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

Issue with nodes/names missing unclassified readID (0) #48

Closed Electrocyte closed 1 year ago

Electrocyte commented 1 year ago

Bug report

Bug summary

The nodes/names files are missing the unclassified taxID (0) that is present in the troubleshooting file. I am not sure whether this is normal behaviour. A simple fix is to remove "-x 0" from the command line call. After doing this, I had no issues running Recentrifuge with Centrifuge output.

Running Recentrifuge

Command line

~/recentrifuge/rcf -c 1 -f $NEGDIR -f $S1 -f $S2 -f $S3 -x 0 -x 9606 -n $NODES -o "$EOUT/samples-out.html" -e "CSV"

Data

Actual outcome

=-= /recentrifuge/rcf =-= v1.2.0 - Sep 2020 =-= by Jose Manuel Martí =-=

Control(s) sample(s) for subtractions:
        /mnt/usersData/DNA/analysis//sample_data/20230106_aDNA_Plain-medium-16S_0CFU_94_12//centrifuge/20230106_aDNA_Plain-medium-16S_0CFU_94_12_v_f_b2_no_host_centrifuge_troubleshooting_report.tsv
Loading NCBI nodes... OK!
Loading NCBI names... OK!
Building dict of parent to children taxa... OK!
List of taxa (and below) to be excluded:
                Id      Scientific Name
Traceback (most recent call last):
  File "/recentrifuge/rcf", line 836, in <module>
    main()
  File "/recentrifuge/rcf", line 759, in main
    ncbi: Taxonomy = Taxonomy(nodesfile, namesfile, plasmidfile,
  File "/recentrifuge/recentrifuge/taxonomy.py", line 63, in __init__
    print(f'\t\t{taxid}\t{self.names[taxid]}')
KeyError: '0'

Expected outcome

Versions

Operating system: Linux
Python version: 3.8.3
Recentrifuge version: 1.12.0
Release of Centrifuge: 1.0.4
Pandas version (if applicable): 1.5.3
Other libraries (if applicable): openpyxl v3.1.x

khyox commented 1 year ago

Hi @Electrocyte, and thanks for the detailed bug report! The taxid 0 is not a valid NCBI taxid (see the output from https://www.ncbi.nlm.nih.gov/taxonomy/?term=0); it is just the way Centrifuge represents unclassified reads in its output. So you can't exclude those reads using Recentrifuge's -x argument, and you actually don't want to exclude them: Recentrifuge uses that information to calculate the ratio of classified to unclassified reads and reports that statistic in the results. Hope this helps.
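
For anyone hitting the same KeyError, the idea can be sketched in a few lines of Python: check each requested exclusion taxid against the loaded names before using it. This is only an illustration, not Recentrifuge's actual code. The function `split_exclusions` and the tiny `names` dict are hypothetical stand-ins for the taxid-to-name mapping that the traceback shows being indexed as `self.names[taxid]`.

```python
from typing import Dict, Iterable, List, Tuple


def split_exclusions(
    requested: Iterable[str], names: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Separate requested taxids into those present in `names` and the rest.

    Hypothetical helper: `names` mimics a taxid -> scientific name mapping
    such as the one loaded from an NCBI names file.
    """
    valid: List[str] = []
    unknown: List[str] = []
    for taxid in requested:
        # A membership test avoids the KeyError that names[taxid] would raise.
        (valid if taxid in names else unknown).append(taxid)
    return valid, unknown


# Toy mapping (assumption): taxid 0 is absent because it is not an NCBI taxid.
names = {'1': 'root', '9606': 'Homo sapiens'}

valid, unknown = split_exclusions(['0', '9606'], names)
print(valid)    # ['9606']
print(unknown)  # ['0']
```

With the command line from the report, this corresponds to keeping `-x 9606` and dropping `-x 0`.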