transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

Error in DIAMOND_analysis_counter.py #74

Open McKSal opened 2 years ago

McKSal commented 2 years ago

Hello, I am having issues with the DIAMOND_analysis_counter.py script. I am getting an error similar to the one in this previous post: https://github.com/transcript/samsa2/issues/57

command: python Diamond_analysis_counter2.py -I BMRNA2_other_nr.daa_viewable -D /media/scratch/2022_diamond_nr_db/nr -O BMRNA2_other_nr_organism

error:

Now reading through the m8 results infile.

Analysis of BMRNA2_other_nr.daa_viewable complete.
Number of total lines: 426637
Number of unique sequences: 422738
Time elapsed: 0.5995767116546631 seconds.

Starting database analysis now.
Traceback (most recent call last):
  File "Diamond_analysis_counter2.py", line 151, in <module>
    if split_db_org[1] == "sp.":
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Diamond_analysis_counter2.py", line 157, in <module>
    db_org = split_db_org[1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Diamond_analysis_counter2.py", line 162, in <module>
    db_org = split_db_org[1] + " " + split_db_org[2]
IndexError: list index out of range

From the post linked above: "the parsing script doesn't do well when there are multiple instances of square brackets in the line."

When I go in and look at line 151 of the input file, all I see is a string of amino acids: TREFEAFEAGRRYANTAYLVDLQEMQGDNLLRELVRITAQMNWQLNDLKEQIRQGNVISGQQLALTARQYYEKQLGSLEK

transcript commented 2 years ago

Hi McKSal,

Sure, let's see if I can help. I might need to ask a couple questions and have you try a couple things.

First, don't worry about line 151 - that's the line in the Python script that's throwing the error, not a line in the input file.

The error occurs while the script is reading in the nr database. Some line in there is giving it trouble, because the script can't split it into the ID, organism, and functional names.
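
To illustrate the failure mode, here is a minimal sketch of the indexing pattern visible in the traceback. It is not the script's actual code: the function name `organism_from_words` and the sample strings are made up for illustration, and the assumption is that the organism field ends up with fewer words than the script expects.

```python
def organism_from_words(db_org):
    """Mimic the indexing seen in the traceback: take the words at
    positions 1 and 2, but report failure instead of crashing."""
    split_db_org = db_org.split()
    try:
        return split_db_org[1] + " " + split_db_org[2]
    except IndexError:
        # Fewer than three words: there is nothing at index 1 or 2.
        return None


# Hypothetical organism strings, for illustration only.
print(organism_from_words("1 Escherichia coli"))  # three words: indexing works
print(organism_from_words("Escherichia"))         # one word: IndexError path
print(organism_from_words(""))                    # empty string: IndexError path
```

Any database line whose organism field is empty or a single word would hit the `IndexError` branch, which matches the "list index out of range" message in the report.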

By default, the script doesn't print which line in the database causes it to error out; that would be a good item for me to add, since it would provide a bit more debugging information. Do you feel comfortable making a couple of small edits to the Python script and then running it again?

If so, you could replace lines 161 and 162 in the DIAMOND_analysis_counter.py script with the following:

if db_org[0].isdigit():
    split_db_org = db_org.split()
    try:
        db_org = split_db_org[1] + " " + split_db_org[2]
    except IndexError:
        print(line)
        print(str(db_line_counter))

When you rerun the script, it will still fail in the same place - but this time it will print out the offending line from the database that's causing the issue, as well as the count of which line in the database this is.
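
As an alternative that doesn't require editing the script, the database headers could be scanned separately. The following is a rough sketch, assuming FASTA headers of the form `>ID description [Organism name]` with one record per header; the helper name `find_suspect_headers` and the rule "exactly one bracketed group" are assumptions for illustration, not part of SAMSA2.

```python
import re

# Matches one innermost [...] group with non-empty content.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def find_suspect_headers(lines):
    """Return (line_number, header) pairs for FASTA headers that do not
    contain exactly one bracketed group (assumed to be the organism)."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        groups = BRACKETED.findall(line)
        if len(groups) != 1:
            suspects.append((number, line.rstrip("\n")))
    return suspects


# Made-up sample records, for illustration only.
sample = [
    ">A1 protein X [Escherichia coli]\n",
    "MKT\n",
    ">A2 protein without organism\n",
    ">A3 [Fe-S] binding protein [Bacillus subtilis]\n",
]
for number, header in find_suspect_headers(sample):
    print(number, header)
```

To run this against a real database, the list could be replaced with an open file handle, since the function only iterates over lines. Headers that bundle several records together would all be flagged under this rule, so the criterion would need adjusting for such files.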

That should give me more information so I can recommend a solution.

-Sam