hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License
162 stars 22 forks source link

Index error with plot_mutations.py #148

Open smorzechowski opened 8 months ago

smorzechowski commented 8 months ago

Hello! I am a big fan of this tool, thank you! I want to plot inactivating mutations for various selected transcripts and although I can plot some successfully, others throw an index error.

First, I should note I've had to modify the plot_mutations.py in several places in order to plot anything at all. I had to modify line 711 to append the chain value to the trans_line string: trans_line = line_data[0][2:]+"."+line_data[1], otherwise no lines would be selected from the inact_mut_data.txt file.

    else:  # ordinary text file
        f = open(mut_file, "r")
        trans_related_lines = []
        for line in f:
            if not line.startswith("#"):
                continue
            line_data = line.rstrip().split("\t")
            trans_line = line_data[0][2:]+"."+line_data[1]
            # print(trans_line)
            if trans_line == trans:

Similarly, I had to do the same at line 1025 for trans string trans = defect_line_dat[0]+"."+defect_line_dat[1]:

def generate_filebuffer(mut_lines, human_exons_dct, x, y, drawing_mode=DEFAULT_MODE, verbose=False):
    """Generate the body of our SVG file."""
    print("Generating SVG line") if verbose else None
    curr_projection = None  # name of the current handled species
    curr_sp = None
    speciesdata = {}  # mutation data: {mutationnumber:(exon, start, stop)}
    comp_substrings = []
    filebuffer_substrings = []
    vert_offset = 0
    uniq_projections = set()

    for defect_line, sp_num_tup in mut_lines:
        defect_line_dat = defect_line.rstrip().split("\t")
        trans = defect_line_dat[0]+"."+defect_line_dat[1]
        chain = defect_line_dat[1]

These modifications work with the following files and for some transcripts, such as ENSGALT00000067177.3.23 and NOG_ENSGALT00000004916.2.48. However, for other transcripts, like rna-XM_025148119.1.26806 and rna-XM_025144897.1.149, an index error is thrown:

Traceback (most recent call last):
  File "/n/home09/smorzechowski/bin/TOGA/supply/plot_mutations.py", line 1383, in <module>
    svg_line = make_plot(
  File "/n/home09/smorzechowski/bin/TOGA/supply/plot_mutations.py", line 1349, in make_plot
    filebuffer, up_num, vbox_h = generate_filebuffer(
  File "/n/home09/smorzechowski/bin/TOGA/supply/plot_mutations.py", line 1087, in generate_filebuffer
    exon = exons[2 * exonnumber - 1]
IndexError: list index out of range

I'm not familiar enough with this script, or with python in general, to troubleshoot this further, unfortunately!

Also, if you have any insight as to why the original plot_mutations.py doesn't work with my data, I'd be grateful. The original script doesn't work regardless of whether I use an isoform file and gene id instead of transcript id. Appending the chain number is necessary to select anything from the inact_mut_data.txt file.

Thanks very much for any insight you might have!

smorzechowski commented 8 months ago

Here are files to reproduce:

plot_mutations.py.txt plot_mutations_jobscript.txt subset_inact_mut_data.txt subset_query_annotation.txt

subset_test subset_test2

kirilenkobm commented 8 months ago

Hi @smorzechowski

Thank you for highlighting the issue and providing the code fixes and files to reproduce the issue. When we initially developed the tool, transcript identifiers with numerous dots were not common. Nonetheless, it's clear they shouldn't pose a problem in a properly written script. Once I write an update it will appear in the comments to this issue

MichaelHiller commented 8 months ago

Hi,

this is a very good point. We never encountered this, because we always strip any .1 or something like that from scaffold names. I guess the best approach is to check at the beginning whether gene / transcript names contain '.' or replace them with "_" or another character. Thx for reporting it.

smorzechowski commented 8 months ago

I see, sounds good, thanks to you both!