ndaniel / fusioncatcher

Finder of Somatic Fusion Genes in RNA-seq data
GNU General Public License v3.0

fusioncatcher-build fails for canis lupus familiaris, with IndexError: list index out of range #182

Open jowkar opened 3 years ago

jowkar commented 3 years ago

This issue is similar to #82 from 2018, except it could not be solved by changing servers, so the root cause might be something different this time. The latest version of FusionCatcher was installed with the recommended method. The error message is the following (the full log file is attached):

    Traceback (most recent call last):
      File "/home/joakim/bin/fusioncatcher_24_02_2021/fusioncatcher/bin/add_custom_gene.py", line 288, in <module>
        head = database[1]
    IndexError: list index out of range

stdout.txt

ndaniel commented 3 years ago

Hi jowkar,

Which version of FusionCatcher are you using?

Cheers, Daniel

jowkar commented 3 years ago

v1.33

ndaniel commented 3 years ago

I am trying to reproduce the bug. At first glance, it looks like Ensembl has changed the organism name from canis_familiaris to canis_lupus_familiaris.

jowkar commented 3 years ago

Yes, they have changed the name. The script does download some files, but some of them, such as exons.txt, end up empty. On lines 285-288 of add_custom_gene.py, the script then tries to read from this file (exons.txt) and gets nothing, which I think causes the error.
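To illustrate the failure mode described above, here is a minimal sketch of how an empty input file turns into an `IndexError`, and how a length check could surface a clearer message. It assumes the file is read into a list of rows; `read_rows` and `second_row` are hypothetical names for illustration, not functions from add_custom_gene.py.

```python
def read_rows(path):
    """Read a tab-separated file into a list of rows, skipping blank lines.

    Hypothetical helper: the real script may parse its input differently.
    """
    with open(path) as handle:
        return [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]


def second_row(path):
    """Return the second row, failing with a descriptive error if it is missing.

    Indexing rows[1] on an empty list is what raises
    "IndexError: list index out of range".
    """
    rows = read_rows(path)
    if len(rows) < 2:
        raise ValueError(
            "%s has %d row(s), expected at least 2; "
            "the download may have produced an empty file" % (path, len(rows))
        )
    return rows[1]
```

With a check like this, an empty exons.txt would be reported by name rather than as a bare index error further down the script.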

ndaniel commented 3 years ago

Yes, indeed, I can reproduce the bug, and it is related to the change from canis_familiaris to canis_lupus_familiaris in Ensembl. Several scripts need to be modified. I will push the changes here on GitHub soon, but I will not release a new official version of FusionCatcher yet.

In short, these two lines

    ense = options.organism.lower().split('_',1)
    ensembl_organism = ense[0][0]+ense[1]+'_gene_ensembl'

should be replaced with these two lines

    ense = options.organism.lower().split('_')
    ensembl_organism = ense[0][0] + ense[1] + '_gene_ensembl' if len(ense) == 2 else ense[0][0] + ense[1][0] + ense[2] + '_gene_ensembl'

in the following files:

jowkar commented 3 years ago

In get_paralogs.py, it seems the following line needs to be changed as well, from

    org = ense[0][0] + ense[1]

to

    org = ense[0][0] + ense[1] if len(ense) == 2 else ense[0][0] + ense[1][0] + ense[2]
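Both replacements in this thread apply the same rule: a two-part organism name contributes the first letter of the genus plus the species, while a three-part name contributes the first letters of the first two parts plus the last part. A minimal sketch of that rule as one shared helper follows. The function names are hypothetical and this is not the code FusionCatcher ships. The explicit error for other name lengths is an addition of this sketch.

```python
def ensembl_prefix(organism):
    """Build the short organism prefix from an underscore-separated name.

    Hypothetical helper mirroring the one-line fixes quoted in this thread.
    """
    parts = organism.lower().split('_')
    if len(parts) == 2:
        # e.g. homo_sapiens -> hsapiens
        return parts[0][0] + parts[1]
    if len(parts) == 3:
        # e.g. canis_lupus_familiaris -> clfamiliaris
        return parts[0][0] + parts[1][0] + parts[2]
    # Not covered by the fixes in the thread; fail loudly instead of guessing.
    raise ValueError("unsupported organism name: %r" % organism)


def ensembl_dataset(organism):
    """Append the '_gene_ensembl' suffix used in the quoted fix."""
    return ensembl_prefix(organism) + '_gene_ensembl'


print(ensembl_dataset('canis_familiaris'))        # cfamiliaris_gene_ensembl
print(ensembl_dataset('canis_lupus_familiaris'))  # clfamiliaris_gene_ensembl
```

Keeping the rule in one place would avoid having to patch the same expression in several scripts the next time an organism is renamed.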

ndaniel commented 3 years ago

@jowkar Indeed, that is correct!

It looks like there is still more to fix even after these changes.