nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

URL update needed for the latest MEROPs database file #1054

Open calizilla opened 2 months ago

calizilla commented 2 months ago

Are you using the latest release? v 1.8.17

Describe the bug: Funannotate pulls down this MEROPs db file, which has 5009 genes and was last updated in 2019: https://ftp.ebi.ac.uk/pub/databases/merops/current_release/merops_scan.lib

The latest file (updated 2023) contains 5098 genes and has URL: https://ftp.ebi.ac.uk/pub/databases/merops/current_release/meropsscan.lib
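
For anyone who wants to check the entry counts locally, here is a minimal sketch. It assumes the `.lib` file is plain FASTA with one `>` header line per entry; the function name and the sample records are illustrative and not part of funannotate.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory example with made-up records. For a downloaded file, open it
# instead, e.g. `with open("meropsscan.lib") as fh: count_fasta_records(fh)`.
sample = io.StringIO(">seq1\nMKV\n>seq2\nMAA\nGGG\n")
print(count_fasta_records(sample))
```

Running the same function on both downloaded files should give the two counts being compared here, provided the FASTA assumption holds.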

Simple change to line 141 of script funannotate/setupDB.py from:

fasta = os.path.join(FUNDB, 'merops_scan.lib')

to:

fasta = os.path.join(FUNDB, 'meropsscan.lib')

and change line 199 of funannotate/resources.py from:

"merops": "https://ftp.ebi.ac.uk/pub/databases/merops/current_release/merops_scan.lib",

to:

"merops": "https://ftp.ebi.ac.uk/pub/databases/merops/current_release/meropsscan.lib",

will resolve the issue.
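
As an alternative to hardcoding a single filename, the lookup could try the known filenames in order. This is only a sketch under assumptions: `MEROPS_BASE`, `CANDIDATES`, and `resolve_merops_url` are hypothetical names, not funannotate's actual code, and the existence check is passed in as a function so the logic can be exercised without network access.

```python
MEROPS_BASE = "https://ftp.ebi.ac.uk/pub/databases/merops/current_release/"
# Newest filename first, older filename as a fallback (both taken from this issue).
CANDIDATES = ["meropsscan.lib", "merops_scan.lib"]


def resolve_merops_url(url_exists, base=MEROPS_BASE, candidates=CANDIDATES):
    """Return the first candidate URL for which url_exists(url) is true.

    url_exists is a caller-supplied function (hypothetical here); in practice
    it might issue an HTTP HEAD request and check the status code.
    """
    for name in candidates:
        url = base + name
        if url_exists(url):
            return url
    raise FileNotFoundError(
        "none of the MEROPS filenames were found: " + ", ".join(candidates)
    )


# Stubbed check that pretends only the old filename is present on the server.
print(resolve_merops_url(lambda url: url.endswith("merops_scan.lib")))
```

Note that, per the description above, both files currently exist on the server, so the order of `CANDIDATES` is what decides which one is picked.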

hyphaltip commented 2 months ago

I'll make this change, but ultimately this looks like a problem with the MEROPS release: the latest release does not use the same file name as the previous one. Did you also inform them of this issue? It seems like this will bite a lot of people who assume the filename structure stays the same between releases.

calizilla commented 2 months ago

@hyphaltip thanks for the fix; and fair point - I just emailed merops@ebi.ac.uk to advise of the issue and suggested they maintain copies at both filenames

hyphaltip commented 1 month ago

I pushed the new version as the default. It required a manual change to the code, as the version is hardcoded there (@nextgenusfs?). We can fix this in funannotate2, though I wish EBI would provide the version number in a parseable form in their repository.

calizilla commented 1 month ago

@hyphaltip thanks. I still have not heard back from EBI regarding the issue on their end.

Just wondering why funannotate chooses to use the meropscan.lib database rather than the pepunit.lib? I have now re-annotated my genomes (one fungus and one plant) using pepunit.lib and obtained far more MEROPs hits against pepunit (see below table). This is the number of unique MEROPs annotated genes, not the total number of hits to the respective database.

| | meropscan.lib | pepunit.lib |
|---|---|---|
| plant | 1180 | 2676 |
| fungus | 492 | 1804 |

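
To make the distinction between unique annotated genes and total hits concrete, here is a minimal sketch. It assumes tab-delimited hit lines with the query (gene) ID in the first column, as in BLAST-style tabular output; the function name and the sample IDs are made up for illustration and do not come from funannotate or from the runs reported above.

```python
def count_unique_queries(lines):
    """Return (total_hits, unique_genes) from tab-delimited hit lines.

    Assumes the query (gene) ID is the first column; blank lines and
    lines starting with '#' are skipped.
    """
    total = 0
    genes = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        total += 1
        genes.add(line.split("\t")[0])
    return total, len(genes)


# Made-up example: gene1 has two hits, gene2 has one.
hits = ["gene1\thitA\t95.0", "gene1\thitB\t90.1", "gene2\thitA\t88.3"]
print(count_unique_queries(hits))  # (3, 2)
```

The table counts correspond to the second number, so a gene with several database hits is counted once.
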
hyphaltip commented 1 month ago

That's a question for Jon (@nextgenusfs); he implemented this.

In my own work, if I am doing comparative genomics, I end up running my own suite of protein domain profiling on the predicted proteins rather than worrying about the annotation that is part of the final GenBank record, since I would likely want to run this against the most up-to-date versions of the DBs. So it's good that you can get your own results for the DB you want rather than depending on funannotate for that, as these are just added annotations in the GenBank files.

I'm not familiar with the nuances of these MEROPS files anyway, so if you have an explanation of what each provides, maybe there is a better one for the general goals of the toolkit here.