gyorilab / indra_cogex

INDRA Context Graph Extension
BSD 2-Clause "Simplified" License

Add version pins to various pyobo calls #169

Open kkaris opened 3 months ago

kkaris commented 3 months ago

I recently ran into a timeout when testing one of the frontend apps at discovery.indra.bio. On the backend, the cause turned out to be pyobo downloading new files. To resolve this, we can add version pins to the pyobo calls wherever they show up, so that no downloads are triggered at runtime when requests to the various apps come in.

See also: https://github.com/biopragmatics/pyobo/pull/181 and https://github.com/biopragmatics/pyobo/pull/184.
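
As a rough sketch of what such a pin could look like, the snippet below wraps a lookup function so that a fixed version is always passed along. It assumes the wrapped function accepts a `version` keyword argument (as, to my understanding, `pyobo.get_name` does). `PINNED_VERSIONS`, `pin_version` and the stand-in `fake_get_name` are hypothetical names used only for illustration, and the version string is just an example value.

```python
from functools import wraps
from typing import Callable, Dict

# Hypothetical pins: resource prefix -> version that is already cached locally.
PINNED_VERSIONS: Dict[str, str] = {"eccode": "2024-05-29"}


def pin_version(lookup: Callable[..., str]) -> Callable[..., str]:
    """Wrap a lookup so the pinned version for its prefix is always passed."""

    @wraps(lookup)
    def wrapper(prefix: str, identifier: str, **kwargs):
        # Only set the version if the caller did not and a pin exists.
        if "version" not in kwargs and prefix in PINNED_VERSIONS:
            kwargs["version"] = PINNED_VERSIONS[prefix]
        return lookup(prefix, identifier, **kwargs)

    return wrapper


# Stand-in for a real lookup such as pyobo.get_name (assumption);
# it only echoes its inputs so the sketch runs without pyobo installed.
def fake_get_name(prefix, identifier, *, version=None):
    return f"{prefix}:{identifier}@{version}"


get_name_pinned = pin_version(fake_get_name)
```

Keeping the pins in one mapping would mean a resource update only needs a change in one place, but whether that fits the existing code layout is an open question.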

bgyori commented 3 months ago

Could you check which resources specifically are implicated?

kkaris commented 3 months ago

The one I saw in my timeout had to do with ec-codes:

INFO: [2024-06-03 18:15:51] pystow.utils - downloading with urllib from ftp://ftp.expasy.org/databases/enzyme/enzclass.txt to /root/.data/pyobo/raw/eccode/2024-05-29/enzclass.txt
INFO: [2024-06-03 18:15:53] pystow.utils - downloading with urllib from ftp://ftp.expasy.org/databases/enzyme/enzyme.dat to /root/.data/pyobo/raw/eccode/2024-05-29/enzyme.dat
INFO: [2024-06-03 18:15:55] pystow.utils - downloading with urllib from http://current.geneontology.org/ontology/external2go/ec2go to /root/.data/pyobo/raw/eccode/2024-05-29/ec2go.tsv

I'll get a list of all resources that are implicated.

kkaris commented 3 months ago

I'm excluding pyobo calls that are in processors, as they are not used when serving the REST API for the discovery apps. I found two instances:

kkaris commented 3 months ago

Re the EC-codes: The HGNCEnzymeProcessor actually uses the bioontology, so we could do one of the following:

  1. Switch the pyobo name lookups in client/enrichment/mla.py to bioontology calls instead, or
  2. Switch the bioontology call in the HGNCEnzymeProcessor to pyobo calls, or
  3. Just avoid the name lookup altogether in client/enrichment/mla.py by modifying the query to also get the name from the node

~I think option 3 makes the most sense, then we always stay consistent with the data in the database.~ There are two use cases there: a) getting names for ec-codes that exist in CoGEx, and b) getting hgnc ids, translating them to ec-codes, then getting the name. In a) we can replace the lookup by simply querying for the name as well, but in b) we still need to look up the name. In that case I think option 2 above is the way to go, since that's what we use to create the DB.
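
To make the two use cases concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Cypher query text, the `BioEntity` label, the record shape (dicts with `id` and `name` keys), and the helper names `names_from_records` and `fill_missing_names`. The `lookup` argument stands in for whichever name lookup ends up being used.

```python
from typing import Callable, Dict, Iterable, Mapping, Optional

# Hypothetical query for use case a): return the name together with the node id.
NAME_QUERY = (
    "MATCH (n:BioEntity) WHERE n.id IN $ids "
    "RETURN n.id AS id, n.name AS name"
)


def names_from_records(records: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, str]:
    """Use case a): take names straight from the query result, skipping empty ones."""
    return {rec["id"]: rec["name"] for rec in records if rec.get("name")}


def fill_missing_names(
    ids: Iterable[str],
    known: Mapping[str, str],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Use case b): only call the lookup for ids whose name is not known yet."""
    names = dict(known)
    for identifier in ids:
        if identifier not in names:
            name = lookup(identifier)
            if name is not None:
                names[identifier] = name
    return names
```

With this split, the external lookup is only reached for identifiers that did not come back from the database with a name.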