Generic /association/find route returning 500

cmungall commented 6 years ago

https://api.monarchinitiative.org/api/association/find/?rows=0

deepakunni3 commented 6 years ago

@cmungall The issue here is that the connection times out:

pysolr.SolrError: Connection to server 'https://solr.monarchinitiative.org/solr/golr/select/?q=%2A%3A%2A&facet=on&facet.field=subject_taxon_label&facet.field=object_closure&facet.limit=25&facet.mincount=1&fl=id%2Cis_defined_by%2Csource%2Csubject%2Csubject_label%2Csubject_taxon%2Csubject_taxon_label%2Crelation%2Crelation_label%2Cobject%2Cobject_label%2Cobject_taxon%2Cobject_taxon_label%2Cevidence_object%2Cevidence_graph%2Creference%2Cqualifier%2Cis_redundant_for%2Ctype%2Cevidence%2Cevidence_label%2Cevidence_type%2Cevidence_type_label%2Cevidence_with%2Cevidence_closure%2Cevidence_closure_label%2Cevidence_subset_closure%2Cevidence_subset_closure_label%2Cevidence_type_closure%2Cevidence_type_closure_label&rows=10&wt=json' 
timed out: HTTPSConnectionPool(host='solr.monarchinitiative.org', port=443): 
Read timed out. (read timeout=5)

5 seconds seems insufficient for Monarch Solr to respond to the request sent by this call. Since this is a configurable property, perhaps we can change the timeout to 10 seconds in the config.yaml.

cmungall commented 6 years ago

that's odd, I think rather than increasing the timeout we should explore why a basic query like this takes so long. We do similar things with the GO one all the time

E.g. the call underlying this should be pretty much the same: http://amigo.geneontology.org/amigo/search/annotation

cc @kltm

kltm commented 6 years ago

I'd first check used memory versus size of store on the machine. If you can skype me the underlying machine name, I can check myself.

kltm commented 6 years ago

Okay, I took a quick look at this. Some notes:

the solr index more-or-less given 30G
- I'm not sure if the heap allocation in java will allocate more, but it may well
the machine has had some swapping history
- free -m
the machine is using a rotational disk
- cat /sys/block/sr0/queue/rotational
the index is well north of 250G
load is low

Given all of this, my best guess is that you have insufficient RAM to accomplish the task or that you are so close to the machine RAM limits that you occasionally manage to trigger the need for swap, slowing things down.

My recommendation would be to start by dropping the RAM given to Solr by a couple of G (to prevent possible swapping) and see if things improve. Otherwise, you may want to look in to a beefier machine.

That said, I'm not sure this is the actual cause, but obviously not good either way. More information could be gotten from the logs at the time of the 500 and seeing what it is about the queries around then, and trying to duplicate the issue from the machine.

kltm commented 6 years ago

Tagging @kshefchek

cmungall commented 6 years ago

There is something fishy here. I have queried the monarch solr directly many times and it's fairly responsive, not sure why the basic fetch all documents query would be so slow, it's possible the code is constructing an ultra-dumb query, maybe it's retrieving every single facet value for counts..?

deepakunni3 commented 6 years ago

Here is the query generated by BioLink-API/Ontobio: https://solr.monarchinitiative.org/solr/golr/select/?q=*:*&facet=on&facet.field=subject_taxon_label&facet.field=object_closure&facet.limit=25&facet.mincount=1&fl=id,is_defined_by,source,subject,subject_label,subject_taxon,subject_taxon_label,relation,relation_label,object,object_label,object_taxon,object_taxon_label,evidence_object,evidence_graph,reference,qualifier,is_redundant_for,type,evidence,evidence_label,evidence_type,evidence_type_label,evidence_with,evidence_closure,evidence_closure_label,evidence_subset_closure,evidence_subset_closure_label,evidence_type_closure,evidence_type_closure_label&rows=10&wt=json

There are two closures for facet fields:

subject_taxon_label
object_closure

Do we need facets for object_closure in this query? Removing that from facet.field reduces the query run from 8 seconds to 0.5 second.

cmungall commented 6 years ago

The object closure is generally required if we want to allow faceted browsing on the object (HPO term, GO term, ...). (although at the moment the default upper level terms are too broad to be interesting).

This gives us something more to go on though. The corresponding amigo query includes the equivalent isa_partof_closure. Tweaking one of the solr params related to facets may give more info.

The biolink call could include an option for including the object closure and have this off by default if it's generally expensive, but one of the main reasons to use solr is to get inferred facet counts with ease

kltm commented 6 years ago

It's good to have a test case; it's worrying that it works every time--that should be getting cached. I would note that looking at the above query running on the machine, it seems to peg a couple of CPUs for the duration of the query, but little I/O activity (that I've found so far).

I would still start poking the memory settings here--no matter what, history of swapping would indicate a problem. Also, this is apparently not a real machine (KVM); the CPU here may be hiding the I/O to the real system or similar. It might we worth checking the status of the host machine.

deepakunni3 commented 6 years ago

(although at the moment the default upper level terms are too broad to be interesting)

Yes, thats what made me wonder if its useful to have object_closure facet field. But it definitely makes sense to have in in there, by default, since that would be the expectation for this call.

monarch-initiative / biolink-api

Generic /association/find route returning 500 #191