Increase timeout on `/edges` endpoint?

vincerubinetti commented 6 months ago

We have a user facing issues with the network visualization showing "Error loading edges" in a few cases, when using the more complex model (mlmodel=2).

The first issue is with these three selected genes: https://adage.greenelab.com/genes?model=2&genes=290-1379-4229

I'm not exactly sure why this is happening, but I think this can and should be fixed on the frontend side, because I'm querying the `/edge` endpoint with a very high limit, `?limit=999999`. If I reduce this to 999, this particular problem goes away. I will implement this in the frontend, as we realistically can't show that many edges on the page anyway. I assume I originally used such a high limit to be consistent with the other endpoints, where we're trying to get "all results".
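A minimal sketch of the capped-limit idea, written in Python only to illustrate the logic (the actual frontend is not Python). The helper name `edge_url` and the constant `MAX_EDGE_LIMIT` are hypothetical; the base URL and the `limit`, `mlmodel`, and `genes` parameters are taken from the request shown later in this thread.

```python
from urllib.parse import urlencode

# Base URL as it appears in the failing request quoted in this issue.
API_BASE = "https://api-adage.greenelab.com/api/v1/edge/"
# Hypothetical constant: 999 is the value reported to work above.
MAX_EDGE_LIMIT = 999


def edge_url(genes, mlmodel, limit=MAX_EDGE_LIMIT):
    """Build an edge query URL, never asking for more than MAX_EDGE_LIMIT edges."""
    capped = min(limit, MAX_EDGE_LIMIT)
    query = urlencode(
        {
            "limit": capped,
            "mlmodel": mlmodel,
            "genes": ",".join(str(g) for g in genes),
        },
        safe=",",  # keep the comma-separated gene list readable
    )
    return f"{API_BASE}?{query}"


url = edge_url([290, 1379, 4229], mlmodel=2, limit=999999)
```

Capping in one helper means every caller gets the same upper bound, even if it still passes the old "get everything" value.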

The second issue is what this GitHub issue is about. The user is trying to fetch the edge list for this long set of genes:

https://api-adage.greenelab.com/api/v1/edge/?limit=99&mlmodel=2&genes=3311,358,793,1614,1778,1863,1990,1970,2011,2088,5444,2113,1658,2705,608,34,2804,2945,2956,1011,1617,190,3105,3095,3060,4475,4470,554,4508,4515,4661,4628,4652,4608,498,2912,3653,928,5453,2412,11,2717,852,3647,3988,4095,3536,4501,4266,5463,5452,5637,5505,4364,5405,880,5455,723,4505,91,1596,3043,3131,4522,4276,516,596,3204,4397,4629,5529,4640,4597,5280,5281,5218,2229,4935,5135,5129,4758,4759,4767,5349,5363,5356,2952,5061,5084,5056,5327,2934,1413,5273,5270,1257,5354,5198,4982,5184,2950,2951,5296,4831,4842,4844,4850,4875,4884,4909,2416,2267,1403,1404,300,1399,1429,1196,2252,2253,976,977,1561,900,2590,2585,2631,2552,2870,1564,1554,667,1538,2786,2760,2750,2941,2362,2331,2317,1229,1340,1370,1371,1360,1385,1431,1461,1463,1464,1486,1272,1289,578,319,2464,2732,2625,4103,4106,4076,3943,3861,3791,3822,3758,3490,3529,217,3365,3634,3721,3738,4116,4121,4230,2488,2582,311,1060,4580,4016,3949,3832,3669,3586,3559,766,3301,1250,2386,2382,2383,2396

The request results in 502 Bad Gateway after seemingly exactly 30 seconds. I'm fairly sure this is just a timeout; any time it fails, it's always 30 seconds. The backend probably just needs more time to process the information.

If I reduce the list to just the first 5 or so, it returns successfully. I think that certain genes are perhaps more heavily connected and take longer to process, because in certain cases, depending on which genes I remove, I can fit dozens of genes in there and it will take < 30 seconds.

falquaddoomi commented 6 months ago

Thanks for raising this, @vincerubinetti. It's possible those two issues are related, i.e. that the same timeout is causing both the failure when loading a large number of edges and the failure when requesting the long set of genes. I'll look into it ASAP and report back with findings.

falquaddoomi commented 5 months ago

So, it turned out to be Gunicorn's worker timeout, which defaults to 30 seconds. I had suspected it was the backend timing out rather than the proxy, since nginx's default is 60 seconds, but in any case I've increased both Gunicorn's and the reverse proxy's timeout to 5 minutes. Still testing it out, but expect a PR soon.
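For reference, a sketch of what the Gunicorn side of that change could look like as a config file. `timeout` is a real Gunicorn setting (default 30 seconds); the file name and the choice to use a config file rather than the `--timeout` command-line flag are assumptions, not taken from the actual deployment.

```python
# gunicorn.conf.py (hypothetical file; the real deployment may pass --timeout instead)

# Seconds a worker may go without responding before Gunicorn kills and
# restarts it. The default of 30 matches the observed 30-second failures.
timeout = 300  # 5 minutes, the value described above
```

If the reverse proxy is nginx, the matching knob for a proxied upstream is the `proxy_read_timeout` directive, which defaults to 60 seconds.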

falquaddoomi commented 5 months ago

So, I ended up having to increase the timeout to 30 minutes; that second request with the long list of genes took ~17 minutes total to complete. Technically changing the timeout fixes the issue, but it makes me a little uncomfortable to think that it would take just one of these queries per worker to deny service to the API.

Looking into the implementation of `/edge/` a bit (https://github.com/greenelab/adage-backend/blob/master/adage/analyses/views.py#L195-L212), it looks like it's nonlinear in the length of the gene list: it finds all the edges that involve each gene, compiles the full set of related genes from those edges, then looks for the set of all edges in which any of these related genes are mentioned as well. Depending on how well-connected the input genes are, this could produce a large set on which to perform the `in` operation on both the `gene1` and `gene2` columns of a presumably large table.
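The expansion described above can be sketched in plain Python over an in-memory edge list. This is not the code in `views.py`; the `Edge` tuple and `expand_edges` function are hypothetical, and the final step assumes an edge qualifies when either endpoint is in the related set.

```python
from collections import namedtuple

# Hypothetical stand-in for a row of the edge table.
Edge = namedtuple("Edge", ["gene1", "gene2"])


def expand_edges(edges, query_genes):
    """Two-hop expansion: direct edges, their genes, then all edges touching those genes."""
    query = set(query_genes)
    # Step 1: edges that involve any queried gene.
    direct = [e for e in edges if e.gene1 in query or e.gene2 in query]
    # Step 2: every gene that appears on those edges.
    related = {g for e in direct for g in (e.gene1, e.gene2)}
    # Step 3: all edges that mention any related gene (assumed OR semantics).
    expanded = [e for e in edges if e.gene1 in related or e.gene2 in related]
    return direct, related, expanded


edges = [Edge(1, 2), Edge(2, 3), Edge(3, 4), Edge(5, 6)]
direct, related, expanded = expand_edges(edges, [1])
# direct   -> [Edge(1, 2)]
# related  -> {1, 2}
# expanded -> [Edge(1, 2), Edge(2, 3)]
```

The size of `related` depends on how connected the input genes are, not just on how many there are, which would be consistent with the observation that some gene lists finish quickly while shorter ones time out.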

Here are a few options:

- we limit the number of genes someone can query for to something that completes in a reasonable time, then perhaps provide alternative options (downloading the database?) for users who need more
- we try some indexing techniques to make expanding the edge list more efficient
- we just leave things as-is with the 30 minute timeout and hope the server can keep up
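As a rough illustration of the first option, here is a sketch of validating the `genes` parameter before running the query. The cap of 50, the exception class, and the function name are all hypothetical; a real limit would have to come from measuring what completes in a reasonable time.

```python
# Hypothetical cap; not a measured or agreed-upon value.
MAX_GENES = 50


class TooManyGenesError(ValueError):
    """Raised when a query asks for more genes than the endpoint will process."""


def parse_gene_ids(raw, max_genes=MAX_GENES):
    """Parse a comma-separated gene list, dropping duplicates and enforcing the cap."""
    ids = [int(part) for part in raw.split(",") if part.strip()]
    unique = list(dict.fromkeys(ids))  # de-duplicate while preserving order
    if len(unique) > max_genes:
        raise TooManyGenesError(
            f"{len(unique)} genes requested; at most {max_genes} are supported per query"
        )
    return unique
```

Rejecting the request up front would let the API return a descriptive error immediately instead of failing after a long wait.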

vincerubinetti commented 5 months ago

> then perhaps provide alternative options (downloading the database?) for users who need more

Dongbo may have left instructions somewhere for how to do this, iirc.

> we limit the number of genes someone can query for to something that completes in a reasonable time,

A timeout sort of does this, via just failing and showing an error. It doesn't explain to the user why, though.

> we just leave things as-is with the 30 minute timeout and hope the server can keep up

Honestly I would vote for this, given the low number of users that still actively use this app. @cgreene do you have thoughts?

cgreene commented 5 months ago

I think leaving it at the 30 min timeout and seeing what happens makes sense. Do we have some monitoring so we can know if it starts to time out for basic queries b/c the workers are fully consumed?

falquaddoomi commented 5 months ago

> Do we have some monitoring so we can know if it starts to timeout for basic queries b/c the workers are fully consumed?

We do; a while back I set up an uptime alert for adage that performs a trivial query, and it should fail if the workers are overloaded.
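A minimal sketch of what such a probe might do, assuming it simply issues a cheap request with a short client-side timeout. The function name, the 10-second default, and the injectable `opener` parameter are hypothetical; the actual alert is configured in a separate monitoring service and may work differently.

```python
import urllib.request


def api_is_responsive(url, timeout_s=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 within timeout_s seconds."""
    try:
        with opener(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, and socket timeouts, all OSError subclasses.
        return False
```

If every worker is tied up with long-running edge queries, even a trivial request would sit in the queue past the probe's timeout and be reported as a failure.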

Let's leave it as-is, then; I'll keep an eye on it and try one of the other solutions if it becomes unresponsive. I'm going to close this issue for now. If it becomes a problem, I'd opt to create a new issue about optimizing or scaling this endpoint.

vincerubinetti commented 5 months ago

Just tested the many gene case above, and it timed out after about 30 min. I feel like if a computation is that complex, we should maybe just say it's not supported by the app?

greenelab / adage-backend

Increase timeout on `/edges` endpoint? #68