legumeinfo / jira-issues

placeholder repo for issues migrating from JIRA system, to be moved to their appropriate places later
0 stars 0 forks source link

/admin/tripal/chado/tripal_feature times out #249

Open adf-ncgr opened 9 years ago

adf-ncgr commented 9 years ago

Pooja reported that /admin/tripal/chado/tripal_feature failed to load in lis-dev. After checking /var/log/httpd-error.log, I determined that it's due to a timeout. The page takes a long time to load on lis-stage, but does not time out.

Running "top" while attempting to load the page revealed that postgresql was running the entire time. Executing "SELECT * FROM pg_stat_activity;" as the psql user revealed that the following query is a bottleneck, taking over 18 seconds to run:

SELECT COUNT(*) AS expression
FROM
(SELECT 1 AS expression
FROM
chado.feature feature
LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id) subquery;

[LEGUME-281] created by nathanweeks

adf-ncgr commented 9 years ago

Thanks for looking into this Nathan. That query seems a little bizarre to me, and I'm having trouble guessing its function.
I haven't yet been able to track down where in the code it is coming from. My best offhand guess is that it's trying to count
the number of rows returned by the current query (which is not actually restricted when first going to that page) in support
of the 1-x of n rows returned banner. I guess that hypothesis could be tested by turning the latter off and seeing if it helps.
I'll give it a shot on lis-dev, just to see. Hopefully this won't confuse anything you are doing concurrently. I'll wait for a few minutes in case you want to tell me to back off....

by adf_ncgr

adf-ncgr commented 9 years ago

I don't see "1-x of n rows" in the corresponding page on peanutbase-dev? It could also be used to help generate the "1 2 3 4 5 6 7 8 9 … next › last »" links at the bottom.

by nathanweeks

adf-ncgr commented 9 years ago

you're right, I was also just seeing in the admin UI that the default feature view doesn't actually have this enabled (I must have been remembering a customization we made for gene and/or phyloytrees). Your alternative hypothesis seems plausible, though I don't know how to test it offhand.

by adf_ncgr

adf-ncgr commented 9 years ago

actually, there seems to be a way of tweaking the pager. I'm looking at it now, just to see if it changes anything.

by adf_ncgr

adf-ncgr commented 9 years ago

OK, I think I've verified that you are right about it being related to paging. If I set the pager for that view to:
Use pager: Display a specified number of items | 25 items
with an additional setting of
More link: No
then it does not appear to make that odd query, whereas if I turn it to:
More link: Yes
the query is made. FWIW, this additional query is made when this pager is set (regardless of the More link setting):
SELECT node_chado_feature.nid AS node_chado_feature_nid, feature.uniquename AS feature_uniquename, feature.name AS feature_name, cvterm.name AS cvterm_name, organism.common_name AS organism_common_name, feature.seqlen AS feature_seqlen, feature.is_obsolete AS feature_is_obsolete
FROM
chado.feature feature
LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id
ORDER BY organism_common_name ASC, cvterm_name ASC, feature_name ASC
LIMIT 25 OFFSET 0

this makes sense, but is still pretty slow presumably due to the ORDER BY.

In any case, I guess this gives us a sense for why it's happening, not sure what a good solution is
at this point, though.

by adf_ncgr

adf-ncgr commented 9 years ago

FYI, I enabled PostgreSQL query logging, with the logs going to /tmp/postgresql-*.log (a new one is created after every restart). Setting log_temp_files=0 to in postgresql.conf to log all temp file creation indicated that >70MB temp files were being created during this query; however, bumping up the value of work_mem from the default 1MB to 512MB actually hurt the performance of this query (~18 seconds -> ~21 seconds).

There is apparently a related slow query that occurs right after the previously-mentioned one:

SELECT node_chado_feature.nid AS node_chado_feature_nid, feature.uniquename AS feature_uniquename, feature.name AS feature_name, cvterm.name AS cvterm_name, organism.common_name AS organism_common_name, feature.seqlen AS feature_seqlen, feature.is_obsolete AS feature_is_obsolete
FROM 
chado.feature feature
LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id
ORDER BY organism_common_name ASC, cvterm_name ASC, feature_name ASC
LIMIT 25 OFFSET 0
LOG:  duration: 31947.940 ms

by nathanweeks

adf-ncgr commented 9 years ago

Andrew: could you set "More link: No" for that view? Then we could just worry about the second query.

by nathanweeks

adf-ncgr commented 9 years ago

The more link is currently set to "No" (on dev). Are you still seeing evidence of the first query?
I'm not sure if our comments cross-posted, but I think we're both focusing attention on the same
query; namely, the one that ends with "LIMIT 25 OFFSET 0".

Probably turning of default sorting would help; I'll try it.

by adf_ncgr

adf-ncgr commented 9 years ago

turning off sorting seems to have sped it up considerably. The slow part now is this query:

SELECT cvterm_id, name FROM chado.cvterm WHERE cvterm_id IN (SELECT distinct(type_id) FROM chado.feature) ORDER BY cvterm.name ASC

which I would guess is related to populating the dropdown filter options for type.

so, we have a sort of workaround, although I don't think it is probably where we want to leave it, as the functionality for
the end user is much less. On the other hand, we probably don't want most users using this view anyway. I think the main reason it
was an issue for Pooja is that it was preventing her from getting to the "SYNC" tab in the interface. It looks like there may be another way around this, though (basically, to turn off "auto-preview" of the view results in the admin UI for that module).

I probably have to leave this for now anyway, but I guess we at least have a better understanding of what's going on here...
feel free to continue to experiment

by adf_ncgr

adf-ncgr commented 9 years ago

PS. if you were still seeing the first query, it may have been because I forgot to "Save" the changes I made in the admin UI after previewing them; they are saved now.

by adf_ncgr

adf-ncgr commented 9 years ago

I'm still seeing the first query when I attempt to access /admin/tripal/chado/tripal_feature :

STATEMENT:  SELECT COUNT(*) AS expression
    FROM
    (SELECT 1 AS expression
    FROM
    chado.feature feature
    LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
    LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
    LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
    LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id) subquery
LOG:  duration: 17557.314 ms

by nathanweeks

adf-ncgr commented 9 years ago

Good catch regarding the other (I'll call "3rd") query. At ~13 seconds, it's definitely too slow, although not as slow as the "2nd" query.

LOG:  statement: SELECT cvterm_id, name FROM chado.cvterm WHERE cvterm_id IN (SELECT distinct(type_id) FROM chado.feature)  ORDER BY cvterm.name ASC
LOG:  duration: 13236.729 ms

by nathanweeks

adf-ncgr commented 9 years ago

just a couple of minutes between meetings, but with respect to still seeing "2nd" query, I think you are going in
through a different URL than I was. I was using:
/admin/structure/views/view/tripal_feature_user_feature/edit

I got an error when I tried to access the one you used, so probably the change I made only affected the
one that the end user gets, not the admin. I'll try to see if there's a way in to mess with the one you used
(which I suppose is what Pooja was using to get at the SYNC tab).

by adf_ncgr

adf-ncgr commented 9 years ago

looks like you can circumvent the issue by going directly to:
/admin/tripal/chado/tripal_feature/sync

It's not whether the /admin/tripal/chado/tripal_feature page would provide
any control over how the tabular display is handled, but maybe it's a non-issue.

by adf_ncgr

adf-ncgr commented 9 years ago

Nice find; that may suffice for our purposes in the short term. It would still be nice if the original page rendered, as that's going to be an annoyance for us and other Tripal users with lots of data in chado.feature.

It turns out that the primary bottleneck in the 3rd query is the "SELECT DISTINCT":

SELECT distinct(type_id) FROM chado.feature

This is apparently implemented by doing a sequential scan on the chado.feature table:

drupal=> explain SELECT distinct(type_id) FROM chado.feature; 
       QUERY PLAN       
------------------------------------------------------------------------
 HashAggregate  (cost=176753.74..176753.85 rows=11 width=4)
   ->  Seq Scan on feature  (cost=0.00..165521.79 rows=4492779 width=4)

I stumbled upon the following optimization, called a "loose indexscan", which works well when there are many rows containing few distinct values:
https://wiki.postgresql.org/wiki/Loose_indexscan

Using that to implement a faster equivalent to the 3rd query results in a run time of ~5 milliseconds (vs over 13 _seconds_):

LOG:  statement: WITH RECURSIVE t AS (
  SELECT MIN(type_id) AS type_id FROM chado.feature
  UNION ALL
  SELECT (SELECT MIN(type_id) FROM chado.feature WHERE type_id > t.type_id)
  FROM t WHERE t.type_id IS NOT NULL
)
SELECT cvterm_id, name FROM chado.cvterm WHERE cvterm_id IN 
  (SELECT type_id FROM t WHERE type_id IS NOT NULL) ORDER BY cvterm.name ASC;
LOG:  duration: 4.762 ms

I think we should update tripal_views_handler_filter_select_cvterm.inc to use the new query. Would you care to do the honors? I've created a new branch for this (issue/slow-admin-tripal-chado-tripal_feature) on lis-dev. If you're busy, I can give it a whirl.

by nathanweeks

adf-ncgr commented 9 years ago

Also, note that the optimized query uses a common table expression, which I mentioned in our last LIS/PeanutBase meeting, to do the recursive-like queries

by nathanweeks

adf-ncgr commented 9 years ago

Regarding the 1st query: I believe this is equivalent to the following query, which is faster and should be a drop-in replacement:

LOG:  statement: SELECT COUNT(*) FROM chado.feature;
LOG:  duration: 12147.364 ms

PostgreSQL is known to have slow counting:

https://wiki.postgresql.org/wiki/Slow_Counting

However, URL mentions that one can get an estimate from pg_class.reltuples, which is used by the query optimzer:

LOG:  statement: SELECT CAST(reltuples AS NUMERIC) FROM pg_class WHERE oid = 'chado.feature'::regclass;
LOG:  duration: 1.814 ms

I used CAST(reltuples AS NUMERIC) here because the column is a 32-bit float (which can't precisely store an integer in the millions).
Also, I think this value is only updated after an ANALYZE, so it may even be "more" (less?) approximate if ANALYZE hasn't been run for a while:

drupal=> SELECT CAST(reltuples AS NUMERIC) FROM pg_class WHERE oid = 'chado.feature'::regclass;
 reltuples
-----------
   4431230
(1 row)
drupal=> ANALYZE chado.feature;
drupal=> SELECT CAST(reltuples AS NUMERIC) FROM pg_class WHERE oid = 'chado.feature'::regclass;
 reltuples
-----------
   4522270
(1 row)

Do you have an idea if an approximate count would be useful in this context? For display purposes, a displayed count could be prefixed with "approximate".

by nathanweeks

adf-ncgr commented 9 years ago

sorry for the lagged response- looks like you've been busy while I've been in meetings.
I'm honored that you'd trust me to do the honors with trying to hack in that new query but
to be honest I'm a little confused.

In the method you pointed at, it appears that the sql generated
should be following this :
$sql = 'SELECT distinct(cv.cv_id) FROM chado.' . $this->view->base_table
.' LEFT JOIN chado.cvterm cvterm ON cvterm.cvterm_id=' . $this->view->base_table . '.type_id '
.'LEFT JOIN chado.cv cv ON cv.cv_id=cvterm.cv_id';

which is conceptually similar to but not formally identical to the query we saw:
SELECT distinct(type_id) FROM chado.feature

I haven't spent much time looking, but if you can straighten me out easily, it would be appreciated.

by adf_ncgr

adf-ncgr commented 9 years ago

regarding common table expressions, I guess I missed the boat by having used Sybase for too many years;
but I really enjoyed a brief glimpse at the examples given under : Outlandish Recursive Query Examples at
https://www.sqlite.org/lang_with.html#mandelbrot

on a slightly less frivolous note, here's the thread I mentioned regarding a possible recursive querying application in chado
from a discussion some months:
http://sourceforge.net/p/gmod/mailman/gmod-tripal/?viewmonth=201411&viewday=5

by adf_ncgr

adf-ncgr commented 9 years ago

I wish I knew how to use this system to thread the comments.

In any case, regarding the optimizations you suggest for the count query, I think the basic problem is that
the system will be using this same query when counting any result set for the view, including the initial one where there
are no restrictions. So, I think all the joins and such may be needed for the general way it's supposed to work.
I may be misunderstanding though...

by adf_ncgr

adf-ncgr commented 9 years ago

I meant this query in tripal_views_handler_filter_select_cvterm.inc:

$sql = "SELECT cvterm_id, name FROM {cvterm} WHERE cvterm_id IN (SELECT distinct(" . $this->field . ") FROM {" . $this->table . "}) " . $where . ' ORDER BY cvterm.name ASC';

Regarding the count: that sounds like a tougher one to resolve. It might be possible to obviate the need for a separate statement to obtain the count using the pg_stat_statements module, which provides a view containing a row ("rows") that stores the number of rows resulting from a query. I'm not sure if it would work in this case, though, as the relevant (2nd) query contains "LIMIT 25".

We may need to escalate this to the Tripal developers mailing list. We certainly won't be the only ones to encounter this issue.

by nathanweeks

adf-ncgr commented 9 years ago

I took a stab at a fix for the 3rd query:

--- a/tripal_views/views/handlers/tripal_views_handler_filter_select_cvterm.inc
+++ b/tripal_views/views/handlers/tripal_views_handler_filter_select_cvterm.inc
@@ -67,7 +67,7 @@ class tripal_views_handler_filter_select_cvterm extends tripal_views_handler_fil
 $where = ' AND ' . implode(' AND ', $where_clauses);
       }

-      $sql = "SELECT cvterm_id, name FROM {cvterm} WHERE cvterm_id IN (SELECT distinct(" . $this->field . ") FROM {" . $this->table . "}) " . $where . ' ORDER BY cvterm.name ASC';
+      $sql = "WITH RECURSIVE t AS (SELECT MIN(" . $this->field . ") AS col FROM {" . $this->table . "} " . ($where == '' ? '' : "WHERE " . $where) . " UNION ALL SELECT (SELECT MIN(" . $this->field . ") FROM {" . $this->table . "} WHERE " . $this->field . " > col " . $where . ") FROM t WHERE col IS NOT NULL) SELECT cvterm_id, name FROM {cvterm} WHERE cvterm_id IN (SELECT col FROM t where col IS NOT NULL) ORDER BY cvterm.name ASC";
       $resource = chado_query($sql);
       $cvterms = array();

This reduces the overall runtime enough to allow /admin/tripal/chado/tripal_feature to load (after a significant wait). I'll propose this patch to alert Stephen & Lacey to the issue, and get additional sets of eyes on the fix (I'm not 100% sure that I'm handling the WHERE correctly here).

The major bottleneck is still the "2nd" query, which per the postgresql query log causes ~750M of tempo files to be written!

by nathanweeks

adf-ncgr commented 9 years ago

I distilled this discussion into a bug report, and included the possible tripal_views_handler_filter_select_cvterm.inc patch:

https://www.drupal.org/node/2463211

by nathanweeks

adf-ncgr commented 9 years ago

This appears to have been fixed in the tripal code. The feature admin pages come up fine now.

by ecannon

adf-ncgr commented 7 years ago

Peanutbase.org was timing out. There were three instances of the following query that were running concurrently; each had been running for > 2 minutes:

SELECT COUNT(*) AS expression
FROM
(SELECT 1 AS expression
FROM
chado.feature feature
LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id
WHERE  ( EXISTS  (SELECT na.nid AS nid
FROM
node_access na
WHERE (( (na.gid = '0') AND (na.realm = 'all') )OR( (na.gid = '1') AND (na.realm = 'nodeaccess_rid') )OR( (na.gid = '0') AND (na.realm = 'nodeaccess_uid') )OR( (na.gid = '0') AND (na.realm = 'nodeaccess_author') ))AND (na.grant_view >= '1') AND (node_chado_feature.nid = na.nid) )) ) subquery

and once instance of the following query that had been running for almost 6 minutes:

 SELECT node_chado_feature.nid AS node_chado_feature_nid, feature.uniquename AS feature_uniquename, feature.name AS feature_name, cvterm.name AS cvterm_name, organism.common_name AS organism_common_name, feature.seqlen AS feature_seqlen, feature.is_obsolete AS feature_is_obsolete
 FROM
 chado.feature feature
 LEFT JOIN chado_feature chado_feature ON feature.feature_id = chado_feature.feature_id
 LEFT JOIN node node_chado_feature ON chado_feature.nid = node_chado_feature.nid
 LEFT JOIN chado.cvterm cvterm ON feature.type_id = cvterm.cvterm_id
 LEFT JOIN chado.organism organism ON feature.organism_id = organism.organism_id
 WHERE  ( EXISTS  (SELECT na.nid AS nid
 FROM
 node_access na
 WHERE (( (na.gid = '0') AND (na.realm = 'all') )OR( (na.gid = '1') AND (na.realm = 'nodeaccess_rid') )OR( (na.gid = '0') AND (na.realm = 'nodeaccess_uid') )OR( (na.gid = '0') AND (na.realm = 'nodeaccess_author') ))AND (na.grant_view >= '1') AND (node_chado_feature.nid = na.nid) ))
 ORDER BY cvterm_name ASC
 LIMIT 25 OFFSET 238875

by nathanweeks

adf-ncgr commented 7 years ago

A thought on this time-out bug: see if it is fixed by tripal 2.1 or tripal 3.0?

The SQL statements themselves, while not speedy, take seconds rather than minutes.

by ecannon