ufal / perl-pmltq

Query engine and query language for trees in PML format
3 stars 2 forks source link

SQLEvaluator - evaluation speed #38

Open matyaskopp opened 7 years ago

matyaskopp commented 7 years ago

Dear LINDAT people ,

The following query, submitted by one of my students, leads to a timeout when the timeout is set to (the default value) 30. It works fine (but slowly) when the timeout is set to 200.

nonterminal[cat='VP', nonterminal [cat='NP', functions='', terminal[pos='DT' and form='the']]]
>>count()

The following query, however, which is identical except for the application of the lower() function to the form attribute, runs without any problems within the default timeout limit of 30:

nonterminal[cat='VP', nonterminal [cat='NP', functions='', terminal[pos='DT' and lower(form)='the']]] >>count()

It looks somewhat counterintuitive to me that the first query would require more time than the second one. If anything, I would expect the first one to be faster, since one can directly use an index on the form attribute (if such an index exists).

Also, the progress bar gets to the end, and then starts again in the first query (but not in the second).

Both phenomena suggest to me that there might be some bug or undesirable feature, which is why I report it to you. Or if it is not a bug, can you explain to me why the first query requires more computing time than the second query?

Thanks Jan

Query speeds

Tested on The Penn Treebank 3, WSJ data set

Slower query form='the'

nonterminal[cat='VP', nonterminal [cat='NP', functions='', terminal[pos='DT' and form='the']]] >> count()

SELECT * FROM (SELECT COUNT(*) AS c1_1 FROM (
SELECT DISTINCT
  "n0"."#idx" AS "n0.#idx",
  "n1"."#idx" AS "n1.#idx",
  "n2"."#idx" AS "n2.#idx"
FROM
  "nonterminal" "n0"
   JOIN "nonterminal/cat" "n0/0" ON "n0/0"."#idx" = "n0"."cat"
   JOIN "nonterminal" "n1" ON "n0"."#idx"="n1"."#parent_idx"
   JOIN "nonterminal/cat" "n1/1" ON "n1/1"."#idx" = "n1"."cat"
  LEFT JOIN "pennmrg##9" "n1/2" ON "n1/2"."#idx" = "n1"."functions"
   JOIN "terminal" "n2" ON "n1"."#idx"="n2"."#parent_idx"
WHERE
     (( "n0/0"."#value"  =  'VP' ))
  AND "n0"."#idx"!="n1"."#idx"
  AND (( "n1/1"."#value"  =  'NP' )
    AND ( "n1/2"."#value"  =  ''  OR  "n1/2"."#value"  IS NULL))
  AND ((( "n2"."pos"  =  'DT' )
    AND ( "n2"."form"  =  'the' )))
) "#qnodes"
) "#count" LIMIT 100;

result:

 c1_1 
------
 4899
(1 row)

Time: 9732.483 ms

Faster query lower(form)='the'

nonterminal[cat='VP', nonterminal [cat='NP', functions='', terminal[pos='DT' and lower(form)='the']]]>>count()

SELECT * FROM (SELECT COUNT(*) AS c1_1 FROM (
SELECT DISTINCT
  "n0"."#idx" AS "n0.#idx",
  "n1"."#idx" AS "n1.#idx",
  "n2"."#idx" AS "n2.#idx"
FROM
  "nonterminal" "n0"
   JOIN "nonterminal/cat" "n0/0" ON "n0/0"."#idx" = "n0"."cat"
   JOIN "nonterminal" "n1" ON "n0"."#idx"="n1"."#parent_idx"
   JOIN "nonterminal/cat" "n1/1" ON "n1/1"."#idx" = "n1"."cat"
  LEFT JOIN "pennmrg##9" "n1/2" ON "n1/2"."#idx" = "n1"."functions"
   JOIN "terminal" "n2" ON "n1"."#idx"="n2"."#parent_idx"
WHERE
     (( "n0/0"."#value"  =  'VP' ))
  AND "n0"."#idx"!="n1"."#idx"
  AND (( "n1/1"."#value"  =  'NP' )
    AND ( "n1/2"."#value"  =  ''  OR  "n1/2"."#value"  IS NULL))
  AND ((( "n2"."pos"  =  'DT' )
    AND (LOWER( "n2"."form" ) =  'the' )))
) "#qnodes"
) "#count" LIMIT 100;

result:

 c1_1 
------
 4902
(1 row)

Time: 1845.408 ms

m1ch4ls commented 7 years ago

Can you run the sql queries with EXPLAIN? It should show the execution plan - which in this case might be the problem.

choroba commented 7 years ago

Also, did you run the tests several times to eliminate DB cold start/caching?

matyaskopp commented 7 years ago

Runned with EXPLAIN (ANALYZE TRUE)

form

                                                                                            QUERY PLAN                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=51359.79..51359.81 rows=1 width=8) (actual time=7842.010..7842.011 rows=1 loops=1)
   ->  Aggregate  (cost=51359.79..51359.80 rows=1 width=0) (actual time=7842.008..7842.008 rows=1 loops=1)
         ->  HashAggregate  (cost=51359.77..51359.78 rows=1 width=12) (actual time=7840.322..7841.502 rows=4899 loops=1)
               ->  Nested Loop  (cost=4.01..51359.76 rows=1 width=12) (actual time=700.981..7833.317 rows=4899 loops=1)
                     ->  Nested Loop  (cost=3.58..51353.38 rows=1 width=8) (actual time=691.348..7395.597 rows=59755 loops=1)
                           ->  Nested Loop  (cost=3.15..51347.07 rows=1 width=12) (actual time=691.336..6793.372 rows=167105 loops=1)
                                 ->  Nested Loop  (cost=2.73..51340.77 rows=1 width=16) (actual time=691.223..4373.315 rows=690388 loops=1)
                                       ->  Merge Left Join  (cost=2.30..51334.46 rows=1 width=12) (actual time=691.152..1596.471 rows=738953 loops=1)
                                             Merge Cond: (n1.functions = "n1/2"."#idx")
                                             Filter: ((("n1/2"."#value")::text = ''::text) OR ("n1/2"."#value" IS NULL))
                                             Rows Removed by Filter: 243469
                                             ->  Index Scan using "#i_pennmrg_9" on nonterminal n1  (cost=0.42..39897.52 rows=978873 width=16) (actual time=0.011..959.929 rows=978873 loops=1)
                                             ->  Index Scan using "#i_pennmrg_2" on "pennmrg##9" "n1/2"  (cost=0.42..7656.45 rows=243469 width=8) (actual time=0.007..76.722 rows=243469 loops=1)
                                       ->  Index Scan using nonterminal_pkey on nonterminal n0  (cost=0.42..6.30 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=738953)
                                             Index Cond: ("#idx" = n1."#parent_idx")
                                             Filter: ("#idx" <> n1."#idx")
                                 ->  Index Scan using "#i_pennmrg_3" on "nonterminal/cat" "n0/0"  (cost=0.42..6.30 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=690388)
                                       Index Cond: ("#idx" = n0.cat)
                                       Filter: (("#value")::text = 'VP'::text)
                                       Rows Removed by Filter: 1
                           ->  Index Scan using "#i_pennmrg_3" on "nonterminal/cat" "n1/1"  (cost=0.42..6.30 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=167105)
                                 Index Cond: ("#idx" = n1.cat)
                                 Filter: (("#value")::text = 'NP'::text)
                                 Rows Removed by Filter: 1
                     ->  Index Scan using "#i_pennmrg_17" on terminal n2  (cost=0.43..6.37 rows=1 width=8) (actual time=0.006..0.007 rows=0 loops=59755)
                           Index Cond: ("#parent_idx" = n1."#idx")
>>>FORM>>>>>>              Filter: (((pos)::text = 'DT'::text) AND ((form)::text = 'the'::text))
                           Rows Removed by Filter: 1
 Total runtime: 7842.210 ms

lower(form)

                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48426.30..48426.32 rows=1 width=8) (actual time=1837.483..1837.484 rows=1 loops=1)
   ->  Aggregate  (cost=48426.30..48426.31 rows=1 width=0) (actual time=1837.481..1837.482 rows=1 loops=1)
         ->  HashAggregate  (cost=48426.27..48426.28 rows=1 width=12) (actual time=1835.799..1836.997 rows=4902 loops=1)
               ->  Nested Loop  (cost=2.12..48426.27 rows=1 width=12) (actual time=0.119..1829.437 rows=4902 loops=1)
                     ->  Nested Loop  (cost=1.70..48419.96 rows=1 width=16) (actual time=0.110..1803.340 rows=4909 loops=1)
                           ->  Nested Loop  (cost=1.27..48413.65 rows=1 width=20) (actual time=0.073..1566.176 rows=44169 loops=1)
                                 ->  Nested Loop Left Join  (cost=0.84..48407.34 rows=1 width=16) (actual time=0.064..1334.293 rows=44171 loops=1)
                                       Filter: ((("n1/2"."#value")::text = ''::text) OR ("n1/2"."#value" IS NULL))
                                       Rows Removed by Filter: 15234
                                       ->  Nested Loop  (cost=0.42..45239.32 rows=509 width=20) (actual time=0.053..1209.099 rows=59395 loops=1)
                                             ->  Seq Scan on terminal n2  (cost=0.00..41143.73 rows=509 width=8) (actual time=0.030..732.458 rows=59395 loops=1)
>>>FORM>>>>>>                                      Filter: (((pos)::text = 'DT'::text) AND (lower((form)::text) = 'the'::text))
                                                   Rows Removed by Filter: 1193618
                                             ->  Index Scan using nonterminal_pkey on nonterminal n1  (cost=0.42..8.04 rows=1 width=16) (actual time=0.007..0.007 rows=1 loops=59395)
                                                   Index Cond: ("#idx" = n2."#parent_idx")
                                       ->  Index Scan using "#i_pennmrg_2" on "pennmrg##9" "n1/2"  (cost=0.42..6.21 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=59395)
                                             Index Cond: ("#idx" = n1.functions)
                                 ->  Index Scan using nonterminal_pkey on nonterminal n0  (cost=0.42..6.30 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=44171)
                                       Index Cond: ("#idx" = n1."#parent_idx")
                                       Filter: ("#idx" <> n1."#idx")
                           ->  Index Scan using "#i_pennmrg_3" on "nonterminal/cat" "n0/0"  (cost=0.42..6.30 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=44169)
                                 Index Cond: ("#idx" = n0.cat)
                                 Filter: (("#value")::text = 'VP'::text)
                                 Rows Removed by Filter: 1
                     ->  Index Scan using "#i_pennmrg_3" on "nonterminal/cat" "n1/1"  (cost=0.42..6.30 rows=1 width=4) (actual time=0.002..0.005 rows=1 loops=4909)
                           Index Cond: ("#idx" = n1.cat)
                           Filter: (("#value")::text = 'NP'::text)
                           Rows Removed by Filter: 0
 Total runtime: 1837.597 ms
m1ch4ls commented 7 years ago

I would expect the second query to be slower because it contains Seq Scan - obviously -> the lower(term) values are not indexed.

Try to assign more memory to postgres - maybe the indexes are too large. The 1st query should be quicker. Look at: http://dba.stackexchange.com/questions/18484/tuning-postgresql-for-large-amounts-of-ram#18486

If unsure try: http://pgtune.leopard.in.ua/