postgrespro / rum

RUM access method - inverted index with additional information in posting lists
Other
720 stars 54 forks source link

int/bigint instead of timestamps #15

Closed NikolayS closed 6 years ago

NikolayS commented 7 years ago

As we discussed at the recent Moscow meetup, for many cases it'd be very good to be able to use RUM to keep int/bigint information and speed up queries like

SELECT ... WHERE tsvector @@ ... ORDER BY price

Similar request from the -general: http://www.postgresql-archive.org/RUM-index-and-support-for-storing-BIGINT-as-part-of-a-tsvector-timestamp-tt5938725.html#a5942457

obartunov commented 7 years ago

On Mon, Apr 3, 2017 at 12:36 PM, Nikolay notifications@github.com wrote:

As we discussed at the recent Moscow meetup, for many cases it'd be very good to be able to use RUM to keep int/bigint information and speed up queries like

SELECT ... WHERE tsvector @@ ... ORDER BY price

Similar request from the -general: http://www.postgresql-archive. org/RUM-index-and-support-for-storing-BIGINT-as-part-of-a- tsvector-timestamp-tt5938725.html#a5942457

yes, it's possible. We probably will implement something like we did for btree_gin - generate a bunch of opclasses for all built-in data types.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/postgrespro/rum/issues/15, or mute the thread https://github.com/notifications/unsubscribe-auth/AGFI4suxoz0xPaHOo1CZnLN4zarOwGyCks5rsSAJgaJpZM4MxyhH .

darthunix commented 7 years ago

And if rum keeps int/bigint information would queries like select id from table where tsvector @@ .. (where id - int value in rum index) still need recheck in heap (except needs of visibility map)?

andreak commented 7 years ago

I'm so glad to see that this will be addressed!

Note that what I asked about in the referred email was not BIGINT instead of TIMESTAMP, but in addition to. Will storing TSVECTOR and both BIGINT and TIMESTAMP (for ordering) in the same RUM-index be possible, so we can have this:

WHERE
    fts_all @@ to_tsquery('simple', 'andre:*&jose:*') 
    AND folder_id = ANY(ARRAY[2,3]::BIGINT[]) 
ORDER BY received_timestamp DESC LIMIT 10

and have it use one index only? If so this will be ultimately coolio!

Thanks.

NikolayS commented 7 years ago

@andreak really good example, I also see lots of similar cases

andreak commented 7 years ago

I'm not quite sure where to post questions about RUM so I've posted another question to -general: http://www.postgresql-archive.org/How-to-include-BIGINT-column-in-RUM-index-sorted-by-timestamp-td5961689.html Is posting to -general the "correct" way to ask questions about RUM? (creating new issues here on github seems strange)

za-arthur commented 7 years ago

I think both ways are appropriate. Oleg Bartunov and me are monitoring -general. Also you can create new issues here.

andreak commented 7 years ago

Ah, good to know, thanks.

feodor commented 7 years ago

Hi!

I checked your example and what I want to say:

ANY clause is only supported by bitmap scan. Obviously, bitmap scan cannot return data in order desired by ORDER BY clause - sort should be performed. If you don't use ANY clause postgres will be able execute all conditions in index:

EXPLAIN (analyze on, COSTS OFF) SELECT

 del.received_timestamp,
 del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
 del.folder_id

FROM email_delivery del WHERE del.fts_all @@ to_tsquery('simple', 'andre:&jose:') AND del.folder_id = 44965::bigint ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP ; PLAN

Index Scan using rum_idx on email_delivery del (actual time=0.919..0.921 rows=1 loops=1) Index Cond: ((fts_all @@ '''andre'': & ''jose'':'::tsquery) AND (folder_id = '44965'::bigint)) Order By: (received_timestamp <=> '3000-01-01 00:00:00'::timestamp without time zone)

But notice, use explicite cast of numeric constant to type of folder_id. In future it could be fixed by adding set of comparison operations with different types on left and right side to RUM's opclasses.

Next example: EXPLAIN (analyze on, COSTS OFF) SELECT

 del.received_timestamp,
 del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
 del.folder_id

FROM email_delivery del WHERE del.folder_id in ( 44965::bigint, 2470520) ; QUERY PLAN

Bitmap Heap Scan on email_delivery del (actual time=0.208..0.210 rows=1 loops=1) Recheck Cond: (folder_id = ANY ('{44965,2470520}'::bigint[])) Heap Blocks: exact=1 -> Bitmap Index Scan on rum_idx (actual time=0.186..0.186 rows=1 loops=1) Index Cond: (folder_id = ANY ('{44965,2470520}'::bigint[]))

Work as expected, but it's easy to prove that actually it does two index scans in one bitmap. So, it works, but:

set enable_indexscan=off; EXPLAIN (analyze on, COSTS OFF) SELECT del.received_timestamp, del.received_timestamp <=> '3000-01-01' :: TIMESTAMP, del.folder_id FROM email_delivery del WHERE del.fts_all @@ to_tsquery('simple', 'andre:&jose:') AND del.folder_id in ( 44965::bigint, 2470520) ; WARNING: rumbeginscan 1:0 WARNING: rumrescan 1 QUERY PLAN

Bitmap Heap Scan on email_delivery del (actual time=0.108..0.110 rows=1 loops=1) Recheck Cond: (fts_all @@ '''andre'': & ''jose'':'::tsquery) Filter: (folder_id = ANY ('{44965,2470520}'::bigint[])) Heap Blocks: exact=1 -> Bitmap Index Scan on rum_idx (actual time=0.080..0.080 rows=1 loops=1) Index Cond: (fts_all @@ '''andre'': & ''jose'':'::tsquery)

Here I'm not sure why postgres does not push ANY clause to index scan but I suppose it thinks that it's too expensive to do two index scans with fts clause.

Andreas Joseph Krogh wrote:

I'm not quite sure where to post questions about RUM so I've posted another question to -general: http://www.postgresql-archive.org/How-to-include-BIGINT-column-in-RUM-index-sorted-by-timestamp-td5961689.html Is posting to -general the "correct" way to ask questions about RUM? (creating new issues here on github seems strange)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/postgrespro/rum/issues/15#issuecomment-301513930, or mute the thread https://github.com/notifications/unsubscribe-auth/AAz9vN0sMZQvmyOiSYzTOx_FMjblegE3ks5r6HGQgaJpZM4MxyhH.

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

andreak commented 7 years ago

Hi Feodor.

Thanks for the hint about the extra casting, now it works at least with equals on folder_id, as you said:

EXPLAIN ANALYZE
SELECT
    del.entity_id,
    del.received_timestamp,
    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
    del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
      AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                    QUERY PLAN                                                                    │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=84.00..124.97 rows=10 width=32) (actual time=482.219..482.235 rows=10 loops=1)                                                      │
│   ->  Index Scan using rum_idx on origo_email_delivery del  (cost=84.00..313.40 rows=56 width=32) (actual time=482.218..482.232 rows=10 loops=1) │
│         Index Cond: ((fts_all @@ '''andre'':* & ''jose'':*'::tsquery) AND (folder_id = '2470520'::bigint))                                       │
│         Order By: (received_timestamp <=> '3000-01-01 00:00:00'::timestamp without time zone)                                                    │
│ Planning time: 0.213 ms                                                                                                                          │
│ Execution time: 485.934 ms                                                                                                                       │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Sadly, it doesn't seem to be faster than GIN with Sort-step.

create index origo_email_delivery_fts_all_folder_idx ON origo_email_delivery using gin (fts_all, folder_id)

EXPLAIN ANALYZE
SELECT
    del.entity_id,
    del.received_timestamp,
    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
    del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
      AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp DESC
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                         QUERY PLAN                                                                          │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Sort  (cost=307.89..308.03 rows=56 width=32) (actual time=90.567..90.672 rows=2958 loops=1)                                                                 │
│   Sort Key: received_timestamp DESC                                                                                                                         │
│   Sort Method: quicksort  Memory: 328kB                                                                                                                     │
│   ->  Bitmap Heap Scan on origo_email_delivery del  (cost=84.58..306.27 rows=56 width=32) (actual time=86.467..90.067 rows=2958 loops=1)                    │
│         Recheck Cond: ((fts_all @@ '''andre'':* & ''jose'':*'::tsquery) AND (folder_id = '2470520'::bigint))                                                │
│         Heap Blocks: exact=2750                                                                                                                             │
│         ->  Bitmap Index Scan on origo_email_delivery_fts_all_folder_idx  (cost=0.00..84.56 rows=56 width=0) (actual time=86.193..86.193 rows=2958 loops=1) │
│               Index Cond: ((fts_all @@ '''andre'':* & ''jose'':*'::tsquery) AND (folder_id = '2470520'::bigint))                                            │
│ Planning time: 0.409 ms                                                                                                                                     │
│ Execution time: 90.791 ms                                                                                                                                   │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Any comments on why GIN is faster?

feodor commented 7 years ago

EXPLAIN ANALYZE SELECT del.entity_id, del.received_timestamp, del.received_timestamp <=> '3000-01-01' :: TIMESTAMP, del.folder_id FROM origo_email_delivery del WHERE del.fts_all @@ to_tsquery('simple', 'andre:&jose:') AND del.folder_id = 2470520::BIGINT ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP LIMIT 10

Sadly, it doesn't seem to be faster than GIN with Sort-step.

I haven't time to deep check today, but, seems, it's because of prefix search in tsquery. Pls, compare without it. When you use prefix search in tsquery, RUM should combine results from several lists (andre, andres, andreas etc) and it could not merge results in correct order. Prefix search also should remove duplicates from result: document could contain more than one matching word. So, RUM actually does sort internally and then removes duplicates. GIN could be faster in this case because bitmap is for effective way for removing than sort and unique.

Will look closer ASAP.

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

andreak commented 7 years ago

Hm, this query (without prefix search) runs forever (using pg-10-master as of 2df537e43fdc432cccbe64de166ac97363cbca3c and RUM as of 9ac931a306e6422aebde141a141583bec1d1828e):

EXPLAIN ANALYZE
SELECT
    del.entity_id,
    del.received_timestamp,
    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
    del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andreas&joseph')
      AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10

Taking 100% CPU. Trying to cancel with CTRL+C fails.

How shuold I go about debugging this?

feodor commented 7 years ago

Andreas Joseph Krogh wrote:

Hm, this query (without prefix search) runs forever (using pg-10-master as of Huh, surprise... How shuold I go about debugging this?

pls, provide several backtraces with gdb

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

andreak commented 7 years ago

Does this help?

Attaching to process 30544
Reading symbols from /home/andreak/programs/postgresql-master/bin/postgres...done.
...
0x0000556d1bfce292 in check_stack_depth () at postgres.c:3103
3103    {
(gdb) bt
#0  0x0000556d1bfce292 in check_stack_depth () at postgres.c:3103
#1  0x0000556d1c0a8a59 in TS_execute (curitem=<optimized out>, arg=0x7ffec5d0f180, flags=2, chkcond=0x7f5595641e30 <pre_checkcondition_rum>) at tsvector_op.c:1819
#2  0x00007f559564374d in rum_tsquery_pre_consistent (fcinfo=<optimized out>) at src/rum_ts_utils.c:190
#3  0x0000556d1c0d85c3 in FunctionCall8Coll (flinfo=<optimized out>, collation=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, arg3=<optimized out>, arg4=<optimized out>, arg5=93927121658784, arg6=140732217226768, arg7=93927121658728, arg8=93927121658912)
    at fmgr.c:1242
#4  0x00007f559564f4bc in preConsistentCheck (so=<optimized out>) at src/rumget.c:1755
#5  scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1858
#6  scanGetItem (advancePast=advancePast@entry=0x556d1cb8a298, item=item@entry=0x556d1cb8a298, recheck=recheck@entry=0x7ffec5d0f6b6 "", scan=<optimized out>) at src/rumget.c:2017
#7  0x00007f559564fb1b in rumgettuple (scan=0x556d1cb83198, direction=ForwardScanDirection) at src/rumget.c:2260
#8  0x0000556d1bd6ff8e in index_getnext_tid (scan=0x556d1cb83198, direction=<optimized out>) at indexam.c:541
#9  0x0000556d1bd70183 in index_getnext (scan=scan@entry=0x556d1cb83198, direction=direction@entry=ForwardScanDirection) at indexam.c:679
#10 0x0000556d1bebc693 in IndexNextWithReorder (node=node@entry=0x556d1cb803f8) at nodeIndexscan.c:269
#11 0x0000556d1beac9e2 in ExecScanFetch (recheckMtd=0x556d1bebc1d0 <IndexRecheck>, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, node=0x556d1cb803f8) at execScan.c:95
#12 ExecScan (node=0x556d1cb803f8, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, recheckMtd=0x556d1bebc1d0 <IndexRecheck>) at execScan.c:162
#13 0x0000556d1beab858 in ExecProcNode (node=node@entry=0x556d1cb803f8) at execProcnode.c:455
#14 0x0000556d1bebf000 in ExecLimit (node=node@entry=0x556d1cb7fc88) at nodeLimit.c:91
#15 0x0000556d1beab678 in ExecProcNode (node=node@entry=0x556d1cb7fc88) at execProcnode.c:571
#16 0x0000556d1bea6ea6 in ExecutePlan (execute_once=<optimized out>, dest=0x556d1cb6dd90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x556d1cb7fc88, estate=0x556d1cb7fa78)
    at execMain.c:1706
#17 standard_ExecutorRun (queryDesc=0x556d1cb7b3e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#18 0x0000556d1bfd245c in PortalRunSelect (portal=portal@entry=0x556d1ca854f8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x556d1cb6dd90) at pquery.c:933
#19 0x0000556d1bfd3896 in PortalRun (portal=portal@entry=0x556d1ca854f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x556d1cb6dd90, altdest=altdest@entry=0x556d1cb6dd90, 
    completionTag=0x7ffec5d0faf0 "") at pquery.c:774
#20 0x0000556d1bfcf83a in exec_simple_query (
    query_string=0x556d1caebef8 "SELECT\n    del.entity_id,\n    del.received_timestamp,\n    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,\n    del.folder_id\nFROM origo_email_delivery del\nWHERE del.fts_all @@ to_tsquery('simple'"...) at postgres.c:1105
#21 0x0000556d1bfd1577 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x556d1ca936c0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4075
#22 0x0000556d1bd27e2f in BackendRun (port=0x556d1ca896b0) at postmaster.c:4334
#23 BackendStartup (port=0x556d1ca896b0) at postmaster.c:4006
#24 ServerLoop () at postmaster.c:1743
#25 0x0000556d1bf65362 in PostmasterMain (argc=3, argv=0x556d1ca65b10) at postmaster.c:1351
#26 0x0000556d1bd28e27 in main (argc=3, argv=0x556d1ca65b10) at main.c:228
(gdb) quit

...wait a bit...

Attaching to process 30544
Reading symbols from /home/andreak/programs/postgresql-master/bin/postgres...done.
...
pre_checkcondition_rum (checkval=0x7ffec5d0f180, val=<optimized out>, data=<optimized out>) at src/rum_ts_utils.c:163
163     }
(gdb) bt
#0  pre_checkcondition_rum (checkval=0x7ffec5d0f180, val=<optimized out>, data=<optimized out>) at src/rum_ts_utils.c:163
#1  0x0000556d1c0a8af8 in TS_execute (curitem=<optimized out>, arg=0x7ffec5d0f180, flags=2, chkcond=0x7f5595641e30 <pre_checkcondition_rum>) at tsvector_op.c:1834
#2  0x00007f559564374d in rum_tsquery_pre_consistent (fcinfo=<optimized out>) at src/rum_ts_utils.c:190
#3  0x0000556d1c0d85c3 in FunctionCall8Coll (flinfo=<optimized out>, collation=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, arg3=<optimized out>, arg4=<optimized out>, arg5=93927121658784, arg6=140732217226768, arg7=93927121658728, arg8=93927121658912)
    at fmgr.c:1242
#4  0x00007f559564f4bc in preConsistentCheck (so=<optimized out>) at src/rumget.c:1755
#5  scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1858
#6  scanGetItem (advancePast=advancePast@entry=0x556d1cb8a298, item=item@entry=0x556d1cb8a298, recheck=recheck@entry=0x7ffec5d0f6b6 "", scan=<optimized out>) at src/rumget.c:2017
#7  0x00007f559564fb1b in rumgettuple (scan=0x556d1cb83198, direction=ForwardScanDirection) at src/rumget.c:2260
#8  0x0000556d1bd6ff8e in index_getnext_tid (scan=0x556d1cb83198, direction=<optimized out>) at indexam.c:541
#9  0x0000556d1bd70183 in index_getnext (scan=scan@entry=0x556d1cb83198, direction=direction@entry=ForwardScanDirection) at indexam.c:679
#10 0x0000556d1bebc693 in IndexNextWithReorder (node=node@entry=0x556d1cb803f8) at nodeIndexscan.c:269
#11 0x0000556d1beac9e2 in ExecScanFetch (recheckMtd=0x556d1bebc1d0 <IndexRecheck>, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, node=0x556d1cb803f8) at execScan.c:95
#12 ExecScan (node=0x556d1cb803f8, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, recheckMtd=0x556d1bebc1d0 <IndexRecheck>) at execScan.c:162
#13 0x0000556d1beab858 in ExecProcNode (node=node@entry=0x556d1cb803f8) at execProcnode.c:455
#14 0x0000556d1bebf000 in ExecLimit (node=node@entry=0x556d1cb7fc88) at nodeLimit.c:91
#15 0x0000556d1beab678 in ExecProcNode (node=node@entry=0x556d1cb7fc88) at execProcnode.c:571
#16 0x0000556d1bea6ea6 in ExecutePlan (execute_once=<optimized out>, dest=0x556d1cb6dd90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x556d1cb7fc88, estate=0x556d1cb7fa78)
    at execMain.c:1706
#17 standard_ExecutorRun (queryDesc=0x556d1cb7b3e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#18 0x0000556d1bfd245c in PortalRunSelect (portal=portal@entry=0x556d1ca854f8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x556d1cb6dd90) at pquery.c:933
#19 0x0000556d1bfd3896 in PortalRun (portal=portal@entry=0x556d1ca854f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x556d1cb6dd90, altdest=altdest@entry=0x556d1cb6dd90,
    completionTag=0x7ffec5d0faf0 "") at pquery.c:774
#20 0x0000556d1bfcf83a in exec_simple_query (
    query_string=0x556d1caebef8 "SELECT\n    del.entity_id,\n    del.received_timestamp,\n    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,\n    del.folder_id\nFROM origo_email_delivery del\nWHERE del.fts_all @@ to_tsquery('simple'"...) at postgres.c:1105
#21 0x0000556d1bfd1577 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x556d1ca936c0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4075
#22 0x0000556d1bd27e2f in BackendRun (port=0x556d1ca896b0) at postmaster.c:4334
#23 BackendStartup (port=0x556d1ca896b0) at postmaster.c:4006
#24 ServerLoop () at postmaster.c:1743
#25 0x0000556d1bf65362 in PostmasterMain (argc=3, argv=0x556d1ca65b10) at postmaster.c:1351
#26 0x0000556d1bd28e27 in main (argc=3, argv=0x556d1ca65b10) at main.c:228
(gdb)

...wait a bit more...

Attaching to process 30544
Reading symbols from /home/andreak/programs/postgresql-master/bin/postgres...done.
...
preConsistentCheck (so=<optimized out>) at src/rumget.c:1743
1743                    for (i = 0; i < key->nentries; i++)
(gdb) bt
#0  preConsistentCheck (so=<optimized out>) at src/rumget.c:1743
#1  scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1858
#2  scanGetItem (advancePast=advancePast@entry=0x556d1cb8a298, item=item@entry=0x556d1cb8a298, recheck=recheck@entry=0x7ffec5d0f6b6 "", scan=<optimized out>) at src/rumget.c:2017
#3  0x00007f559564fb1b in rumgettuple (scan=0x556d1cb83198, direction=ForwardScanDirection) at src/rumget.c:2260
#4  0x0000556d1bd6ff8e in index_getnext_tid (scan=0x556d1cb83198, direction=<optimized out>) at indexam.c:541
#5  0x0000556d1bd70183 in index_getnext (scan=scan@entry=0x556d1cb83198, direction=direction@entry=ForwardScanDirection) at indexam.c:679
#6  0x0000556d1bebc693 in IndexNextWithReorder (node=node@entry=0x556d1cb803f8) at nodeIndexscan.c:269
#7  0x0000556d1beac9e2 in ExecScanFetch (recheckMtd=0x556d1bebc1d0 <IndexRecheck>, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, node=0x556d1cb803f8) at execScan.c:95
#8  ExecScan (node=0x556d1cb803f8, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, recheckMtd=0x556d1bebc1d0 <IndexRecheck>) at execScan.c:162
#9  0x0000556d1beab858 in ExecProcNode (node=node@entry=0x556d1cb803f8) at execProcnode.c:455
#10 0x0000556d1bebf000 in ExecLimit (node=node@entry=0x556d1cb7fc88) at nodeLimit.c:91
#11 0x0000556d1beab678 in ExecProcNode (node=node@entry=0x556d1cb7fc88) at execProcnode.c:571
#12 0x0000556d1bea6ea6 in ExecutePlan (execute_once=<optimized out>, dest=0x556d1cb6dd90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x556d1cb7fc88, estate=0x556d1cb7fa78)
    at execMain.c:1706
#13 standard_ExecutorRun (queryDesc=0x556d1cb7b3e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#14 0x0000556d1bfd245c in PortalRunSelect (portal=portal@entry=0x556d1ca854f8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x556d1cb6dd90) at pquery.c:933
#15 0x0000556d1bfd3896 in PortalRun (portal=portal@entry=0x556d1ca854f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x556d1cb6dd90, altdest=altdest@entry=0x556d1cb6dd90,
    completionTag=0x7ffec5d0faf0 "") at pquery.c:774
#16 0x0000556d1bfcf83a in exec_simple_query (
    query_string=0x556d1caebef8 "SELECT\n    del.entity_id,\n    del.received_timestamp,\n    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,\n    del.folder_id\nFROM origo_email_delivery del\nWHERE del.fts_all @@ to_tsquery('simple'"...) at postgres.c:1105
#17 0x0000556d1bfd1577 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x556d1ca936c0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4075
#18 0x0000556d1bd27e2f in BackendRun (port=0x556d1ca896b0) at postmaster.c:4334
#19 BackendStartup (port=0x556d1ca896b0) at postmaster.c:4006
#20 ServerLoop () at postmaster.c:1743
#21 0x0000556d1bf65362 in PostmasterMain (argc=3, argv=0x556d1ca65b10) at postmaster.c:1351
#22 0x0000556d1bd28e27 in main (argc=3, argv=0x556d1ca65b10) at main.c:228

...wait a bit more...

Attaching to process 30544
Reading symbols from /home/andreak/programs/postgresql-master/bin/postgres...done.
...
scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1855
1855                                    for (; j < i; j++)
(gdb) bt
#0  scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1855
#1  scanGetItem (advancePast=advancePast@entry=0x556d1cb8a298, item=item@entry=0x556d1cb8a298, recheck=recheck@entry=0x7ffec5d0f6b6 "", scan=<optimized out>) at src/rumget.c:2017
#2  0x00007f559564fb1b in rumgettuple (scan=0x556d1cb83198, direction=ForwardScanDirection) at src/rumget.c:2260
#3  0x0000556d1bd6ff8e in index_getnext_tid (scan=0x556d1cb83198, direction=<optimized out>) at indexam.c:541
#4  0x0000556d1bd70183 in index_getnext (scan=scan@entry=0x556d1cb83198, direction=direction@entry=ForwardScanDirection) at indexam.c:679
#5  0x0000556d1bebc693 in IndexNextWithReorder (node=node@entry=0x556d1cb803f8) at nodeIndexscan.c:269
#6  0x0000556d1beac9e2 in ExecScanFetch (recheckMtd=0x556d1bebc1d0 <IndexRecheck>, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, node=0x556d1cb803f8) at execScan.c:95
#7  ExecScan (node=0x556d1cb803f8, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, recheckMtd=0x556d1bebc1d0 <IndexRecheck>) at execScan.c:162
#8  0x0000556d1beab858 in ExecProcNode (node=node@entry=0x556d1cb803f8) at execProcnode.c:455
#9  0x0000556d1bebf000 in ExecLimit (node=node@entry=0x556d1cb7fc88) at nodeLimit.c:91
#10 0x0000556d1beab678 in ExecProcNode (node=node@entry=0x556d1cb7fc88) at execProcnode.c:571
#11 0x0000556d1bea6ea6 in ExecutePlan (execute_once=<optimized out>, dest=0x556d1cb6dd90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x556d1cb7fc88, estate=0x556d1cb7fa78)
    at execMain.c:1706
#12 standard_ExecutorRun (queryDesc=0x556d1cb7b3e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#13 0x0000556d1bfd245c in PortalRunSelect (portal=portal@entry=0x556d1ca854f8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x556d1cb6dd90) at pquery.c:933
#14 0x0000556d1bfd3896 in PortalRun (portal=portal@entry=0x556d1ca854f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x556d1cb6dd90, altdest=altdest@entry=0x556d1cb6dd90,
    completionTag=0x7ffec5d0faf0 "") at pquery.c:774
#15 0x0000556d1bfcf83a in exec_simple_query (
    query_string=0x556d1caebef8 "SELECT\n    del.entity_id,\n    del.received_timestamp,\n    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,\n    del.folder_id\nFROM origo_email_delivery del\nWHERE del.fts_all @@ to_tsquery('simple'"...) at postgres.c:1105
#16 0x0000556d1bfd1577 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x556d1ca936c0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4075
#17 0x0000556d1bd27e2f in BackendRun (port=0x556d1ca896b0) at postmaster.c:4334
#18 BackendStartup (port=0x556d1ca896b0) at postmaster.c:4006
#19 ServerLoop () at postmaster.c:1743
#20 0x0000556d1bf65362 in PostmasterMain (argc=3, argv=0x556d1ca65b10) at postmaster.c:1351
#21 0x0000556d1bd28e27 in main (argc=3, argv=0x556d1ca65b10) at main.c:228
(gdb)

...wait a bit more...

Attaching to process 30544
Reading symbols from /home/andreak/programs/postgresql-master/bin/postgres...done.
...
0x0000556d1bfce33e in check_stack_depth () at postgres.c:3113
3113    }
(gdb) bt
#0  0x0000556d1bfce33e in check_stack_depth () at postgres.c:3113
#1  0x0000556d1c0a8a59 in TS_execute (curitem=<optimized out>, arg=0x7ffec5d0f180, flags=2, chkcond=0x7f5595641e30 <pre_checkcondition_rum>) at tsvector_op.c:1819
#2  0x00007f559564374d in rum_tsquery_pre_consistent (fcinfo=<optimized out>) at src/rum_ts_utils.c:190
#3  0x0000556d1c0d85c3 in FunctionCall8Coll (flinfo=<optimized out>, collation=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, arg3=<optimized out>, arg4=<optimized out>, arg5=93927121658784, arg6=140732217226768, arg7=93927121658728, arg8=93927121658912)
    at fmgr.c:1242
#4  0x00007f559564f4bc in preConsistentCheck (so=<optimized out>) at src/rumget.c:1755
#5  scanGetItemFast (scan=<optimized out>, advancePast=0x7ffec5d0f6b6, recheck=0x7ffec5d0f6b6 "", item=0x556d1cb8a298) at src/rumget.c:1858
#6  scanGetItem (advancePast=advancePast@entry=0x556d1cb8a298, item=item@entry=0x556d1cb8a298, recheck=recheck@entry=0x7ffec5d0f6b6 "", scan=<optimized out>) at src/rumget.c:2017
#7  0x00007f559564fb1b in rumgettuple (scan=0x556d1cb83198, direction=ForwardScanDirection) at src/rumget.c:2260
#8  0x0000556d1bd6ff8e in index_getnext_tid (scan=0x556d1cb83198, direction=<optimized out>) at indexam.c:541
#9  0x0000556d1bd70183 in index_getnext (scan=scan@entry=0x556d1cb83198, direction=direction@entry=ForwardScanDirection) at indexam.c:679
#10 0x0000556d1bebc693 in IndexNextWithReorder (node=node@entry=0x556d1cb803f8) at nodeIndexscan.c:269
#11 0x0000556d1beac9e2 in ExecScanFetch (recheckMtd=0x556d1bebc1d0 <IndexRecheck>, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, node=0x556d1cb803f8) at execScan.c:95
#12 ExecScan (node=0x556d1cb803f8, accessMtd=0x556d1bebc5e0 <IndexNextWithReorder>, recheckMtd=0x556d1bebc1d0 <IndexRecheck>) at execScan.c:162
#13 0x0000556d1beab858 in ExecProcNode (node=node@entry=0x556d1cb803f8) at execProcnode.c:455
#14 0x0000556d1bebf000 in ExecLimit (node=node@entry=0x556d1cb7fc88) at nodeLimit.c:91
#15 0x0000556d1beab678 in ExecProcNode (node=node@entry=0x556d1cb7fc88) at execProcnode.c:571
#16 0x0000556d1bea6ea6 in ExecutePlan (execute_once=<optimized out>, dest=0x556d1cb6dd90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x556d1cb7fc88, estate=0x556d1cb7fa78)
    at execMain.c:1706
#17 standard_ExecutorRun (queryDesc=0x556d1cb7b3e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#18 0x0000556d1bfd245c in PortalRunSelect (portal=portal@entry=0x556d1ca854f8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x556d1cb6dd90) at pquery.c:933
#19 0x0000556d1bfd3896 in PortalRun (portal=portal@entry=0x556d1ca854f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x556d1cb6dd90, altdest=altdest@entry=0x556d1cb6dd90,
    completionTag=0x7ffec5d0faf0 "") at pquery.c:774
#20 0x0000556d1bfcf83a in exec_simple_query (
    query_string=0x556d1caebef8 "SELECT\n    del.entity_id,\n    del.received_timestamp,\n    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,\n    del.folder_id\nFROM origo_email_delivery del\nWHERE del.fts_all @@ to_tsquery('simple'"...) at postgres.c:1105
#21 0x0000556d1bfd1577 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x556d1ca936c0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4075
#22 0x0000556d1bd27e2f in BackendRun (port=0x556d1ca896b0) at postmaster.c:4334
#23 BackendStartup (port=0x556d1ca896b0) at postmaster.c:4006
#24 ServerLoop () at postmaster.c:1743
#25 0x0000556d1bf65362 in PostmasterMain (argc=3, argv=0x556d1ca65b10) at postmaster.c:1351
#26 0x0000556d1bd28e27 in main (argc=3, argv=0x556d1ca65b10) at main.c:228
(gdb)

Hope this helps

za-arthur commented 7 years ago

Thank you for the information! Can you check your endless query again with new commit.

andreak commented 7 years ago

Works perfectly, thanks!

andreak commented 7 years ago

To answer Teodor's question; The query without prefix-search is 10 times faster: with prefix:

EXPLAIN ANALYZE
SELECT
    del.entity_id,
    del.received_timestamp,
    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
    del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
      AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                    QUERY PLAN                                                                    │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=84.00..124.98 rows=10 width=32) (actual time=433.761..433.776 rows=10 loops=1)                                                      │
│   ->  Index Scan using rum_idx on origo_email_delivery del  (cost=84.00..309.38 rows=55 width=32) (actual time=433.759..433.773 rows=10 loops=1) │
│         Index Cond: ((fts_all @@ '''andre'':* & ''jose'':*'::tsquery) AND (folder_id = '2470520'::bigint))                                       │
│         Order By: (received_timestamp <=> '3000-01-01 00:00:00'::timestamp without time zone)                                                    │
│ Planning time: 0.671 ms                                                                                                                          │
│ Execution time: 436.307 ms                                                                                                                       │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

without prefix:

EXPLAIN ANALYZE
SELECT
    del.entity_id,
    del.received_timestamp,
    del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
    del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andreas&joseph')
      AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=72.00..114.16 rows=10 width=32) (actual time=43.099..43.113 rows=10 loops=1)                                                      │
│   ->  Index Scan using rum_idx on origo_email_delivery del  (cost=72.00..160.53 rows=21 width=32) (actual time=43.097..43.110 rows=10 loops=1) │
│         Index Cond: ((fts_all @@ '''andreas'' & ''joseph'''::tsquery) AND (folder_id = '2470520'::bigint))                                     │
│         Order By: (received_timestamp <=> '3000-01-01 00:00:00'::timestamp without time zone)                                                  │
│ Planning time: 0.684 ms                                                                                                                        │
│ Execution time: 43.159 ms                                                                                                                      │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The question now of course is will it ever be possible, theoretically, for the prefix-variant to be faster, or is there some fundamental limit or logic-barrier preventing it from ever will be?

za-arthur commented 7 years ago

Great!

About your question. It is because of algorithm of scanning by prefix query. Maybe it could be improved. We need to investigate it.

andreak commented 7 years ago

Thanks.

I think this issue can be closed now, and any other issues/questions I might have can be asked in new issues?

za-arthur commented 7 years ago

You are right. It is better to create new issues. Close.

za-arthur commented 7 years ago

Actually, the initial issue from NikolayS doesn't fixed. Reopen.

NikolayS commented 7 years ago

So is it possible now to filter with @@ operator and ORDER BY some integer/int8 column and have only single index scan for that?

NikolayS commented 7 years ago

Thanks:)

za-arthur commented 7 years ago

Yes, it will be possible in soon :)

za-arthur commented 7 years ago

As we discussed at the recent Moscow meetup, for many cases it'd be very good to be able to use RUM to keep int/bigint information and speed up queries like

SELECT ... WHERE tsvector @@ ... ORDER BY price Similar request from the -general: http://www.postgresql-archive.org/RUM-index-and-support-for-storing-BIGINT-as-part-of-a-tsvector-timestamp-tt5938725.html#a5942457

@NikolayS , actually you can use the following query. If you created index in the following way:

create index on tab1 using rum (tsv rum_tsvector_addon_ops, price)
with (attach=price, to=tsv);

you can execute the following query:

select ... from tab1 where tsv @@ ...
order by price <=> 10000::money limit 10;

It will use index scan. Some distance operator (for example, <=>) should be used to use index scan. Otherwise, bitmap scan will be used.

andreak commented 7 years ago

Hi. While I very much appreciate that using a distance-operator in ORDER BY works, doing so, instead of ordenary ORDER BY <column|epr> [ASC|DESC], seems awkward as it requires a magic-value higher or lower than any other value. In my case I must use ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP, which uses 3000 as I know I don't have any timestamps in the year 3000 in my dataset.

Will this be re-worked in future releases so one can write ORDER BY del.timestamp DESC and having it use the index in the same manner as using a distance operator?

andreak commented 6 years ago

Isn't this issue actually fixed now, using rum_tsvector_addon_ops?

za-arthur commented 6 years ago

Yes, you can use rum_tsvector_addon_ops now. Thank you for noticing.

NikolayS commented 6 years ago

ok, thank you, but could you please elaborate how it's supposed to work?

[local]:5432 nikolay@test=# select version();
                                         version
------------------------------------------------------------------------------------------
 PostgreSQL 9.6.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)
[local]:5432 nikolay@test=# select * from pg_available_extensions where name = 'rum';
 name | default_version | installed_version |         comment
------+-----------------+-------------------+-------------------------
 rum  | 1.2             | 1.2               | RUM index access method
(1 row)

[local]:5432 nikolay@test=# create table rum(id int primary key, tsvector tsvector);
CREATE TABLE
Time: 140.727 ms
[local]:5432 nikolay@test=# insert into rum(id, tsvector) select i, to_tsvector('simple', 'test vector ' || i::text) from generate_series(1, 100000) _(i);
INSERT 0 100000
Time: 735.009 ms
[local]:5432 nikolay@test=# insert into rum(id, tsvector) select i, to_tsvector('simple', 'test cow ' || i::text) from generate_series(100001, 150000) _(i);
INSERT 0 50000
Time: 499.296 ms
[local]:5432 nikolay@test=# create index on rum using rum (tsvector rum_tsvector_addon_ops, id) with (attach=id, to=tsvector);
CREATE INDEX
Time: 1150.367 ms (00:01.150)
[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id desc nulls last limit 10;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=633.33..633.36 rows=10 width=48) (actual time=41.625..41.627 rows=10 loops=1)
   ->  Sort  (cost=633.33..635.21 rows=750 width=48) (actual time=41.623..41.624 rows=10 loops=1)
         Sort Key: id DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..617.12 rows=750 width=48) (actual time=14.443..28.066 rows=50000 loops=1)
               Index Cond: (tsvector @@ '''cow'''::tsquery)
 Planning time: 0.253 ms
 Execution time: 41.702 ms
(8 rows)

– I asked only for top 10 records, but rum idx scan takes all 50k records with word "cow" (rows=50000) and then sorting happens.

Without nulls last, it's the same (needed to drop PK since it preferred using it instead of RUM idx):

[local]:5432 nikolay@test=# alter table rum drop constraint rum_pkey;
ALTER TABLE
Time: 49.426 ms
[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id desc  limit 10;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3464.92..3464.94 rows=10 width=48) (actual time=40.989..40.993 rows=10 loops=1)
   ->  Sort  (cost=3464.92..3589.82 rows=49960 width=48) (actual time=40.987..40.988 rows=10 loops=1)
         Sort Key: id DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2385.30 rows=49960 width=48) (actual time=14.310..27.581 rows=50000 loops=1)
               Index Cond: (tsvector @@ '''cow'''::tsquery)
 Planning time: 0.165 ms
 Execution time: 41.044 ms
(8 rows)

Should it be so? Is it really different from what it was before?

NikolayS commented 6 years ago

Oh. Understood :-)

[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 0 limit 10;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3.00..3.53 rows=10 width=56) (actual time=18.212..18.221 rows=10 loops=1)
   ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=18.210..18.218 rows=10 loops=1)
         Index Cond: (tsvector @@ '''cow'''::tsquery)
         Order By: (id <=> 0)
 Planning time: 0.179 ms
 Execution time: 18.315 ms

-- desc:

[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 10^30 limit 10;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3714.72..3714.74 rows=10 width=56) (actual time=53.081..53.084 rows=10 loops=1)
   ->  Sort  (cost=3714.72..3839.62 rows=49960 width=56) (actual time=53.080..53.082 rows=10 loops=1)
         Sort Key: (((id)::double precision <=> '1e+30'::double precision))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=14.353..39.079 rows=50000 loops=1)
               Index Cond: (tsvector @@ '''cow'''::tsquery)
 Planning time: 0.155 ms
 Execution time: 53.150 ms
(8 rows)

-- looks good!

za-arthur commented 6 years ago

Unfortunately, ordering within index can be performed only using some value to compare. Please try out the query:

$ select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 1000000 limit 10;
$ select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 0 limit 10;

Depending on the compared value results a different.

za-arthur commented 6 years ago

Oh. Understood :-)

Yep :)

NikolayS commented 6 years ago

One more question. In my tests, 'asc' ordering (id <=> 0) works significantly faster than 'desc' (id <=> 1000000 or id <=> 10^10 or id <=> 10^30 between which I didn't notice any difference). It's ~15-20ms vs 40-50ms.

Any ways to improve it somehow? (using different index definition or something)

za-arthur commented 6 years ago

In your second query 10^30 is not int but double:

Sort Key: (((id)::double precision <=> '1e+30'::double precision))

Use maximum integer value instead.

za-arthur commented 6 years ago

I suppose maximum value is 2147483647

NikolayS commented 6 years ago

Okay, used 10^5, but it didn't help to eliminate the difference between "asc" and "desc":

[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 10^5 limit 10;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3714.72..3714.74 rows=10 width=56) (actual time=53.174..53.178 rows=10 loops=1)
   ->  Sort  (cost=3714.72..3839.62 rows=49960 width=56) (actual time=53.173..53.175 rows=10 loops=1)
         Sort Key: (((id)::double precision <=> '100000'::double precision))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=14.427..39.331 rows=50000 loops=1)
               Index Cond: (tsvector @@ '''cow'''::tsquery)
 Planning time: 0.153 ms
 Execution time: 53.231 ms
(8 rows)

Time: 53.958 ms
[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 0 limit 10;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3.00..3.53 rows=10 width=56) (actual time=18.474..18.482 rows=10 loops=1)
   ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=18.473..18.479 rows=10 loops=1)
         Index Cond: (tsvector @@ '''cow'''::tsquery)
         Order By: (id <=> 0)
 Planning time: 0.139 ms
 Execution time: 18.536 ms
(6 rows)

How can I get "asc" speed for "desc" (say, if I need desc in most cases)?

NikolayS commented 6 years ago

oh, when I do order by id <=> 10^5 it takes 50000 records again. So basically, "inline sorting" doesn't work by some reason.

NikolayS commented 6 years ago

Oh. This is my usual mistake – I shouldn't use 10^5 instead of constant 100000. But:

[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 1000000 limit 10;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3.00..3.53 rows=10 width=56) (actual time=31.367..31.375 rows=10 loops=1)
   ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=31.365..31.371 rows=10 loops=1)
         Index Cond: (tsvector @@ '''cow'''::tsquery)
         Order By: (id <=> 1000000)
 Planning time: 0.133 ms
 Execution time: 31.427 ms
(6 rows)

Time: 32.084 ms
[local]:5432 nikolay@test=# explain analyze select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 0 limit 10;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3.00..3.53 rows=10 width=56) (actual time=18.329..18.338 rows=10 loops=1)
   ->  Index Scan using rum_tsvector_id_idx on rum  (cost=3.00..2635.10 rows=49960 width=56) (actual time=18.328..18.335 rows=10 loops=1)
         Index Cond: (tsvector @@ '''cow'''::tsquery)
         Order By: (id <=> 0)
 Planning time: 0.139 ms
 Execution time: 18.393 ms
(6 rows)

-- "asc" is almost 2x faster than "desc" (18ms vs 31ms). Is it possible to have the faster "desc"?

za-arthur commented 6 years ago

I understood what's happening. The exponentiation operator returns result in double or numeric type:

$ \do ^
                                      List of operators
   Schema   | Name |  Left arg type   |  Right arg type  |   Result type    |  Description   
------------+------+------------------+------------------+------------------+----------------
 pg_catalog | ^    | double precision | double precision | double precision | exponentiation
 pg_catalog | ^    | numeric          | numeric          | numeric          | exponentiation

So you need to cast the result (10^5)::int:

$ explain (analyze) select * from rum where tsvector @@ 'cow'::tsquery order by id <=> (10^6)::int limit 10;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=12.00..13.43 rows=10 width=56) (actual time=21.027..21.031 rows=10 loops=1)
   ->  Index Scan using rum_tsvector_id_idx on rum  (cost=12.00..7171.81 rows=50125 width=56) (actual time=21.024..21.028 rows=10 loops=1)
         Index Cond: (tsvector @@ '''cow'''::tsquery)
         Order By: (id <=> 1000000)
 Planning time: 0.181 ms
 Execution time: 21.274 ms

About desc speed I can't tell anything useful for now. It seems that RUM works on same size of data, but the query needs more operations to perform.

andreak commented 6 years ago

Yes;

select pg_typeof(10^5);
┌──────────────────┐
│    pg_typeof     │
├──────────────────┤
│ double precision │
└──────────────────┘
(1 row)
NikolayS commented 6 years ago

Yeah, yeah, I always do that mistake. For years. Никогда такого не было и вот опять.

But what is more interesting/relevant here:

andreak commented 6 years ago

Good questions, but don't hijack this issue. Start new separate issues for those.

NikolayS commented 6 years ago

No-no, this is my issue and those questions are both belong to it for sure :))

Repo owners can move them to other places if they think it's better.

andreak commented 6 years ago

I think nobody likes never-ending issues due to new stuff creaping in. In the issue-list summary it seems like this issue is still an issue while in fact it is indeed closed/implemented. Being good citizens we should help keeping the issue-list clean:-) Let's give the repo owners more time to code by helping them out?

NikolayS commented 6 years ago

@andreak it's all offtopic and provocative.

Having all information in one single place regarding on particular topic can be very helpful and might save time. It's up to the repo owner how to manage issues – it depends on concrete process inside the particular dev team. Don't pretend that your practices are the only possible and right choice, it's not a good citizen's behavior.

Consider this matter as closed, leave these questions to repo owners and let's focus on tech stuff.

andreak commented 6 years ago

My point was; let's keep this issue closed. Glad we agree on that:-)

za-arthur commented 6 years ago

I created two issues: #35 and #36. Though I'm not sure that the first is an issue. You can continue the conversation there :).