smnorris / fwapg

PostgreSQL tools for working with British Columbia's Freshwater Atlas
https://smnorris.github.io/fwapg
MIT License

add record count meta table #56

Open · smnorris opened this issue 3 years ago

smnorris commented 3 years ago

For pg_featureserv:

provide a function taking a table name as a parameter and returning the count of records in the table

https://github.com/poissonconsulting/fwapgr/issues/47

smnorris commented 3 years ago

Erwin B shows me how to write the function, but pg_fs doesn't seem to like the regclass type: https://www.hillcrestgeo.ca/fwapg/functions/record_count/items.json

I don't want to work around the security features of the above function, and a record count is only needed for a few tables whose counts generally do not change... a meta table seems easier to implement for now?
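For reference, a minimal sketch of what the function looks like (the body here is an assumption, not necessarily the exact code behind the URL above); the regclass parameter is the part pg_fs seems to reject:

CREATE OR REPLACE FUNCTION postgisftw.record_count(tbl regclass)
RETURNS TABLE (record_count bigint)
LANGUAGE plpgsql AS
$$
BEGIN
  -- regclass validates the identifier and quotes it safely when cast
  -- to text, which is the security feature referred to above
  RETURN QUERY EXECUTE format('SELECT count(*) FROM %s', tbl);
END;
$$;

Swapping regclass for a plain text parameter would presumably make pg_fs happy, but then the identifier has to be validated and quoted by hand - the workaround I'd rather avoid.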

smnorris commented 3 years ago

Also, getting the record count takes forever - much better to have this cached somewhere:

$ time psql -c "select postgisftw.record_count('whse_basemapping.fwa_stream_networks_sp');"
 record_count
--------------
      4907325
(1 row)
real    0m50.719s
user    0m0.065s
sys 0m0.038s

(pg13, ubuntu20.04, 4CPU, 8G ram)

dr-jts commented 3 years ago

> Also, getting the record count takes forever - much better to have this cached somewhere:

@smnorris Is it faster to get the feature count via Python and WFS? If so, I wonder why it is so much faster?

smnorris commented 3 years ago

I don't have WFS set up on my db - my comment references getting a WFS feature count from DataBC, which I use for paging their collections. That request is pretty much instant, but it is (presumably) a GeoServer / Oracle back end.

smnorris commented 3 years ago

Hmm, I'm not sure why counting the records is so slow on that db; I may want to adjust something in the config. If I make a similar request on my local db it takes <1s.

dr-jts commented 3 years ago

Could it be something to do with the indexing, or the table statistics? Or perhaps parallel workers allocated? Or how warm the DB instance is?

Doing an EXPLAIN ANALYZE might indicate something.

smnorris commented 3 years ago

Maybe? The indexes should be the same, and autovacuum is on in both dbs. The query to the Ubuntu db does not speed up on repeated calls.

Ubuntu 20.04 / PG 13.4:

hillcrestgeo=> explain analyze select count(*) from whse_basemapping.fwa_stream_networks_sp;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=985836.25..985836.26 rows=1 width=8) (actual time=45619.851..45643.503 rows=1 loops=1)
   ->  Gather  (cost=985836.04..985836.25 rows=2 width=8) (actual time=45619.642..45643.488 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=984836.04..984836.05 rows=1 width=8) (actual time=45594.097..45594.099 rows=1 loops=3)
               ->  Parallel Seq Scan on fwa_stream_networks_sp  (cost=0.00..979724.23 rows=2044723 width=0) (actual time=0.134..45281.439 rows=1635775 loops=3)
 Planning Time: 0.141 ms
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.485 ms, Inlining 183.040 ms, Optimization 30.949 ms, Emission 30.149 ms, Total 245.624 ms
 Execution Time: 45644.363 ms
(12 rows)

macOS / PG 13.3 (Homebrew):

postgis=# explain analyze select count(*) from whse_basemapping.fwa_stream_networks_sp;
                                                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=67819.06..67819.07 rows=1 width=8) (actual time=689.076..692.822 rows=1 loops=1)
   ->  Gather  (cost=67818.84..67819.05 rows=2 width=8) (actual time=688.895..692.813 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=66818.84..66818.85 rows=1 width=8) (actual time=621.564..621.565 rows=1 loops=3)
               ->  Parallel Index Only Scan using fwa_stream_networks_sp_edge_type_idx1 on fwa_stream_networks_sp  (cost=0.43..61707.05 rows=2044719 width=0) (actual time=0.075..337.181 rows=1635775 loops=3)
                     Heap Fetches: 177
 Planning Time: 1.647 ms
 Execution Time: 692.853 ms
(9 rows)

But this is getting off into the weeds. For paging, an estimate (maybe rounded up to the request limit) might be adequate?

dr-jts commented 3 years ago

Looks like the macOS query is using an index-only scan and the Ubuntu instance is not. If the indexes are indeed defined the same way, I'm not sure why that would be. Something about the table stats perhaps?
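For example, index-only scans depend on fresh statistics and an up-to-date visibility map, so refreshing both would be a cheap first thing to try (table name from the plans above):

-- update statistics and the visibility map for the slow table
VACUUM (ANALYZE) whse_basemapping.fwa_stream_networks_sp;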

Agreed that finding a faster way to get an approximate count would be nice. Perhaps using the pg_class or pg_stat_user_tables catalogs, as per this SO answer. But that doesn't handle queries with filter conditions, or (probably) views, so the brute-force method will still be needed as a fallback. It might also be nice to let the application choose when to populate this value, to avoid the overhead when it isn't needed (via some sort of API extension).
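That approach boils down to reading the planner's stored row estimate - instant, but only as fresh as the last VACUUM / ANALYZE:

-- approximate count from planner statistics
SELECT reltuples::bigint AS estimated_count
FROM pg_class
WHERE oid = 'whse_basemapping.fwa_stream_networks_sp'::regclass;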

smnorris commented 3 years ago

Well, they should be defined the same way, but that doesn't mean they are. The index being used has a different name in one db, but both are basic btree indexes and were named by the system. I might try rebuilding both at some point.
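On PG 12+ the rebuild doesn't have to block reads or writes, e.g. for the index named in the macOS plan above:

-- rebuild without taking exclusive locks on the table
REINDEX INDEX CONCURRENTLY whse_basemapping.fwa_stream_networks_sp_edge_type_idx1;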

To solve the fwapgr requirements in the short term, I could just create an FWA meta table and manually populate it. The data rarely changes... and I don't want to get into the habit of serving huge collections either. Perhaps pg_fs is better but paging through large DataBC WFS requests gets unreliable after a few 100k.
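Roughly what I have in mind (table and column names here are placeholders, not part of fwapg), refreshed manually whenever the FWA data is reloaded:

CREATE TABLE whse_basemapping.fwa_record_counts (
  table_name text PRIMARY KEY,
  n_records  bigint NOT NULL,
  counted_at timestamptz NOT NULL DEFAULT now()
);

-- run the brute-force count once, then serve the cached value
INSERT INTO whse_basemapping.fwa_record_counts (table_name, n_records)
SELECT 'whse_basemapping.fwa_stream_networks_sp', count(*)
FROM whse_basemapping.fwa_stream_networks_sp
ON CONFLICT (table_name)
DO UPDATE SET n_records = EXCLUDED.n_records, counted_at = now();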

dr-jts commented 3 years ago

Would it be helpful to have some way in the request API to indicate that a count of the total number of records in the query response should be calculated and returned? E.g. perhaps a (non-standard) query parameter totalCount which would return the query result size in some way?

And how should the query result size be returned? As a special JSON document? Or as the standard numberMatched response property? (If the latter, then the client might want to make an initial call with limit=0 to get just the numberMatched value, with no overhead of data transfer).
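If numberMatched were populated, that initial call might look like this (hypothetical request and response, not current pg_fs behaviour):

GET /collections/whse_basemapping.fwa_stream_networks_sp/items.json?limit=0

{"type": "FeatureCollection", "numberMatched": 4907325, "numberReturned": 0, "features": []}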

dr-jts commented 3 years ago

> Perhaps pg_fs is better but paging through large DataBC WFS requests gets unreliable after a few 100k.

Unreliable how? Missing records? The unreliability might be a function of how Oracle implements offsetting. Postgres may or may not be better in this respect.

Hard to see why humans would want to wade through a large number of pages, but I guess automated systems might do data extracts that way, and thus require full reliability?

I'm starting to think about implementing a chunk of the CQL filter spec, so maybe that will provide a better alternative to reduce the size of query results?

smnorris commented 3 years ago

WFS reliability: paging through a large set can bail like this: https://github.com/smnorris/bcdata/issues/75. Downloading the entire streams or watersheds table is a bad idea, but something like DRA is usually ok. It seems to depend on data density, not just the number of records.

I have not tried to debug it - the issue might just be that my script doesn't handle network interruptions. Files are generally available for these larger collections, and the files are far faster to download and load to postgres than requesting GeoJSON feature by feature.

API questions: those both sound reasonable from my somewhat uninformed viewpoint. I haven't been following the latest OGC / WFS standards and pygeoapi closely, but working with a standard numberMatched that way seems fine to me. My feature server needs are minimal; @joethorley may have more thoughts.