gost / server

GOST - Go implementation of OGC SensorThings API
MIT License
61 stars 19 forks source link

Postgress Database performance improvement #152

Closed dschowta closed 5 years ago

dschowta commented 6 years ago

Hi GOST team, I am running GOST with postgresql. The observations in my deployment include camera data as well. As the number of observations grows, it is getting slower, especially when I try to get observations related to a datastream. The delay is approximately 6 to 7 seconds.

Therefore, I need a suggestion, how can I improve the performance of GOST queries for observations? Please let me know if you want more details

Best, Shreekantha

bertt commented 6 years ago

Hi, can you send more details? Like software version, amount of records in datastreams/observation, observation type used and the slow sample query.

dschowta commented 6 years ago

Total number of observations: 6000000+ the number of Datastreams: 77 Number of observations in a queried datastream: 53000+ Version: Gost V 0.6, gost-db v 0.5 (both latest from Docker hub) The query I used are: GET http://<url>/v1.0/Datastreams(46)/observations and GET http://<url>/v1.0/observations

average size of observation in Datastrem no. 46: 117KB (Mostly multimedia data) I used docker compose from the GOST. With Changed GOST_SERVER_MAX_ENTITIES to 100

tebben commented 6 years ago

Having the docker latest version does not ensure you have the latest version, this is an automated build triggered on a Github commit. Seems like a fix for faster queries was commited on 20 march of 2018 https://github.com/gost/gost-db/commit/1b51d975bd99c7b85e7a047ba26a9677f2877aaf and https://github.com/gost/server/commit/1a229c7beefb38e85f054af1bd4e9e96df8629ea Maybe you can check the docker image creation date ($docker images)

I created a 60 GB database to try to recreate your issue.

Testing browser and raw query times with 100 entities

browser time raw db query request
574 ms 202 ms /Datastreams(1)/Observations
590 ms 223 ms /Datastreams?$filter=id eq 1&$expand=Observations
20 ms 13 ms /Datastreams(2)/Observations
18 ms 13 ms /Datastreams?$filter=id eq 2&$expand=Observations
---- ms --- ms /Observations

Browser times are without data transfer time and only wait time until a response is send back, raw db query is without result transfer to gost-server, only the time it takes the database to fully execute the query.

I didn't experience your 6-7 seconds delay for the given request http:///v1.0/Datastreams(1)/Observations, query time on the +/- 130kb observations datastream is around 200ms if I include transfer from database to gost-server, parsing into entities en returning it is around 550ms. When you add data transfer to the browser it will result in around 3.5 seconds total. I think the queries are fast enough and I don't see any big slowdown on bigger databases. If you do have the version from the end of march we need some more information such as biggest observation in database and your raw query times.

Also requesting top 100 observations on this database (/v1.0/Observations) gave me a timeout, looking into this next week.

dschowta commented 6 years ago

Thank you. Requesting top 100 observations on this database (/v1.0/Observations) gave a timeout for me too. Here are the information about my docker images: gost-db image created on : 2018-03-20 gost image created on : 2018-05-17

I agree to your point that it was network and browser which took more time because when I made a GET query using curl I could see a 2.5 seconds of difference between the request from a localhost (host containing docker containers) and a remote host. Moreover, the browser took time for rendering too. Hence, now I can confirm from my side following: The curl request from localhost (inspired from this blog post): Querying data stream with 53000+ observations:

$ curl -w "@curl-format" -o outfile2 -s "/v1.0/Datastreams(17)/Observations" time_namelookup: 0.004 time_connect: 0.005 time_appconnect: 0.000 time_pretransfer: 0.005 time_redirect: 0.000 time_starttransfer: 2.543

time_total: 2.620

Querying datastream with 26000+ observations:

$ curl -w "@curl-format" -o outfile2 -s "/v1.0/Datastreams(46)/Observations" time_namelookup: 0.004 time_connect: 0.005 time_appconnect: 0.000 time_pretransfer: 0.005 time_redirect: 0.000 time_starttransfer: 1.688

time_total: 1.754

Regarding your query about the biggest observation and raw query times, I am not aware of how to do it. I will try to find it out gradually. If you have a quicker way, then it would save my time.

Thank you. PS: I edited to accomodate two datastreams of different number of observations

dschowta commented 6 years ago

I just observed that fetching the FeatureOfInterest corresponding to an observation is also slower. eg : /v1.0/Observations(6329953)/FeatureOfInterest takes approximately 5+ seconds.

justb4 commented 6 years ago

Same here, using latest Docker version (march 23, 2018) gave timeouts for queries on the complete Observations collection (8 million records in DB), for example:

Getting all Observations within time-period:

/v1.0/Observations?$filter=phenomenonTime gt ‘2018-01-29T11:00:00.000Z’ and phenomenonTime le ‘2018-01-29T12:00:00.000Z’&$select=result,phenomenonTime,parameters

or last 100 observations:

/v1.0/Observations?$top=100

Getting the last Observations for all Things (about 100) and their Locations and Datastreams (2-10) goes quite fast:

/v1.0/Things?$select=id,name,properties&$expand=Locations($select=location),Datastreams($select=id,name),Datastreams/Observations($select=id,phenomenonTime,result;$top=1)

Just a thought: Time-related attributes (phenomenonTime, resultTime) of Observation are not indexed within the database observation table. Though most Observation entity-attributes are encoded within a jsonb structure, it would still be possible to create indexes for jsonb-fields, see e.g. here and here. But "resultTime" and "phenomenonTime" JSON-fields are defined as strings...

.. Or add extra PG timestamp fields to observation table for indexing and querying. (I now see this was already discussed in gost/gost-db#3)

Also the Observations id key-attribute does not seem to have a dedicated (Primary Key) index.

tebben commented 5 years ago

Request timeout on all Observations should be fixed by adding an index to observation.id: https://github.com/gost/gost-db/commit/4b2c063c0f595eabcafb84be3967960e4e9b3591, I thought I had some problems with the query planner before with indexes on the observation id (expand queries extremely slow) but it seems to work just fine. The reason for not having indexes on time related fields in observation is because it will slow down the speed of inserting observations, we have to find some middle ground between the post and get speed. Let's see if we can use timescaleDB https://github.com/gost/gost-db/issues/4

dschowta commented 5 years ago

gost/gost-db@4b2c063 solves the delay