transientskp / tkp

A transients-discovery pipeline for astronomical image-based surveys
http://docs.transientskp.org/
BSD 2-Clause "Simplified" License

Database performance issues #417

Open · gijzelaerr opened this issue 9 years ago

gijzelaerr commented 9 years ago

At the moment the database can't keep up with the way we use it, and this problem will only get worse over time.

We have a list of short term (hacky) solutions to tackle the problem, but to solve this problem for the long term we may need to rethink our foundations.

This issue will function as a meta issue for the various solutions/research/discussions.

timstaley commented 9 years ago

I agree that in the long term we need to make the database analysis run as fast as possible, but I think we should be clear that the problems we're encountering right now are probably down to inefficient Django queries. Perhaps we should have a separate issue for that, if you want to do some 'blue sky thinking' here...

gijzelaerr commented 9 years ago

Well, I sort of agree; the problem is that I don't see any short-term ways to improve performance without removing some of the dynamic functionality.

gijzelaerr commented 9 years ago

Suddenly I realised something. I think we had this idea before, but now it starts to make more sense: make a query scheduler and resultset viewer.

How this would work:

Advantages:

Disadvantages:

AntoniaR commented 9 years ago

This sounds like a really great idea, despite the disadvantages.

  1. I don't mind Banana getting more intertwined with the TKP. I've discussed the outline with a few potential users that would not run TraP - they have mentioned they would simply populate a TraP database themselves and then use Banana. So I think the main users would want the TKP code anyway.
  2. Fair enough, more work to get it right but probably better in the long term.
  3. I suspect that our queries will get more and more complex with time - perhaps in the long term, you could also consider a place where people can manually enter their own SQL query?

gijzelaerr commented 9 years ago

In response to 3: I suddenly remembered a PostgreSQL feature named 'materialised views'; we should check that out.
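
For reference, a minimal sketch of what a materialised view could look like here; the metric columns and join conditions below are illustrative assumptions, not the actual TraP schema:

    -- Hypothetical sketch: precompute per-source aggregates once, so Banana
    -- reads a single table-like object instead of re-running the aggregation
    -- on every page load.
    CREATE MATERIALIZED VIEW runcat_metric AS
    SELECT a.runcat,
           COUNT(*)     AS datapoints,
           AVG(x.f_int) AS f_int_avg,   -- illustrative flux column
           MAX(x.f_int) AS f_int_max
    FROM assocxtrsource a
    JOIN extractedsource x ON x.id = a.xtrsrc
    GROUP BY a.runcat;

    -- Rebuild after each pipeline run; a plain REFRESH locks the view, and
    -- REFRESH ... CONCURRENTLY needs a unique index (PostgreSQL 9.4+).
    REFRESH MATERIALIZED VIEW runcat_metric;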

gijzelaerr commented 9 years ago

We have decided to store the augmented running catalog values in a new table called 'transient' at the end of the pipeline run. We will then visualise this in Banana, where people can query and filter the results. This should make Banana fast again.
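
To illustrate why this should speed things up: with the summary persisted at the end of the run, a typical Banana filter becomes a cheap scan of one small table. The column names below (v_int, eta_int) are placeholders for the variability metrics, not necessarily the final schema.

    -- Illustrative Banana-side query against the precomputed summary table;
    -- no joins against extractedsource/assocxtrsource are needed any more.
    SELECT runcat, v_int, eta_int
    FROM transient            -- table name still under discussion below
    WHERE v_int > 0.1 AND eta_int > 3
    ORDER BY eta_int DESC
    LIMIT 100;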

AntoniaR commented 9 years ago

Sounds like an excellent idea :D I'm not happy about the table name, though: the vast majority of the sources in the table will not be transients - they're just all the unique sources in the field with variability parameters.

gijzelaerr commented 9 years ago

Suggestions are welcome; augmentedrunningcatalog is just too long.

timstaley commented 9 years ago

Hmm. How about variability, or rcmetrics (runcat-metrics), or even just metrics?

gijzelaerr commented 9 years ago

In my opinion table names should always be singular, not plural.

http://stackoverflow.com/a/338690/575524 http://stackoverflow.com/a/5841297/575524

metrics may be too generic, while rcmetric is a bit ugly. variability doesn't sound bad to me, but it may also be too generic?

timstaley commented 9 years ago

Yeah, agreed on all counts. It's a tricky one. Other suggestions:

All a bit ugly, but we're in SQL land here.

AntoniaR commented 9 years ago

I agree they're a little ugly, but there are only so many options... Personally, I prefer varmetric or rcmetric, as that's more representative of what they are.

timstaley commented 9 years ago

OK, let's call the new table varmetric unless @gijzelaerr feels strongly otherwise.

gijzelaerr commented 9 years ago

Implemented in this branch; I need to update the docs & banana, and then we are probably done!

https://github.com/transientskp/tkp/tree/varmetric

gijzelaerr commented 9 years ago

Reading this issue back, I think Tim is right: the original description is quite broad, but still relevant. The PR is just a short-term workaround to tackle the current usability problems, so when the varmetric PR is merged this issue should stay open.

gijzelaerr commented 8 years ago

feedback from splendiddata:

Hi Gijs,

I think there is a resources issue here. Postgres shares caching with the OS, so for the best performance all the objects would fit into memory, but if I look at the two tables and their indexes the total size is almost 20 GB (19781.57814 MB).

So to handle that effectively, the most used objects would need to be found in either the DB or OS cache.

             relname              | relowner | class_owner | schema | relkind | class_type | relpages |   mbytes   
----------------------------------+----------+-------------+--------+---------+------------+----------+------------
 extractedsource                  |    16388 | antoniar    | public | r       | table      |   958392 | 7487.43750
 assocxtrsource                   |    16388 | antoniar    | public | r       | table      |   345705 | 2700.82031
 assocxtrsource_runcat_xtrsrc_key |    16388 | antoniar    | public | i       | index      |   128380 | 1002.96875
 assocxtrsource_xtrsrc            |    16388 | antoniar    | public | i       | index      |   112949 |  882.41406
 extractedsource_y                |    16388 | antoniar    | public | i       | index      |   105830 |  826.79688
 extractedsource_ra               |    16388 | antoniar    | public | i       | index      |   105516 |  824.34375
 extractedsource_x                |    16388 | antoniar    | public | i       | index      |   105298 |  822.64063
 extractedsource_decl             |    16388 | antoniar    | public | i       | index      |   105236 |  822.15625
 extractedsource_z                |    16388 | antoniar    | public | i       | index      |   105186 |  821.76563
 extractedsource_decl_err         |    16388 | antoniar    | public | i       | index      |   100684 |  786.59375
 extractedsource_ra_err           |    16388 | antoniar    | public | i       | index      |   100208 |  782.87500
 extractedsource_image            |    16388 | antoniar    | public | i       | index      |    98332 |  768.21875
 assocxtrsource_pkey              |    16388 | antoniar    | public | i       | index      |    83415 |  651.67969
 extractedsource_pkey             |    16388 | antoniar    | public | i       | index      |    76911 |  600.86719
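
For reference, a listing like the one above can be reproduced with a catalog query roughly along these lines; the exact query behind the report is not shown, so the aliases here are guesses.

    -- Rough reconstruction: per-relation page counts and on-disk size for
    -- tables (relkind 'r') and indexes (relkind 'i') in the public schema.
    SELECT c.relname,
           c.relkind,
           c.relpages,
           round(pg_relation_size(c.oid) / 1024.0 / 1024.0, 5) AS mbytes
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind IN ('r', 'i')
    ORDER BY c.relpages DESC;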

So the best thing is to have enough system memory and/or to find ways to optimize the queries further.

On the Postgres side there are a few parameters that can help. One is work_mem: it is a session setting that can be changed, and this memory is used for sorts. The default in your database is 100MB. When I first ran the query, I noticed this in the EXPLAIN ANALYZE output: Buckets: 262144  Batches: 2  Memory Usage: 94543kB

Batches: 2 means that it had to run the sort in two passes, since work_mem was too small for one batch.
When I ran

set work_mem='500MB';

the runtime went from Total runtime: 132177.915 ms to Total runtime: 55866.071 ms, and the memory usage became Buckets: 524288  Batches: 1  Memory Usage: 188996kB.
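
In session terms, the experiment described above looks roughly like this; the query shown is only a stand-in for the actual slow query, not the one that was profiled.

    -- Raise work_mem for this session only, then re-check the plan: the aim
    -- is for the hash/sort node to report "Batches: 1" instead of spilling.
    SET work_mem = '500MB';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT a.runcat, COUNT(*)
    FROM assocxtrsource a
    JOIN extractedsource x ON x.id = a.xtrsrc
    GROUP BY a.runcat;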

This setting will help for sorts, but you still have the overhead of reading the big assocxtrsource table from disk. There are 4 basic ways to get better performance there:

1. More system memory.
2. SSDs 
3. Find a way to use an index access path.
4. Stage some of the data in a temp table.

1 & 2 will be addressed by a new system, 4 is implemented in the varmetric PR.

gijzelaerr commented 8 years ago

Not sure what an index access path is, but it seems related to http://www.postgresql.org/docs/9.2/static/index-cost-estimation.html. I will set the described settings as defaults on vlo.
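
A possible way to make the setting persistent rather than per-session; ALTER SYSTEM needs PostgreSQL 9.4+, and on older versions the equivalent is a work_mem line in postgresql.conf. The 500MB value is simply the one from the report above, not a value tuned for vlo.

    -- Persist the larger sort memory cluster-wide, then reload the config.
    ALTER SYSTEM SET work_mem = '500MB';
    SELECT pg_reload_conf();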

bartscheers commented 8 years ago

It should be clear that we have two types of databases: a read/write one for processing the images with the TraP, and a read-only one for viewing the data with Banana. However, we are just using one database for everything, so we are creating the problems ourselves by not separating it into two distinct databases. This will only get worse when more users execute queries on the same read/write database, let alone the free-form SQL queries Antonia asked for.

A first thing we can do to make the rw (TraP) database faster is to use the COPY (INTO) statement instead of the INSERT statement we use now; that will make a huge difference. Also, the queries Django generates need more optimisation, since far too many columns are included in the queries, which causes extra IO and server CPU cycles.
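
As an illustration of the COPY suggestion; the column list is guessed from the index names above, and from inside the pipeline this would more likely be COPY ... FROM STDIN streamed through the database driver rather than a server-side file.

    -- Bulk-load a batch of extracted sources in a single statement instead
    -- of one INSERT per row; file path and column list are illustrative.
    COPY extractedsource (image, ra, decl, ra_err, decl_err)
    FROM '/tmp/extractedsource_batch.csv'
    WITH (FORMAT csv);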

AntoniaR commented 1 month ago

This issue is a key target for the redesign of TraP in R7. Moving this issue as there may be useful insights in here for the redesign.