gijzelaerr opened 9 years ago
I agree that in the long term we need to make the database analysis run as fast as possible, but I think we should be clear that the problems we're encountering right now are probably due to inefficient Django queries; perhaps we should have a separate issue for that, if you want to do some 'blue sky thinking' here...
Well, I sort of agree; the problem is that I don't see any short-term ways to improve performance without removing some of the dynamic functionality.
Suddenly I realised something. I think we had this idea before, but now it starts to make more sense: make a query scheduler and resultset viewer.
How this would work:
Advantages:
Disadvantages:
This sounds like a really great idea, despite the disadvantages.
In response to 3: I suddenly remembered a PostgreSQL feature named 'materialised views'; check that out.
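For illustration, a minimal sketch of how a materialised view could precompute an expensive aggregation for banana. Note that materialised views need PostgreSQL 9.3+; the view name and the wm_ra/wm_decl/f_int columns are hypothetical, while assocxtrsource and extractedsource appear in the size listing further down this thread.

```sql
-- A materialised view precomputes the expensive aggregation once;
-- banana then reads the cached result instead of re-running the joins.
-- Requires PostgreSQL 9.3+. The view name and the wm_ra, wm_decl and
-- f_int columns are hypothetical placeholders.
CREATE MATERIALIZED VIEW runcat_summary AS
SELECT rc.id,
       rc.wm_ra,
       rc.wm_decl,
       COUNT(ax.xtrsrc) AS datapoints,
       MAX(ex.f_int)    AS max_flux
FROM   runningcatalog rc
JOIN   assocxtrsource  ax ON ax.runcat = rc.id
JOIN   extractedsource ex ON ex.id = ax.xtrsrc
GROUP  BY rc.id, rc.wm_ra, rc.wm_decl;

-- Refresh at the end of each pipeline run; reads stay fast in between.
REFRESH MATERIALIZED VIEW runcat_summary;
```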
We have decided to store the augmented running catalog values in a new table called 'transient' at the end of the pipeline run. We will then visualise this in banana, where people can query and filter the results. This should make banana fast again.
Sounds like an excellent idea :D Not happy about the table name though. The vast majority of the sources in the table will not be transients - they're just all the unique sources in the field with variability parameters.
Suggestions are welcome; augmentedrunningcatalog is just too long.
Hmm. How about variability
or rcmetrics
(runcat-metrics) or even just metrics
In my opinion table names should always be singular, not plural.
http://stackoverflow.com/a/338690/575524 http://stackoverflow.com/a/5841297/575524
metrics may be too generic, while rcmetric is a bit ugly. variability doesn't sound bad to me, but may be too generic also?
Yeah, agreed on all counts. It's a tricky one. Other suggestions:
varmetric
varmeasure
varcalc
All a bit ugly, but we're in SQL land here.
I agree it's a little ugly, but there are only so many options... Personally, I prefer varmetric or rcmetric, as that's more representative of what they are.
OK, let's call the new table varmetric, unless @gijzelaerr feels strongly otherwise.
Implemented in this branch; need to update the doc & banana, and then we are probably done!
Reading this issue back, I think Tim is right: the original description is quite broad, but still relevant. The PR is just a short-term workaround to tackle current usability problems, so when the varmetric PR is merged this issue should stay open.
Feedback from splendiddata:
Hi Gijs,
I think there is a resource issue here. Postgres shares caching with the OS, so performance would be best if all the objects could fit into memory.
But if I look at the tables and indexes, the total size is almost 20 GB (19781.57814 MB), so to handle that effectively the most-used objects would need to be found in either the DB or OS cache.
relname                          | relowner | class_owner | schema | relkind | class_type | relpages |     mbytes
---------------------------------+----------+-------------+--------+---------+------------+----------+------------
extractedsource                  |    16388 | antoniar    | public | r       | table      |   958392 | 7487.43750
assocxtrsource                   |    16388 | antoniar    | public | r       | table      |   345705 | 2700.82031
assocxtrsource_runcat_xtrsrc_key |    16388 | antoniar    | public | i       | index      |   128380 | 1002.96875
assocxtrsource_xtrsrc            |    16388 | antoniar    | public | i       | index      |   112949 |  882.41406
extractedsource_y                |    16388 | antoniar    | public | i       | index      |   105830 |  826.79688
extractedsource_ra               |    16388 | antoniar    | public | i       | index      |   105516 |  824.34375
extractedsource_x                |    16388 | antoniar    | public | i       | index      |   105298 |  822.64063
extractedsource_decl             |    16388 | antoniar    | public | i       | index      |   105236 |  822.15625
extractedsource_z                |    16388 | antoniar    | public | i       | index      |   105186 |  821.76563
extractedsource_decl_err         |    16388 | antoniar    | public | i       | index      |   100684 |  786.59375
extractedsource_ra_err           |    16388 | antoniar    | public | i       | index      |   100208 |  782.87500
extractedsource_image            |    16388 | antoniar    | public | i       | index      |    98332 |  768.21875
assocxtrsource_pkey              |    16388 | antoniar    | public | i       | index      |    83415 |  651.67969
extractedsource_pkey             |    16388 | antoniar    | public | i       | index      |    76911 |  600.86719
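For reference, a minimal sketch of the kind of pg_class query that could produce the listing above. The actual query used isn't shown in the thread, so treat this as an approximation (it omits the class_owner join against pg_roles); relpages counts 8 kB blocks.

```sql
-- relpages counts 8 kB blocks, so relpages * 8 / 1024.0 gives MB,
-- matching the mbytes column in the listing above.
SELECT c.relname,
       c.relowner,
       n.nspname AS schema,
       c.relkind,
       CASE c.relkind WHEN 'r' THEN 'table'
                      WHEN 'i' THEN 'index' END AS class_type,
       c.relpages,
       round(c.relpages * 8 / 1024.0, 5) AS mbytes
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  n.nspname = 'public'
ORDER  BY c.relpages DESC;
```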
So the best thing is to have enough system memory and/or find ways to optimise the queries more.
From the Postgres side there are a few parameters that can help. One is work_mem, a session setting that can be changed; this memory is used for sorts (and hash tables). The default in your database is 100MB. When I first ran the query, I noticed this in the EXPLAIN ANALYZE output: Buckets: 262144 Batches: 2 Memory Usage: 94543kB
Batches: 2 means it had to process the data in two passes, since work_mem was too small for a single batch.
When I ran
set work_mem='500MB';
the run time went from Total runtime: 132177.915 ms to Total runtime: 55866.071 ms
and the memory usage became Buckets: 524288 Batches: 1 Memory Usage: 188996kB
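A minimal sketch of how this kind of check can be reproduced in a psql session. The SELECT below is a hypothetical stand-in for the real banana query, and the ex.id column is an assumption about the schema.

```sql
-- Raise the per-session sort/hash memory, then inspect the plan.
SET work_mem = '500MB';

-- Hypothetical stand-in for the real banana query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ax.runcat, COUNT(*)
FROM   assocxtrsource  ax
JOIN   extractedsource ex ON ex.id = ax.xtrsrc
GROUP  BY ax.runcat;
-- "Batches: 1" in the hash node means everything fit in work_mem;
-- more than one batch means Postgres had to spill to disk.
```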
This setting will help for sorts, but you still have the overhead of reading the big assocxtrsource table from disk. There are four basic ways to get better performance there:
1. More system memory.
2. SSDs
3. Find a way to use an index access path.
4. Stage some of the data in a temp table (see the sketch after this list).
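To illustrate option 4, a minimal sketch of staging a subset into a temp table before the heavy joins. The runningcatalog.dataset filter column and the dataset id are assumptions about the schema.

```sql
-- Option 4: stage just the rows for one dataset in a temp table, so the
-- repeated heavy joins scan a small staged copy instead of the full
-- 2.7 GB assocxtrsource table. The runningcatalog.dataset filter column
-- is an assumption about the schema.
CREATE TEMP TABLE staged_assoc AS
SELECT ax.runcat, ax.xtrsrc
FROM   assocxtrsource ax
JOIN   runningcatalog rc ON rc.id = ax.runcat
WHERE  rc.dataset = 42;          -- hypothetical dataset id

ANALYZE staged_assoc;            -- give the planner statistics

-- Subsequent queries join against staged_assoc instead of assocxtrsource.
```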
1 & 2 will be addressed by a new system, 4 is implemented in the varmetric
PR.
Not sure what an index access path is, but it seems related to http://www.postgresql.org/docs/9.2/static/index-cost-estimation.html. Will set the described settings as defaults on vlo.
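Presumably setting the described settings "as defaults" amounts to something like the following (the database name here is a placeholder):

```sql
-- Make the larger work_mem the default for every new session on this
-- database instead of setting it per session. "tkp" is a placeholder
-- database name.
ALTER DATABASE tkp SET work_mem = '500MB';
-- Alternatively, set work_mem = '500MB' in postgresql.conf and reload.
```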
It should be clear that we have two types of database: a read/write one used by the TraP for processing the images, and a read-only one used by Banana for viewing the data. However, we are just using one database for both, so we are creating the problems ourselves by not separating it into two distinct databases. This will only get worse when more users execute queries on the same read/write database, let alone the free-form SQL queries Antonia asked for. A first thing we can do to make the rw (TraP) database faster is to use the COPY (INTO) statement instead of the INSERT statements we use now; that will make a huge difference. Also, the queries Django generates need more optimisation, since far too many columns are included in the queries, which causes extra IO and server CPU cycles.
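As an illustration, a minimal sketch of what the bulk-load change could look like in PostgreSQL ('COPY (INTO)' presumably covers both PostgreSQL's COPY and MonetDB's COPY INTO spelling of the same idea). The file path and column list are hypothetical.

```sql
-- Bulk load with COPY: one statement and one parse for many rows, far
-- less round-trip and per-row overhead than individual INSERTs.
-- The file path and column list are hypothetical.
COPY extractedsource (image, ra, decl, ra_err, decl_err)
FROM '/tmp/extractedsource.csv' WITH (FORMAT csv);

-- The slow pattern being replaced, one statement per source:
-- INSERT INTO extractedsource (image, ra, decl, ra_err, decl_err)
-- VALUES (...);
```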
This issue is a key target for the redesign of TraP in R7. Moving it over, as there may be useful insights in here for the redesign.
At the moment the database can't keep up with the way we use it. This problem will get worse over time.
We have a list of short-term (hacky) solutions to tackle the problem, but to solve it for the long term we may need to rethink our foundations.
This issue will function as a meta-issue for the various solutions/research/discussions.