moshthepitt opened this issue 4 years ago
Is this the query for zambia_irs_structures? And this for zambia_irs_export?
Uh never mind, verify is obviously not the right place... let me find the correct ones
it's in the deploy directory there
and that's one part; all the queries here are interconnected
k, is just the structures query taking hours? or is it and all its dependent queries taking hours?
Posting the query plan would help.
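For reference, a minimal way to capture that (assuming psql access to the reporting database; the actual defining query lives in the deploy directory):

```sql
-- Paste the defining SELECT from zambia_irs_structures.psql in place of the
-- placeholder below; ANALYZE + BUFFERS shows actual row counts, join order and I/O.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT 1;  -- placeholder for the view's defining query
```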
Some specific thoughts/questions below, and they're gonna be naive b/c I haven't looked at these yet so... bear w/me
on structures:
for export:
> Consider that TBL1, TBL2 and TBL3 have 10, 100 and 1000 records respectively. If the condition TBL1.ID = TBL2.ID returns only 5 records, whereas TBL2.ID = TBL3.ID returns 100 records, then it's better to join TBL1 and TBL2 first so that a smaller number of records gets joined with TBL3.
- it reads like join order optimization is NOT something the query planner will do for you, but maybe it will?
- it also reads like it's worth overriding Postgres defaults to test various join strategies manually, and like parallel hash joins (new in 11) can give a [speed boost](https://www.enterprisedb.com/postgres-tutorials/parallel-hash-joins-postgresql-explained)
^ last sub-items apply to any query w/joins
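For experimenting with the above, these are the sorts of session-level knobs I'd try (just a sketch, not a recommendation to bake any of this into the queries):

```sql
-- Make the planner keep joins in the order they are written, so different
-- orderings can be hand-tested:
SET join_collapse_limit = 1;
SET from_collapse_limit = 1;

-- Disable one join strategy at a time to compare the alternatives:
SET enable_nestloop = off;   -- likewise enable_hashjoin / enable_mergejoin

-- Let Postgres 11+ use parallel workers (and hence parallel hash joins):
SET max_parallel_workers_per_gather = 4;

-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the query after each change, then:
RESET ALL;
```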
> k, is just the structures query taking hours? or is it and all its dependent queries taking hours?
The structures query zambia_irs_structures.psql would be the best one to attack, as it is the foundation for the rest and is also quite slow.
why do we need a UUID v5?
We need to get a deterministic id based on structure id and task id. Maybe we can do it differently? Though I am not sure if this will move the needle very much
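For context, the v5 generation itself is just this (assuming the uuid-ossp extension; the column names are illustrative, not the real ones):

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- uuid_generate_v5(namespace, name) SHA-1 hashes namespace + name into a stable
-- UUID, so the same structure id + task id always maps to the same id and
-- re-running the query never invents new rows.
SELECT uuid_generate_v5(
         uuid_ns_url(),                                -- any fixed namespace will do
         structure_id::text || ':' || task_id::text    -- illustrative column names
       ) AS deterministic_id
FROM tasks;                                            -- placeholder source
```

It's a SHA-1 over a short string per row, so I'd agree it's probably not where the hours are going.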
Will look into the rest
Cool, that sounds good. These are brainstorming ideas, so please approach them as such; we'll have to approach this analytically, tweaking and re-running.
More on zambia_irs_structures
The other angle of attack is how do we restrict these queries so they are only operating on potential new data? Or at least only active plans?
Do plans even have an active status? But it looks like we're not doing a join on plans here; we could restrict on tasks.plan_identifiers to avoid a join, though we would need to have a pre-existing list of plans, which we'd need to compute.
Would want to look at the explain (or maybe this is an obvious SQL thing I've forgotten), but does this WHERE restrict before all the joins?
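Roughly what I have in mind, as a sketch (plans_to_refresh and the column names are hypothetical):

```sql
-- Pre-filter tasks down to the plans we care about before anything gets joined.
-- The planner will often push a WHERE like this below the joins anyway, but a
-- CTE makes the intent obvious in the EXPLAIN output.
WITH active_tasks AS (
    SELECT t.*
    FROM tasks t
    WHERE t.plan_identifier IN (          -- whatever the actual column on tasks is
        SELECT plan_identifier FROM plans_to_refresh   -- pre-computed plan list
    )
)
SELECT at.*
FROM active_tasks at;
-- ...the existing joins from zambia_irs_structures.psql would hang off active_tasks
```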
> The other angle of attack is how do we restrict these queries so they are only operating on potential new data? Or at least only active plans?
I interpreted that two ways:
So along these lines, what I have been thinking of is to make it something of a hybrid table/view/something, i.e. not a true materialized view.
We only need to update for plans that are receiving data, and it should be somehow possible to compute the information for the old plans once and merge that with information for plans that are receiving data.
I think that something like that approach may be the most sustainable long term; e.g. we have something like 4x the data that we had last year, and the season hasn't even started.
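A very rough sketch of what that hybrid could look like (all names made up; structures_query stands in for the defining query in zambia_irs_structures.psql):

```sql
-- 1. Backfill once into a plain table instead of a true materialized view.
CREATE TABLE zambia_irs_structures_tbl AS
SELECT * FROM structures_query;

-- 2. On each refresh, recompute only the plans still receiving data and splice
--    them back in; rows for old/closed plans are computed once and never touched.
BEGIN;
DELETE FROM zambia_irs_structures_tbl
 WHERE plan_id IN (SELECT plan_identifier FROM active_plans);   -- hypothetical names

INSERT INTO zambia_irs_structures_tbl
SELECT * FROM structures_query
 WHERE plan_id IN (SELECT plan_identifier FROM active_plans);
COMMIT;
```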
Plans and tasks do have a status field that indicates if they are active, retired, etc., and this query doesn't check for that at the moment. It would help, and not hurt, to leave out retired plans and canceled tasks.
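If the status values are roughly what I'd guess, the filter is just something like this (join keys and exact status values would need to be confirmed against the schema):

```sql
SELECT t.*
FROM tasks t
JOIN plans p ON p.identifier = t.plan_identifier   -- hypothetical join keys
WHERE p.status <> 'retired'                        -- leave out retired plans
  AND t.status <> 'Cancelled';                     -- and canceled tasks
```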
Cool, let's do B for now, that's what I was thinking of.
I agree A is a good idea, but we need to discuss the approach there more and it'll take longer. A heuristic way to do A through B is to create a last_used_at (or something, there's a better way than this I'm sure) flag that we set to now whenever something related to the plan is touched, then we modify B's approach to only consider plans w/last_used_at greater than the most recent time the view was updated -- needs more thought for sure. Anyhow, for now, B seems like a quickish win?
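A sketch of that last_used_at idea, trigger-based (every name here is hypothetical, including the view_refresh_log bookkeeping table):

```sql
-- Track when a plan was last touched by any related write.
ALTER TABLE plans ADD COLUMN last_used_at timestamptz;

-- Bump it whenever a task belonging to the plan is inserted or updated.
CREATE OR REPLACE FUNCTION touch_plan_last_used() RETURNS trigger AS $$
BEGIN
    UPDATE plans
       SET last_used_at = now()
     WHERE identifier = NEW.plan_identifier;   -- hypothetical join key
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_touch_plan
AFTER INSERT OR UPDATE ON tasks
FOR EACH ROW EXECUTE FUNCTION touch_plan_last_used();

-- At refresh time, only recompute plans touched since the last refresh.
SELECT identifier
  FROM plans
 WHERE last_used_at > (SELECT last_refreshed_at
                         FROM view_refresh_log
                        WHERE view_name = 'zambia_irs_structures');
```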
Noticed that the Zambia reports have problems:
^^ this is on Reveal stage.
Need to look into this and try to fix.