moshthepitt opened this issue 4 years ago
Is this the query for zambia_irs_structures? And this for zambia_irs_export?
Uh never mind, verify is obviously not the right place... let me find the correct ones
it's in the deploy directory there
and that's one part; all the queries here are interconnected
k, is just the structures query taking hours? or is it and all its dependent queries taking hours?
Posting the query plan would help.
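For reference, a minimal way to capture that (assuming psql access to the reporting database; the actual defining query lives in the deploy directory):

```sql
-- Paste the defining SELECT from zambia_irs_structures.psql in place of the
-- placeholder below; ANALYZE + BUFFERS shows actual row counts, join order and I/O.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT 1;  -- placeholder for the view's defining query
```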
Some specific thoughts/questions below, and they're gonna be naive b/c I haven't looked at these yet so... bear w/me
on structures:
for export:
> Consider that TBL1, TBL2 and TBL3 have 10, 100 and 1000 records respectively. If the condition TBL1.ID = TBL2.ID returns only 5 records, whereas TBL2.ID = TBL3.ID returns 100 records, then it's better to join TBL1 and TBL2 first so that a smaller number of records gets joined with TBL3.
- it reads like join order optimization is NOT something the query planner will do for you, but maybe it will?
- it also reads like it's worth overriding Postgres defaults to test various join strategies manually, and like parallel hash joins (new in 11) can give a [speed boost](https://www.enterprisedb.com/postgres-tutorials/parallel-hash-joins-postgresql-explained)
^ last sub-items apply to any query w/joins
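For experimenting with the above, these are the sorts of session-level knobs I'd try (just a sketch, not a recommendation to bake any of this into the queries):

```sql
-- Make the planner keep joins in the order they are written, so different
-- orderings can be hand-tested:
SET join_collapse_limit = 1;
SET from_collapse_limit = 1;

-- Disable one join strategy at a time to compare the alternatives:
SET enable_nestloop = off;   -- likewise enable_hashjoin / enable_mergejoin

-- Let Postgres 11+ use parallel workers (and hence parallel hash joins):
SET max_parallel_workers_per_gather = 4;

-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the query after each change, then:
RESET ALL;
```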
> k, is just the structures query taking hours? or is it and all its dependent queries taking hours?
The structures query zambia_irs_structures.psql would be the best one to attack, as it is the foundation for the rest and is also quite slow.
why do we need a UUID v5?
We need to get a deterministic id based on structure id and task id. Maybe we can do it differently? Though I am not sure if this will move the needle very much
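For context, the v5 generation itself is just this (assuming the uuid-ossp extension; the column names are illustrative, not the real ones):

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- uuid_generate_v5(namespace, name) SHA-1 hashes namespace + name into a stable
-- UUID, so the same structure id + task id always maps to the same id and
-- re-running the query never invents new rows.
SELECT uuid_generate_v5(
         uuid_ns_url(),                                -- any fixed namespace will do
         structure_id::text || ':' || task_id::text    -- illustrative column names
       ) AS deterministic_id
FROM tasks;                                            -- placeholder source
```

It's a SHA-1 over a short string per row, so I'd agree it's probably not where the hours are going.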
Will look into the rest
Cool, that sounds good. These are brainstorming ideas, so please approach them as such; we'll have to approach this analytically, tweaking and re-running.
More on zambia_irs_structures
The other angle of attack is how do we restrict these queries so they are only operating on potential new data? Or at least only active plans?
Do plans even have an active status? But it looks like we're not doing a join on plans here; we could restrict on tasks.plan_identifiers to avoid a join, though we would need to have a pre-existing list of plans, which we'd need to compute.
Would want to look at the explain (or maybe this is an obvious SQL thing I've forgotten), but does this WHERE restrict before all the joins?
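Roughly what I have in mind, as a sketch (plans_to_refresh and the column names are hypothetical):

```sql
-- Pre-filter tasks down to the plans we care about before anything gets joined.
-- The planner will often push a WHERE like this below the joins anyway, but a
-- CTE makes the intent obvious in the EXPLAIN output.
WITH active_tasks AS (
    SELECT t.*
    FROM tasks t
    WHERE t.plan_identifier IN (          -- whatever the actual column on tasks is
        SELECT plan_identifier FROM plans_to_refresh   -- pre-computed plan list
    )
)
SELECT at.*
FROM active_tasks at;
-- ...the existing joins from zambia_irs_structures.psql would hang off active_tasks
```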
> The other angle of attack is how do we restrict these queries so they are only operating on potential new data? Or at least only active plans?
I interpreted that two ways:
So along these lines, what I have been thinking of is to make it something of a hybrid table/view/something, i.e. not a true materialized view.
We only need to update for plans that are receiving data, and it should be somehow possible to compute the information for the old plans once and merge that with information for plans that are receiving data.
I think that something like that approach may be the most sustainable long term; e.g. we have something like 4x the data that we had last year, and the season hasn't even started.
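A very rough sketch of what that hybrid could look like (all names made up; structures_query stands in for the defining query in zambia_irs_structures.psql):

```sql
-- 1. Backfill once into a plain table instead of a true materialized view.
CREATE TABLE zambia_irs_structures_tbl AS
SELECT * FROM structures_query;

-- 2. On each refresh, recompute only the plans still receiving data and splice
--    them back in; rows for old/closed plans are computed once and never touched.
BEGIN;
DELETE FROM zambia_irs_structures_tbl
 WHERE plan_id IN (SELECT plan_identifier FROM active_plans);   -- hypothetical names

INSERT INTO zambia_irs_structures_tbl
SELECT * FROM structures_query
 WHERE plan_id IN (SELECT plan_identifier FROM active_plans);
COMMIT;
```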
Plans and tasks do have a status field that indicates if they are active, retired, etc., and this query doesn't check for that at the moment. It would help, and not hurt, to leave out retired plans and canceled tasks.
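If the status values are roughly what I'd guess, the filter is just something like this (join keys and exact status values would need to be confirmed against the schema):

```sql
SELECT t.*
FROM tasks t
JOIN plans p ON p.identifier = t.plan_identifier   -- hypothetical join keys
WHERE p.status <> 'retired'                        -- leave out retired plans
  AND t.status <> 'Cancelled';                     -- and canceled tasks
```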
Cool, let's do B for now, that's what I was thinking of.
I agree A is a good idea, but we need to discuss the approach there more and it'll take longer. A heuristic way to do A through B is to create a last_used_at (or something, there's a better way than this I'm sure) flag that we set to now whenever something related to the plan is touched, then we modify B's approach to only consider plans w/last_used_at greater than the most recent time the view was updated -- needs more thought for sure. Anyhow, for now, B seems like a quickish win?
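A sketch of that last_used_at idea, trigger-based (every name here is hypothetical, including the view_refresh_log bookkeeping table):

```sql
-- Track when a plan was last touched by any related write.
ALTER TABLE plans ADD COLUMN last_used_at timestamptz;

-- Bump it whenever a task belonging to the plan is inserted or updated.
CREATE OR REPLACE FUNCTION touch_plan_last_used() RETURNS trigger AS $$
BEGIN
    UPDATE plans
       SET last_used_at = now()
     WHERE identifier = NEW.plan_identifier;   -- hypothetical join key
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_touch_plan
AFTER INSERT OR UPDATE ON tasks
FOR EACH ROW EXECUTE FUNCTION touch_plan_last_used();

-- At refresh time, only recompute plans touched since the last refresh.
SELECT identifier
  FROM plans
 WHERE last_used_at > (SELECT last_refreshed_at
                         FROM view_refresh_log
                        WHERE view_name = 'zambia_irs_structures');
```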
Noticed that the Zambia reports have problems:
^^ this is on Reveal stage.
Need to look into this and try to fix.