Slow plotting - Githubissues

svrieze commented 9 years ago

Many of the health information plots generated at runtime are quite slow. So slow that it interrupts the user experience. Example: the Tobacco Use question "How old were you when you tried your first cigarette" took 20 seconds to load today.

svrieze commented 9 years ago

The slow plotting issue applies to almost all plots (almost all take >5 seconds), but these ones are especially slow:
--> About how old were you when you had your first drink?
--> How old were you the first time you got drunk?
--> Does anyone in your family have type 1 diabetes?
--> Does anyone in your family have type 2 diabetes?
--> What is the highest grade or level of school you have completed or the highest degree you have received?
--> Describe your employment
--> Personality (also, the bars show up much later than the percentile ranks)
--> How many children do you have?
--> Over the past few years, have you had any problems with sleep...
--> Over the past few years, have you had problems with memory...
--> Gluten
--> Both psoriasis plots
--> Do you have any difficulty with your hearing?
--> Both gastrointestinal

kevinwli commented 9 years ago

In most charts, we do almost everything at run time, including:

1) query the smallest answer
2) query the largest answer
3) query the total number of answers to calculate the best number of columns for the chart (using a log formula)
4) compute the interval for each bar
5) query the number of answers in each interval (this step may take up to 8-9 queries)

Most of these queries also have a sub-query (nested query) to avoid double counting when people have answered the survey multiple times.

Do we want to change to a simpler concept/algorithm for this?
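For reference, steps 3 and 4 above can be sketched as follows. This is only an illustration in Python (the site itself is PHP, per the discussion below). The thread does not say which log formula is used; Sturges' rule, ceil(log2(n)) + 1, is assumed here, and the function names are hypothetical.

```python
import math


def sturges_bin_count(n_answers):
    """Number of bars from Sturges' rule (an assumed stand-in for the log formula)."""
    if n_answers < 1:
        return 1
    return math.ceil(math.log2(n_answers)) + 1


def bin_edges(smallest, largest, n_bins):
    """Equal-width interval boundaries covering [smallest, largest]."""
    width = (largest - smallest) / n_bins
    # n_bins + 1 boundaries; the last one is pinned to the largest answer.
    return [smallest + i * width for i in range(n_bins)] + [largest]
```

Under this assumed rule, 256 answers give 9 bars, which is in line with the 8-9 per-interval queries mentioned in step 5.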

svrieze commented 9 years ago

Any chance we know if one step is more of a bottleneck? I would think step 5 and the sub-queries are the culprits.

Great to have it all go at runtime, and no doubt a simpler algorithm can be found. Any possibility of parallelizing independent queries, such as 1&2 together, and step 5? That might actually be technically harder to implement than a simpler algorithm. Another potentially simpler option is to upgrade or expand hardware capabilities.
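One way to find out which step is the bottleneck is to time each step separately. The sketch below is a Python illustration, not the site's code; `timed` and `slowest_step` are hypothetical helpers, and each chart step would be wrapped in its own `with timed(...)` block.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per step name.
timings = {}


@contextmanager
def timed(step_name):
    """Add the time spent inside the with-block to timings[step_name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step_name] = timings.get(step_name, 0.0) + elapsed


def slowest_step(step_timings):
    """Name of the step with the largest accumulated time."""
    return max(step_timings, key=step_timings.get)
```

Wrapping steps 1 to 5 this way for one slow chart would show whether step 5 and the sub-queries really dominate before any rewrite is attempted.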

kevinwli commented 9 years ago

I agree. With a much larger amount of real data in our database, the sub-queries may not be that important and so may be dropped.

Also, I can reduce the number of queries in step 5 and instead process the data in RAM using arrays. This would work, particularly once we move to a bigger box. But it would take time to implement, depending on task priority.

Multi-threading in PHP is possible, but may not be worth doing. The first 2 steps take less than 0.1 second each, and in most cases the code has to be executed in order.

abecasis commented 9 years ago

My proposal on how to speed these up:

Run a query (probably with the filter for duplicate answers) that counts the number of answers of each type.

Store the answers and counts in an appropriate array.

Then, the previous steps can all proceed with no additional queries: you should be able to (a) find the largest answer, (b) find the smallest answer, (c) find the total number of answers, and (d) count the answers in each interval.

I think with a single query per chart and the rest of the processing limited to simple counting and sums across arrays that have a handful of elements, we should see much improved performance.
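A minimal sketch of this proposal, in Python with SQLite rather than the site's PHP. The table and column names (`answers`, `user_id`, `question_id`, `answer`, `submitted_at`) are assumptions, as is the choice to handle duplicates by keeping each user's most recent answer; the thread only says duplicates should not be double counted.

```python
import sqlite3


def answer_counts(conn, question_id):
    """One query per chart: count each distinct answer, latest response per user only."""
    rows = conn.execute(
        """
        SELECT a.answer, COUNT(*)
        FROM answers AS a
        JOIN (SELECT user_id, MAX(submitted_at) AS latest
              FROM answers
              WHERE question_id = ?
              GROUP BY user_id) AS last
          ON a.user_id = last.user_id AND a.submitted_at = last.latest
        WHERE a.question_id = ?
        GROUP BY a.answer
        """,
        (question_id, question_id),
    ).fetchall()
    return dict(rows)  # {answer: count}


def chart_summary(counts, n_bins):
    """Derive smallest, largest, total and per-bar counts from the array alone."""
    if not counts:
        raise ValueError("no answers for this question")
    smallest, largest = min(counts), max(counts)
    total = sum(counts.values())
    # Fall back to width 1 when every answer is identical.
    width = (largest - smallest) / n_bins or 1
    bars = [0] * n_bins
    for answer, count in counts.items():
        # The largest answer falls on the upper edge; clamp it into the last bar.
        index = min(int((answer - smallest) / width), n_bins - 1)
        bars[index] += count
    return smallest, largest, total, bars
```

After the single `answer_counts` call, everything else is arithmetic over a dictionary with a handful of entries, so the 8-9 per-interval queries of step 5 are no longer needed.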

svrieze commented 9 years ago

If it is a lot easier, you could also consider generating each plot at a regular interval (every 10 minutes?). At this point I don't think the user would notice the difference between bar plots generated 9 minutes ago and those generated at runtime. The scatterplots would still need to be generated at runtime because the user's value is shown directly.

I'm noticing that the health trackers that show averages also take significant time to generate.
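The idea of reusing a bar plot for up to 10 minutes can be sketched with a small time-based cache. Note that this is a variant of the suggestion: instead of regenerating on a timer, it regenerates on the first request after the cached copy expires. `TimedCache` and the `compute` callback are hypothetical names, and the code is Python for illustration only.

```python
import time


class TimedCache:
    """Keep computed chart data for max_age_seconds, then recompute on demand."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}  # key -> (created_at, value)

    def get(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]  # still fresh: no database work
        value = compute()  # expired or missing: rebuild once
        self._entries[key] = (now, value)
        return value
```

With `TimedCache(600)`, only the first visitor in each 10-minute window pays the query cost for a given plot; everyone else gets the stored result.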

abecasis commented 9 years ago

Although, for the scatter plot, you could use a static query for everybody and then add the user's points.

kevinwli / Genes-for-Good

Slow plotting #4