jhu-bids / TermHub

Web app and CLI tools for working with biomedical terminologies. https://github.com/orgs/jhu-bids/projects/9/views/7
https://bit.ly/termhub
GNU General Public License v3.0

Grouping: Optimize graph data download & display (AKA Very large cset crashes) #514

Open Sigfried opened 1 year ago

Sigfried commented 1 year ago

Overview

When there are too many concepts in a cset, it causes problems.

Goals:

  1. Don't fetch so much data that it crashes or hobbles the browser
  2. Don't display so much data that it crashes or hobbles the browser

Implementation

See frontend testing documentation

Sub-task list

This involves investigating the nature of the problem, as well as implementing some of the "optimization options". Sub-tasks are sorted in the order we plan to try them.

Figure out what's causing performance and scaling problems

Done:

joeflack4 commented 1 year ago
Context: Screenshots of cset counts frequency histograms

This context may be useful for deciding at what size we consider concept sets "too large" and in need of this filtering. However, I think the better approach is to do some manual testing in the wild: see how long csets take to load (if they load at all) at various sizes, on various people's computers, and set a default for filtering out large csets based on that.

[Images: histogram_concepts_by_cset 2, histogram_concepts_by_cset]

joeflack4 commented 1 year ago

@Sigfried At the DL meeting this Thursday, the main discussion was between Harold and me, and it was on this topic. I edited your original post. Please take a look when you're back.

joeflack4 commented 11 months ago

Discussion 2023/09/18 meeting

Additional discussion for "4. General hiding of concepts / subtrees".

Steps

  1. Remove RxNorm
  2. Remove 0-patient concepts
  3. Collapse deepest levels

Difficulties

The problem is that when you remove some of these things, you must not break the links between relevant parts of the tree. E.g. (a: normal node) -> (b: 0-patient concept) -> (c: normal node). If we remove 'b', we need to add it back because it has a child that is a normal node. We should only remove a node if it is a leaf, or if all of its children are also to be removed.
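The removal rule above can be sketched roughly as follows. This is only an illustration, not TermHub's actual code: the function name `removable_nodes` and the `children` / `flagged` data shapes are assumptions made for the example.

```python
def removable_nodes(children, flagged):
    """Return the flagged nodes that can be hidden without orphaning a kept node.

    children: dict mapping node -> list of child nodes (hypothetical shape)
    flagged:  set of nodes we would like to hide (e.g. 0-patient concepts)
    """
    memo = {}

    def can_remove(node):
        # A node is removable only if it is flagged AND every child is removable.
        # A flagged leaf has no children, so all([]) is True and it is removable.
        if node not in memo:
            memo[node] = node in flagged and all(
                can_remove(child) for child in children.get(node, [])
            )
        return memo[node]

    all_nodes = set(children) | {c for kids in children.values() for c in kids}
    return {n for n in all_nodes if can_remove(n)}


# 'b' is flagged but has a normal child 'c', so it must stay.
# 'd' and its only child 'e' are both flagged, so both can go.
example_children = {"a": ["b", "d"], "b": ["c"], "d": ["e"]}
example_flagged = {"b", "d", "e"}
example_result = removable_nodes(example_children, example_flagged)
```

The memo dict keeps the check linear in the number of edges even when a concept has several parents. For a very deep hierarchy, the recursion would probably need to be replaced with an explicit stack.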

Implementation

If you hide 100 0-patient concepts, add a stub concept that indicates that you did so and allows the user to add them back in.
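One possible shape for such a stub, as a sketch only: the function name `add_stubs` and the stub fields (`stub`, `hidden_count`, `hidden_ids`) are hypothetical, and the count here covers only the directly hidden children of each visible parent.

```python
def add_stubs(children, hidden):
    """Build a display tree in which hidden children collapse into one stub per parent.

    children: dict mapping node -> list of child nodes (hypothetical shape)
    hidden:   set of nodes that have been chosen for hiding
    """
    display = {}
    for parent, kids in children.items():
        if parent in hidden:
            continue  # hidden nodes get no row of their own
        visible = [k for k in kids if k not in hidden]
        hidden_kids = [k for k in kids if k in hidden]
        if hidden_kids:
            # The stub records what was hidden so the UI can offer to restore it.
            visible.append(
                {"stub": True, "hidden_count": len(hidden_kids), "hidden_ids": hidden_kids}
            )
        display[parent] = visible
    return display


stub_example = add_stubs({"a": ["b", "d"], "b": ["c"], "d": ["e"]}, {"d", "e"})
```

Keeping the hidden ids on the stub means "add back in" can be handled by removing those ids from the hidden set and rebuilding the display tree.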

joeflack4 commented 11 months ago

Cache clearing options

joeflack4 commented 11 months ago

Some notes on where Siggie was at in terms of progress and questions, based on our Scrum meeting yesterday:

performance and testing

joeflack4 commented 11 months ago

Hope and Siggie put this together for testing: https://docs.google.com/spreadsheets/d/1awyFHCbmr32mR9f1d7i0qIMHi17PESkTlVs58isEECI/edit#gid=1494465654

joeflack4 commented 11 months ago

We looked at the above sheet today. Siggie suspects that when you compare csets but then clear selected csets, memory does not go down. If this is so, is it because our cache is using memory and not disk? Do we want to change that in any way?

We decided that different tests should have different timeout thresholds, some of them being longer than 1 minute because that's how long it takes to render the comparison.

We decided to see qualitatively what's going on when we load these large csets. We tried this as an example: https://icy-ground-0416a040f.2.azurestaticapps.net/cset-comparison?codeset_ids=909552172 It has over 75k expressions, without includeDescendants, so it has the same number of members. On Siggie's computer, it failed with "Aw, Snap!" after about 1-2 minutes. On Joe's computer, after about a minute, a dialog popped up with "page unresponsive; wait or exit?". The page was laggy: Joe could not open the developer console or change the zoom of the window. Joe chose to wait, and the page quit with "Aw, Snap!" after about 3-4 minutes.

Perhaps one issue is the interaction between DataCache and localStorage. We have a DataCache object that fetches data if you haven't already gotten it before; otherwise it returns the data from cache. Every time it gets new data, it needs to write out the whole cache again to localStorage. That involves uncompress/compress and JSON stringify/parse.
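To get a feel for why rewriting the whole cache on every fetch could hurt, here is a rough Python stand-in. It is not the real DataCache (which lives in the browser and writes to localStorage); the class names and the `bytes_serialized` counter are invented for the comparison, and a plain dict plays the role of storage.

```python
import json
import zlib


class WholeCacheStore:
    """Re-serializes and re-compresses every entry on each put (one big blob)."""

    def __init__(self):
        self.cache = {}
        self.blob = b""
        self.bytes_serialized = 0  # total JSON bytes produced across all writes

    def put(self, key, value):
        self.cache[key] = value
        raw = json.dumps(self.cache).encode("utf-8")  # the entire cache, every time
        self.bytes_serialized += len(raw)
        self.blob = zlib.compress(raw)


class PerKeyStore:
    """Serializes only the entry being written (one blob per key)."""

    def __init__(self):
        self.blobs = {}
        self.bytes_serialized = 0

    def put(self, key, value):
        raw = json.dumps(value).encode("utf-8")
        self.bytes_serialized += len(raw)
        self.blobs[key] = zlib.compress(raw)

    def get(self, key):
        return json.loads(zlib.decompress(self.blobs[key]))


whole, per_key = WholeCacheStore(), PerKeyStore()
payload = list(range(1000))  # stand-in for one fetched result
for i in range(50):
    whole.put(f"cset_{i}", payload)
    per_key.put(f"cset_{i}", payload)
```

With n equally sized entries, the whole-cache variant serializes on the order of n^2 / 2 entries in total, while the per-key variant serializes n. Whether this is actually the bottleneck in the app would still need to be measured.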

Siggie will try to complete some of these things and take a stab at others:

  1. Test: Behavior w/out cache
  2. Implement: "3. Hiding pre-set concept types (RxNorm extension)". See https://github.com/jhu-bids/TermHub/issues/547
  3. Investigate: Any bugs / performance bottlenecks? and maybe:
  4. Try: "6. Don't cache very large concept sets"
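For the "don't cache very large concept sets" option, a minimal sketch might look like the following. The class name `SizeLimitedCache` and the 10,000-concept threshold are assumptions made for illustration; a real cutoff would come from the kind of manual testing discussed earlier in this thread.

```python
MAX_CACHEABLE_CONCEPTS = 10_000  # hypothetical threshold, not a measured value


class SizeLimitedCache:
    """In-memory cache that refuses entries above a size limit."""

    def __init__(self, max_concepts=MAX_CACHEABLE_CONCEPTS):
        self.max_concepts = max_concepts
        self._entries = {}

    def put(self, codeset_id, concepts):
        # Skip caching for very large csets; the caller refetches them on demand.
        if len(concepts) > self.max_concepts:
            return False
        self._entries[codeset_id] = concepts
        return True

    def get(self, codeset_id):
        return self._entries.get(codeset_id)


cache = SizeLimitedCache()
small_cached = cache.put(1, list(range(500)))
# Roughly the size of the 75k-expression cset tried above.
large_cached = cache.put(909552172, list(range(75_001)))
```

The trade-off is that a very large cset gets fetched again on every visit, which only helps if caching (rather than fetching or rendering) is where the time and memory go.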

The rest of the strategies after that will take some time to implement.

joeflack4 commented 10 months ago

We decided to create a test_n3c_no_rxnorm schema.

Lots of relevant crossover concerns with the octopus graph (see October 6 notes).