graph-genome / Schematize

Visualization component of Pangenome Schematics for 1,000s of individuals and gigabase genomes.
http://graphgenome.org
Apache License 2.0

Pixel canvas for matrix scalability #89

Open josiahseaman opened 4 years ago

josiahseaman commented 4 years ago

After completing #87 we tested performance in the browser and noticed that while file loading is fairly fast, a fresh render still takes 2-4 seconds. It's also fairly common to get "waiting for unresponsive page" warnings from Chrome when using the 200-individual SARS-CoV-2 dataset. I think the MatrixCells are slowing the browser down.

Here's some napkin math:

  1. Matrix Cells (squished down) are 5x2 pixels
  2. Screen is 1920 x 1280
  3. (1920 / 5) * (1280 / 2) = 245,760 MatrixCell objects to create, render and track mouse movements
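
The estimate above can be checked with a few lines. This is only a sketch of the arithmetic; `cellsOnScreen` is a hypothetical helper name, not something in the codebase.

```typescript
// Hypothetical helper: number of cells that fit on a fully saturated screen.
function cellsOnScreen(
  screenW: number,
  screenH: number,
  cellW: number,
  cellH: number
): number {
  return Math.floor(screenW / cellW) * Math.floor(screenH / cellH);
}

// 5x2 pixel cells on a 1920 x 1280 screen.
const cellCount = cellsOnScreen(1920, 1280, 5, 2);
console.log(cellCount); // 245760
```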

While React elements are more responsive than HTML elements, which cap out around 5,000 - 10,000, having roughly 25 to 50 times that number of elements is surely a problem. (More testing needed?)

The Solution

One ComponentRect can do the mouse tracking and mouseover text for all cells inside it. With the sparse format we may need a second lookup table, but it is doable. Instead of containing MatrixCell elements, the ComponentRect would hold a single pixel canvas texture that is painted with the appropriate coordinates and colors. All the color logic of MatrixCell would move to this canvas. Link Columns would remain unaltered.
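
As a rough illustration of the idea, here is a minimal sketch of the two pieces: painting every cell into one pixel buffer, and turning a mouse position back into a cell. The names `CellGrid`, `paintCells`, `cellAt` and the `colorOf` callback are hypothetical and not part of the current code. The sketch assumes a dense grid of equally sized cells, so the sparse format would still need the extra lookup mentioned above.

```typescript
// RGBA color, one byte per channel.
type Rgba = [number, number, number, number];

interface CellGrid {
  rows: number;
  cols: number;
  cellW: number; // pixels per cell horizontally (5 in the napkin math)
  cellH: number; // pixels per cell vertically (2 in the napkin math)
}

// Paint every cell into one flat RGBA buffer (4 bytes per pixel, row-major),
// which is the layout a canvas ImageData object uses.
function paintCells(
  grid: CellGrid,
  colorOf: (row: number, col: number) => Rgba | null // null means empty cell
): Uint8ClampedArray {
  const width = grid.cols * grid.cellW;
  const height = grid.rows * grid.cellH;
  const pixels = new Uint8ClampedArray(width * height * 4); // all transparent
  for (let row = 0; row < grid.rows; row++) {
    for (let col = 0; col < grid.cols; col++) {
      const color = colorOf(row, col);
      if (color === null) continue;
      for (let dy = 0; dy < grid.cellH; dy++) {
        for (let dx = 0; dx < grid.cellW; dx++) {
          const x = col * grid.cellW + dx;
          const y = row * grid.cellH + dy;
          pixels.set(color, (y * width + x) * 4);
        }
      }
    }
  }
  return pixels;
}

// Map a mouse position (relative to the canvas) back to a cell, so a single
// handler on the parent can serve every cell inside it.
function cellAt(
  grid: CellGrid,
  x: number,
  y: number
): { row: number; col: number } | null {
  const col = Math.floor(x / grid.cellW);
  const row = Math.floor(y / grid.cellH);
  if (row < 0 || col < 0 || row >= grid.rows || col >= grid.cols) return null;
  return { row, col };
}

// Example: a 2 x 3 grid where only the diagonal cells are filled.
const grid: CellGrid = { rows: 2, cols: 3, cellW: 5, cellH: 2 };
const pixels = paintCells(grid, (row, col) =>
  row === col ? [0, 0, 0, 255] : null
);
console.log(pixels.length);      // 240 (15 x 4 pixels, 4 bytes each)
console.log(cellAt(grid, 7, 3)); // { row: 1, col: 1 }
```

In the browser the buffer could be wrapped in an `ImageData` and drawn with the 2D context's `putImageData`, and the mouseover text would come from whatever `cellAt` returns. That replaces hundreds of thousands of elements with one canvas and one mouse handler per ComponentRect.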

Drawbacks

This is a fair amount of development to get the last bit of scalability for screen saturation. It could be argued that row filtering, or picking unique examples on the current screen, is more meaningful than cramming as much content onto the screen as possible. Filtering and representation features would be useful development in their own right, apart from performance.

josiahseaman commented 4 years ago

Filter for unique individuals on the Screen

For our target application on SARS-CoV-2 we really care about showing that a variant is present somewhere in the dataset. We care less about its frequency, since if it's positively selected the variant can grow exponentially. In that regard, showing an extra row that has the exact same column occupancy as a previous row adds nothing to the visualization. This is getting towards the intent of Vertical Compression and "Show only Rearrangements" but at a lower level. "Hide redundant rows" could be a very useful toggle.

How would such a thing be implemented? Hash maps come to mind. A hash based exclusively on the matrix content could serve as the key, while the value would be the list of rows that match. Then you only need one visible row per key, and the mouseover "individual" could return a list of matches. Also, if you want to scale by frequency (again), you can just use the length of a value as the height multiplier.
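
A minimal sketch of that grouping, assuming each row is available as an array of booleans marking column occupancy. `RowGroup` and `groupRedundantRows` are hypothetical names, and the real internal format is different (and sparse).

```typescript
interface RowGroup {
  representative: number; // index of the single row that gets drawn
  members: number[];      // every row index with identical occupancy
}

// Group rows by column occupancy. The key is built from matrix content
// only, so rows with the same occupancy collapse into one entry.
function groupRedundantRows(occupancy: boolean[][]): Map<string, RowGroup> {
  const groups = new Map<string, RowGroup>();
  occupancy.forEach((row, index) => {
    const key = row.map((present) => (present ? "1" : "0")).join("");
    const group = groups.get(key);
    if (group) {
      group.members.push(index);
    } else {
      groups.set(key, { representative: index, members: [index] });
    }
  });
  return groups;
}

// Rows 0 and 1 are identical, so only two rows need to be visible.
const groups = groupRedundantRows([
  [true, false, true],
  [true, false, true],
  [false, true, true],
]);
console.log(groups.size); // 2
```

Using the full occupancy string as the key, rather than a numeric hash of it, avoids collisions at the cost of longer keys. `members.length` gives the frequency for the height multiplier, and `members` is the list the mouseover would report.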

Drawbacks: The main issue is that nucleotide positions will almost certainly not line up, so they'd be excluded from the hash and stored either as a separate value in the hash map or in a second lookup table. This complicates the mouseover code, but it's not impossible. The internal representation of this data structure is very different from the current one, so this would take a lot of development hours.