hms-dbmi / viv

Library for multiscale visualization of high-resolution multiplexed bioimaging data on the web. Directly renders Zarr and OME-TIFF.
http://avivator.gehlenborglab.org
MIT License
282 stars 45 forks source link

image load failures due to bad abort support in geotiff #604

Open acthp opened 2 years ago

acthp commented 2 years ago

Describe the bug Image tile loading fails after some use of the UI. Some tiles fail to load, and zoom/pan does not resolve the issue.

I can avoid the bug from viv by removing the abort signal in the calls to geotiff. This might be a good workaround for viv until geotiff is fixed. However, the blockedsource.js file in geotiff looks extremely broken. I suspect there are several more race conditions due to incorrect use of async/await, but the abort code is the easiest one to hit from avivator.

To Reproduce Load ome tiff in avivator. Load a new channel. While channel is loading, zoom and pan the image. Eventually geotiff will throw an AggregateError, and some tiles will never load.

Expected behavior Tile loading continues to work

Screenshots

Environment:

ilan-gold commented 2 years ago

I am unable to reproduce. Are you using the demo site? This was a known issue (to me) but was able to fix it by updating geotiff (we contribute to that repo). Could you provide a screen video or something?

ilan-gold commented 2 years ago

Specifically, if you're not using the demo site, make sure to set up cacheSize: 0 as documented - I've seen this error before when not setting this.

acthp commented 2 years ago

Here's a screencast of getting the AggregateError on the demo site.

avivator-error

Image link https://avivator.gehlenborglab.org/?image_url=https://xenapublicimages.s3.amazonaws.com/MEL08-1-1.ome.tif

ilan-gold commented 2 years ago

I will look into this then. I suspect it has to do with a recent PR we made there. This used to be a problem but I was nearly 100% certain I had fixed it in a local copy of geotiff we had - we then merged that copy into the real thing but it's definitely possible something was lost in translation. Yes, that section of the code is a mess - any help there would be appreciated. It's not completely unapproachable, just takes some debugging.

ilan-gold commented 2 years ago

As an aside, I think there may be a CORS issue with Firefox and your bucket.

EDIT: I am actually having trouble even with localhost i.e the local dev version of Viv....

acthp commented 2 years ago

Regarding blockedsource.js, the code looks very confused, but maybe I don't understand the intent. So, my analysis could be wrong.

The first major issue is that it inspects the cache, then releases the cpu (by use of await), then pulls from the cache. This can't work reliably because the cache is dynamic, it changes over time. The values cached before await may not be there after await. Every use of await introduces another race condition. The code should instead pull what it can from the cache at the top of fetch(), and hold on to those values. That should be the only use of cache in fetch(). It should hold on to the requests it makes, rather than expecting them to appear in cache later, because, again, the cache is dynamic: every addition to cache can remove something from cache, so you can't wait on multiple things and then expect them to all be there.

The second issue is the abort code. I suspect the intent of the API is that the same tile will not be fetched twice if requested more than once in a short time interval. Without reference counting the number of requests on the same tile, I don't believe there's a reliable way to support abort, because with no count there's no way to know if the tile is still needed. The current code attempts to work around this by retrying aborted tiles, but this introduces races on this.abortedBlockIds.

I suspect the desired functionality requires many fewer state variables than the current implementation. I could try reworking it, but I would probably do a clean rewrite rather than trying to fix what's there.

ilan-gold commented 2 years ago

I would be open to that @acthp - I held some similar suspicions of the code but could not have wrote it clearly like you did. Thanks so much - if you don't implement it, I will probably take a stab at it but community help is always appreciated!

As for the abort code, this comes from deck.gl (although I helped, it was not something I requested). The intent is less fetching the same tile twice than cancelling tiles that have been "passed through" as you zoom in. I don't think the same tile is requested twice in general use cases (and if it has been loaded, is cached by deck.gl) although I suppose if you zoom in and out on the same tile, this could happen where the first request is cancelled - this wasn't really the intention for the feature I imagine (since this isn't too common an operation).

acthp commented 2 years ago

Ok, I'll see what I can put together in the next few days.

Sorry about the CORS issue on that image. If you want to test it I believe localhost:8080 is allowed. I can open it up more if you need it.

ilan-gold commented 2 years ago

Ah, we use port 3000 for development. I'll circle back to this on Friday to check in. Thanks!

ilan-gold commented 2 years ago

Have you had a chance to look at this @acthp? Otherwise I'll look at it early next week.

acthp commented 2 years ago

Yes, though I'm mostly focused on writing tests atm. I think there are three groups of bugs: races on the cache, races on the abort, and a suspected off-by-one that leads to an exception when the request range matches the block size. Trying to shore up the tests to make sure there's not more.

ilan-gold commented 2 years ago

Amazing, thank you! The last one sounds horrifying, been in a similar place with this library!

Dev-Lan commented 1 year ago

Hi all, I'm curious what the current state of this issue is? I am working on using parts of VIV code (specifically the loader/ImageLayer) and I've run into a similar issue that I currently suspect has the same root cause.

If it is helpful I can share/explain what I am running into.

acthp commented 1 year ago

@Dev-Lan I rewrote the fetch code in geotiff, which resolved the various races we were experiencing. Code is here:

https://github.com/ucscXena/geotiff.js/commits/ucsc-beta

I haven't kept it up to date. I may come back to it in the future, but currently it's not clear if we will continue with viv, due to performance concerns.

ilan-gold commented 1 year ago

@Dev-Lan Could you expand on the issue? @acthp Could you also expand on your performance concerns, even if you are using a fork? Is this about tiff or the library? We do support OME-NGFF.

I am wondering if it is time to discuss a plan for "adopting" the geotiff.js project. I know there are other folks out there who are interested in this project but also find it problematic because the lead developer responds extremely slowly to things.

ilan-gold commented 1 year ago

There was also https://github.com/geotiffjs/geotiff.js/pull/336 which may help alleviate your issues. Have you tried this? It purports to solve some issues.

ilan-gold commented 1 year ago

This shouldn't be a fix, because we should be setting the cache size to be infinite, but maybe it has a side-effect?

acthp commented 1 year ago

@ilan-gold Re: performance, I haven't been in this code for awhile so my recollection is probably wrong in places. But as I remember it, the main issue we've had is lag in rendering tiles. We've also had problems with the main thread being blocked, freezing the whole UI.

When I looked into it, it appeared from back-of-the-envelope calculations that it was fetching a lot more data than it should need to render a given view. It canceled requests much less frequently than I would expect. For example, when zooming and panning rapidly I would expect to see cancellations for views that were merely passed through, but I rarely saw cancelled requests. So, most of the time was spent waiting for tiles that weren't used. To bring the data sizes down we tried some compressed formats, but that would lock up the UI for seconds at a time. It appears that the decompress is happening on the main thread. Also, the javascript decompression performance was poor, and pretty much negated any potential gain from the compression.

While troubleshooting these issues I felt that the code complexity was quite a bit beyond what was necessary for the problem, which made performance analysis difficult. Some of that is inherited from the deckgl and geotiff APIs. E.g. if I recall correctly the tile requests are driven from deckgl, so analysis of tile fetch and cancel require digging into that code. Geotiff implements BlockedSource, but the underlying image is already tiled, so there are two different chunking mechanisms working on top of each other. I'm not sure this is necessary.

We think it might be easier for us to build a system that has just a few abstractions, and fetches tiles that are individual, pre-computed, compressed image files so the browser can render them natively in the C++. If we do try to continue rendering directly from the pyramid-style image files, at minimum we would put the decompress on a separate thread, and would likely move the code to webassembly.

Again, sorry for any inaccuracies. I hope to get back to this in a few weeks.

Dev-Lan commented 1 year ago

I'm experiencing two of these, though potentially exposed in a slightly different way. I have a tiff stack of low resolution time-varying data. Panning/zooming does not trigger any fetches, but scrubbing through time does.

We've also had problems with the main thread being blocked, freezing the whole UI.

Right now, this is my biggest issue. This is somewhat random, but can be triggered by 2 quick updates causing 2 fetches for images. The whole page becomes unresponsive.

I would expect to see cancellations for views that were merely passed through

This is a smaller issue since I can work around it with the UI, but was surprising. Again, if I scrub through say 5 frames quickly, if deckgl or fetching can't keep up I would expect subsequent changes to abort/override previous changes, and get a single visual update with the 5th frame. Instead, when I'm done scrubbing all 5 frames show in sequence.

Like I said, this is fixable with a debounce or throttle, but I did try adding imageLayer.value?.state?.abortController?.abort(); before I make a new deckgl.setProps call. This did result in the UI behavior I was expecting, though it triggered new console errors and did not fix the crash issue.

It is also definitely possible I'm using something wrong or at least unexpectedly. I'm using vue, and have had the most success so far using plain js deckgl, and using the loaders and I'm layer classes from viv.

My code is at https://github.com/visdesignlab/aardvark. 90% of the relevant code is at https://github.com/visdesignlab/aardvark/blob/main/src/components/ImageViewer.vue, and a little bit is in https://github.com/visdesignlab/aardvark/blob/main/src/stores/imageViewerStore.ts

I can also provide videos or meet virtually if that would be useful in any way.

Dev-Lan commented 1 year ago

Hi again! I'm getting back into my project using this code. I was wondering if there have been any developments on this issue? As is I can still easily cause the whole page to freeze from loading subsequent frames too quickly.

I have tried swapping out the geotiff code from the fork in the previous comment, but haven't been successful in getting it to build.

  1. I've tried using "overrides" in my package.json — this doesn't work for monorepos.
  2. I've made a fork of viv with the updates and tried pointing to that github repo directly, I couldn't get this to build either.
  3. I've tried just referencing the subpackage of my fork for the loaders package using (https://gitpkg.vercel.app/), this is failing to import the geotiff from the github fork.
ilan-gold commented 1 year ago

As I see it there are two issues:

  1. There is a problem where doing things "too quickly" causes outright errors. This can be solved by using (from what I remember) https://github.com/ucscXena/geotiff.js/commits/ucsc-beta which should probably be merged with the main branch of geotiff. However the maintainer of that repo can be a bit spotty. I would not be opposed to moving on to a fork again but we would need to collectively test this out.
  2. Interactions that are too quick cause lags/delays/incorrect loading, but not errors. This likely comes from deck.gl. So if you want that to improve, I think that is the place to look into things.

Something you can do is use geotiff's Pool to do decoding on a worker thread if you are building your own application and can figure out the bundling (if I recall that is why it is disabled in Avivator). It should be documented in our API. As to how to do that.

What @acthp suggests in his last comment (C++ with precomputed tiles) sort of goes against what this repo does so if you really need that performance, then that might be the route to go. But if you're pre-computing compressed tiles, I suspect that our repo's performance will actually be fine since the computation cost of decompression will go down substantially as to not break anything.

ilan-gold commented 1 year ago

P.S @Dev-Lan if you're issue is just bundling a geotiff fork, we can try to help but will need a small reproducible example.

Dev-Lan commented 1 year ago

Right now, I'm just trying to fix the first issue. I certainly would appreciate any advice if you have it regarding the bundling.

Of my three approaches described above, I think 1 is a dead-end. For both option 2 and 3 I am using a fork I created from VIV that references the ucsc geotiff fork.

Option 2: load viv monorepo from my fork Steps: 1.git clone -b test-geotiff-fork-full https://github.com/visdesignlab/aardvark.git

  1. yarn install
  2. yarn run dev
  3. ERROR: ✘ [ERROR] Failed to resolve entry for package "@hms-dbmi/viv". The package may have incorrect main/module/exports specified in its package.json. [plugin vite:dep-scan]

If I look at the viv package in node_modules it is not being built the same way as if I get it from npm. It is clearly not correct as is, but I'm not sure what is needed to get it to build correctly from github.

Option 3: get subpackage for loaders only. Steps:

  1. git clone -b test-geotiff-fork https://github.com/visdesignlab/aardvark.git
  2. yarn install
  3. yarn run dev
  4. ERROR: ✘ [ERROR] Failed to resolve entry for package "geotiff". The package may have incorrect main/module/exports specified in its package.json. [plugin vite:dep-pre-bundle]

Let me know if any additional info would be helpful.

acthp commented 1 year ago

I don't currently have plans to continue with this. Regarding #2, in addition to deck.gl over-fetching, we also saw over-fetching due to inappropriate tile sizes. All of the images we've tested so far have required rebuilding due to various encoding errors, e.g. mixing up dimensions, botching the tiles, etc. This is pretty typical for research data, and it undermines the utility of being able to directly render from these formats. So, we're building conventional pyramids of png files during our image data ingest, and rendering with deck.gl. I will update this thread if I'm able to reduce the deck.gl over-fetching.

re: build, I had to fork viv to inject the patched geotiff. Diffs are here: https://github.com/hms-dbmi/viv/compare/master...ucscXena:viv:ucscxena

There's probably a better way, but I don't know what it is.

Dev-Lan commented 1 year ago

re: build, I had to fork viv to inject the patched geotiff. Diffs are here: master...ucscXena:viv:ucscxena

Thanks for the link, this is a helpful reference! I'll give this another shot.

Edit: So far no luck :'(

Dev-Lan commented 1 year ago

I ended up just making a local copy of just the loaders package from viv and updating it to use ucsc-xena-geotiff instead of geotiff.

This does help, but I can still trigger a "hard crash" of the webpage, where the page becomes unresponsive, and I need to close the tab.

Dev-Lan commented 1 year ago

Alright, "hopefully," last update. I found something that appears to have resolved my issue. I think @ilan-gold's suggestion for using a pool finally clicked. Specifying a pool in the options when I call loadMultiTiff seems to fix the issue.

import { Pool } from 'geotiff';

...

const loader = await loadMultiTiff(
    ...,
    { pool: new Pool() }
);
ilan-gold commented 1 year ago

Yay!

keller-mark commented 1 year ago

@ilan-gold is passing a pool value recommended? I notice we are not doing this in Vitessce - should we be?

ilan-gold commented 1 year ago

I think perhaps the reason for this was the workers on the heatmap but that is something of a moot point now since it almost certainly will not need to use workers anymore for spatial data (heatmaps for spatial data tend to be smaller).