qupath / qupath-extension-wsinfer

QuPath extension to work with WSInfer - https://wsinfer.readthedocs.io/
Apache License 2.0

Batch inference issues when ROI close to image edge #45

Closed xalexalex closed 11 months ago

xalexalex commented 1 year ago

When the ROI for inference is close to (or touches) the image edge, I get the error shown in the attached screenshot.

I can reproduce it on both the OpenSlide and Bio-Formats backends. With the new version of the wsinfer extension (v0.3.0), if I set batch size = 1 I don't get the error, but whenever inference encounters an edge tile it is markedly slower (I can see the hiccups in the progress bar: at first once every few batches [because only some batches end with a bottom-edge tile], and then on every batch [since on the last column, each batch contains an edge tile]).

With batch size > 1, if I look at the detections after the error, I can see detections up to but excluding the first batch containing an edge tile.

example video showing the behavior with batch size 4 and then 1.

kaczmarj commented 1 year ago

thanks for the bug report. the video is very helpful!

i wonder if the hiccups are caused by padding the images to get them to the required size.

but this also brings up a concern for me. the images should be 224x224 after resizing. how is an image of 188x188 attempting to be batched with a 224x224 image, if all of the images should be resized to 224x224? is the 188x188 not undergoing the resizing for some reason? (but of course, a truncated patch should not be upscaled to 224x224, because the physical spacing will be wrong).

also i am wondering why the image is a square (188x188) as opposed to a rectangle where one direction is shorter than the other direction. (although 188x188 could be the bottom right corner patch).

if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?
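For illustration, the constant-color and mirror options above can be sketched with NumPy's `np.pad` on a toy single-channel patch (values and shapes here are illustrative, not the extension's actual code):

```python
import numpy as np

# Toy 2x3 patch, padded by one row (bottom) and one column (right),
# as would happen for a tile extending past the slide boundary.
patch = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Option A: pad with a constant color (here 255, i.e. white for 8-bit images)
white = np.pad(patch, ((0, 1), (0, 1)), constant_values=255)

# Option B: mirror the patch into the padded region
mirror = np.pad(patch, ((0, 1), (0, 1)), mode="reflect")

print(white)
print(mirror)
```

The third option (excluding out-of-bounds patches entirely) needs no padding at all, just a bounds check before the patch is read.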

by the way @xalexalex - how do you record your videos? i would like to do something similar :)

xalexalex commented 1 year ago

i wonder if the hiccups are caused by padding the images to get them to the required size.

I thought the same thing, but it seems to take a bit too much time for a simple padding. I suspect something else is going on.

also i am wondering why the image is a square (188x188) as opposed to a rectangle where one direction is shorter than the other direction. (although 188x188 could be the bottom right corner patch).

Good catch. Are we perhaps assuming that width == height? In the case in the video, the tile that triggers the error was a bottom-edge tile, so width > height; it was definitely not the bottom-right corner tile.

if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?

I will enumerate possible solutions from quick & hacky to thoughtful & expensive:

  1. simply discard the edge tile (might be a temporary quickfix.)
  2. if ROI intersects image boundary, tile from the image boundary backwards, so that the last tile is level with the image boundary. if ROI spans the whole width (or height) of the image and thus you can't avoid this problem on both sides, fall back to (1).
  3. pad with something standard, e.g. white (or qupath's white for the given image)
  4. leave the choice to the user in the config.json of each model, with (3) being the default

I would vote for 1, or at most 2. But I think 1 will solve this problem quickly and nobody will ever complain. In my WSIs edge tiles are always uninformative, and this error is simply a hindrance.
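Option (2) above, tiling backwards from the boundary with (1) as a fallback, could be sketched like this (the function and parameter names are hypothetical, not the extension's API):

```python
# Hypothetical sketch of option (2): shift any tile that would overhang
# the image so it ends exactly at the boundary, falling back to
# option (1) (discard) when the tile cannot fit at all.

def tile_origins(roi_x, roi_w, tile_w, img_w):
    """Return tile x-origins, clamping the last tile to the image edge."""
    origins = []
    for x in range(roi_x, roi_x + roi_w, tile_w):
        if x + tile_w > img_w:
            x = img_w - tile_w  # option (2): tile flush with the boundary
        if x < 0:
            continue            # option (1): tile cannot fit, discard it
        if origins and origins[-1] == x:
            continue            # skip a tile already clamped to this origin
        origins.append(x)
    return origins

print(tile_origins(0, 1000, 224, 1000))  # [0, 224, 448, 672, 776]
```

The same logic would apply per axis for y, so a bottom-edge tile is only shifted vertically.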

by the way @xalexalex - how do you record your videos? i would like to do something similar :)

This is going to be very crude, but:

xalexalex commented 1 year ago

Quick update: the error also happens on the left and top edges. So solution (1) seems to be the quickest way to fix this, whereas (2) is not doable until we understand where the actual problem is.

petebankhead commented 1 year ago

if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?

Does the WSInfer Python code have a strategy for this?

To clarify: the QuPath implementation of inference is completely independent. It should agree with whatever is done in Python as much as possible for consistency, but it is difficult to guarantee identical results because some core operations might be implemented differently (e.g. the precise interpolation used when resizing tiles, which can make a big difference).

petebankhead commented 1 year ago

This PR addresses this by using zero-padding: https://github.com/qupath/qupath-extension-wsinfer/pull/47

That is what should have been happening already... I just missed the bug because I was restricted to a batch size of 1 on my Mac (which is no longer a restriction).

Other boundary criteria could be considered if WSInfer handles it differently in Python.

I also added a comment where the tile resizing is applied:

// For example, using the Python WSInfer 0.5.0 output for the image at
// https://github.com/qupath/qupath-docs/issues/89 (30619 tiles):
//  BufferedImageTools Tumor prob Mean Absolute Difference: 0.0026328298250342763
//  OpenCV Tumor prob Mean Absolute Difference:             0.07625036735485102

Basically, the method of interpolation makes a difference in how well the Python and QuPath implementations agree. I believe Python uses bilinear interpolation, and the results quoted above both use bilinear interpolation, just implemented differently, and this alone is enough to cause disagreements.

I think perfect agreement between Python and QuPath would be very difficult to achieve (and require some substantial changes), but this figure gives some idea of the difference. If I use something other than bilinear interpolation in QuPath, I see much larger disagreements.

kaczmarj commented 1 year ago

Does the WSInfer Python code have a strategy for this?

wsinfer python pads patches with 0. actually this is an implementation detail of openslide and tiffslide (they will pad with 0 if the patch is at the edge of a slide). i should add tests to wsinfer python that make sure this continues to happen with future versions.
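The behaviour described above can be mimicked with NumPy: a truncated edge patch is zero-padded back to the full tile shape. This is only a sketch; `pad_edge_patch` is an illustrative name, not wsinfer's actual code.

```python
import numpy as np

# A patch read at the bottom edge of a slide comes back truncated
# (here 188 rows instead of 224) and is padded with zeros on the
# bottom/right to reach the full (tile_h, tile_w, channels) shape.

def pad_edge_patch(patch, tile_h, tile_w):
    """Zero-pad a truncated edge patch up to (tile_h, tile_w, channels)."""
    h, w, _ = patch.shape
    return np.pad(patch, ((0, tile_h - h), (0, tile_w - w), (0, 0)))

truncated = np.ones((188, 224, 3), dtype=np.uint8)  # bottom-edge tile
full = pad_edge_patch(truncated, 224, 224)
print(full.shape)  # (224, 224, 3)
```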

I think perfect agreement between Python and QuPath would be very difficult to achieve

i agree, and i believe it shouldn't be our goal to achieve perfect agreement. by the way, the bilinear resampling in wsinfer/python is performed by Pillow.
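A minimal sketch of that resize step with Pillow (the 963 to 224 sizes come from the logs in this thread; the snippet is illustrative, not wsinfer's actual code):

```python
from PIL import Image

# Pillow's bilinear resampling brings a patch to the model's
# expected input size, e.g. a 963x963 tile down to 224x224.
patch = Image.new("RGB", (963, 963), color=(255, 255, 255))
resized = patch.resize((224, 224), Image.BILINEAR)
print(resized.size)  # (224, 224)
```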

xalexalex commented 1 year ago

@petebankhead I tested the current HEAD and unfortunately the issue isn't fixed for me. Could anyone else check?

I get this error:

Successful run without edge tiles:

19:39:56.648 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Running prost-latest for 80 tiles
19:39:58.730 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Finished 80 tiles in 2 seconds (26 ms per tile)

Run that errors out on edge tile:

Nov 13, 2023 7:40:03 PM javafx.fxml.FXMLLoader$ValueElement processValue
WARNING: Loading FXML document with JavaFX API of version 20.0.1 by JavaFX runtime of version 20
19:40:03.876 [wsinfer1] [WARN ] ai.djl.repository.SimpleRepository - Simple repository pointing to a non-archive file.
19:40:03.978 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Running prost-latest for 75 tiles
19:40:04.192 [wsinfer-tiles1] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (12189, -51, 963, 963)
19:40:04.216 [wsinfer-tiles2] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (14115, -51, 963, 963)
19:40:04.219 [wsinfer-tiles4] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (13152, -51, 963, 963)
19:40:04.294 [wsinfer-tiles1] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (21819, -51, 963, 963)
19:40:04.519 [wsinfer-tiles2] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (22782, -51, 963, 963)
19:40:04.707 [wsinfer-tiles3] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (23745, -51, 963, 963)
19:40:04.740 [wsinfer-tiles3] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (24708, -51, 963, 963)
19:40:04.808 [wsinfer1] [ERROR] qupath.ext.wsinfer.WSInfer - Error running model prost-latest
ai.djl.translate.TranslateException: java.lang.IllegalArgumentException: You cannot batch data with different input shapes(3, 963, 963) vs (3, 224, 224)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:193)
        at qupath.ext.wsinfer.WSInfer.runInference(WSInfer.java:243)
        at qupath.ext.wsinfer.ui.WSInferController$WSInferTask.call(WSInferController.java:553)
        at qupath.ext.wsinfer.ui.WSInferController$WSInferTask.call(WSInferController.java:499)
        at javafx.concurrent.Task$TaskCallable.call(Task.java:1426)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: You cannot batch data with different input shapes(3, 963, 963) vs (3, 224, 224)
        at ai.djl.translate.StackBatchifier.batchify(StackBatchifier.java:83)
        at ai.djl.inference.Predictor.processInputs(Predictor.java:300)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:181)
        ... 8 common frames omitted
Caused by: ai.djl.engine.EngineException: stack expects each tensor to be equal size, but got [3, 224, 224] at entry 0 and [3, 963, 963] at entry 3
        at ai.djl.pytorch.jni.PyTorchLibrary.torchStack(Native Method)
        at ai.djl.pytorch.jni.JniUtils.stack(JniUtils.java:626)
        at ai.djl.pytorch.engine.PtNDArrayEx.stack(PtNDArrayEx.java:662)
        at ai.djl.pytorch.engine.PtNDArrayEx.stack(PtNDArrayEx.java:33)
        at ai.djl.ndarray.NDArrays.stack(NDArrays.java:1825)
        at ai.djl.ndarray.NDArrays.stack(NDArrays.java:1785)
        at ai.djl.translate.StackBatchifier.batchify(StackBatchifier.java:54)
        ... 10 common frames omitted

kaczmarj commented 1 year ago

thanks @xalexalex

it appears that these patches are not being resized for some reason... i'm not sure what would cause this.

ai.djl.translate.TranslateException: java.lang.IllegalArgumentException: You cannot batch data 
with different input shapes(3, 963, 963) vs (3, 224, 224)

petebankhead commented 1 year ago

This works for me with the zoo models on both Windows and Mac.

@xalexalex can you specify which model you are using, or share the config.json?

Two explanations I can think of:

  1. You've still got an 'old' version of the extension installed, and QuPath is using it instead
  2. All the zoo models contain a 'resize' transform in the config.json... which may be required in the current implementation

@kaczmarj I notice that https://huggingface.co/kaczmarj/pancancer-lymphocytes-inceptionv4.tcga/blob/main/config.json contains a patch size of 100 but resizes to 299... is this intended?

kaczmarj commented 1 year ago

@kaczmarj I notice that https://huggingface.co/kaczmarj/pancancer-lymphocytes-inceptionv4.tcga/blob/main/config.json contains a patch size of 100 but resizes to 299... is this intended?

i realize it looks odd, but it is intended. i triple-checked the original implementation (https://github.com/ShahiraAbousamra/til_classification).

xalexalex commented 11 months ago

2. All the zoo models contain a 'resize' transform in the `config.json`... which may be required in the current implementation

This was indeed true. Adding a resize transform in config.json did the trick. Now both zoo models and my custom models work flawlessly. Thanks!
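For anyone hitting the same issue with a custom model, a hedged sketch of what the relevant `transform` section of config.json might look like; the field names and the mean/std values here are illustrative placeholders and should be checked against a working zoo model's config.json rather than copied verbatim:

```json
{
  "transform": [
    { "name": "Resize",    "arguments": { "size": 224 } },
    { "name": "ToTensor" },
    { "name": "Normalize", "arguments": { "mean": [0.5, 0.5, 0.5],
                                          "std":  [0.5, 0.5, 0.5] } }
  ]
}
```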

kaczmarj commented 11 months ago

This was indeed true. Adding a resize transform in config.json did the trick. Now both zoo models and my custom models work flawlessly. Thanks!

fantastic! glad it is working