SpotFinding in starfish

This issue pulls together a series of investigations into how to support spot-finding and decoding approaches used by SpaceTx labs. I've chronicled some proposed changes and open questions. It is a work in progress that would greatly benefit from user and developer feedback.

science cc @kevinyamauchi @berl @sofroniewn @gapartel tech cc @ttung @shanaxel42 @kne42

Addresses issues brought up in #380 #874 #373 #708, #499, #525, #759 and #908

Definitions

Spot: In the context of image-based transcriptomics, a spot is an ellipsoid blob that is the result of capturing light generated by a fluorophore that has been intentionally and specifically bound to an RNA molecule to visualize the molecule's spatial location.

Blob finding: blob finding is the process of detecting spots an n-dimensional image. Blob finding is accomplished by integrating the intensity of an image over a gaussian kernel, returning the coordinates of bright blobs on dark backgrounds. The shape and sign of the kernel preferentially detects ellipsoid blobs equal in width to the kernel.

Region of interest (ROI) detection: ROI detection is a more general approach that does not require the region to have a specific shape. In image-based transcriptomics there are two uses of ROI detection: instance-based segmentation of cells and spots. Spot ROI detection is currently accomplished in starfish by connecting adjacent pixels together when (1) the decode to the same gene and (2) they meet a minimum intensity criteria.

Blob finding

At present, blob finding is carried out by three methods in starfish:

BlobDetector. This method enables users to employ a Laplacian of Gaussians filter that enhances spots (bright circular objects with dark backgrounds) that match the standard deviation of the gaussian filter, and uses the skimage local_max_peak_finder approach on the Laplacian filtered image to identify the resulting spots that are brighter than a specified threshold value. It will attempt to find the standard deviation that best fits a given spot, selecting from a range of values passed by the user. It also enables users to approximate the Laplacian filtered image using a Difference of Gaussians or Determinant of Hessians approximation to save time.
LocalMaxPeakFinder: This method applies a Laplacian filter with a specific standard deviation to an image, testing numerous different thresholds. Like BlobDetector above, it uses local_max_peak_finder: to identify spots. The method selects an appropriate threshold by identifying a stable regime where modifying the threshold does not change the number of spots detected.
TrackpyLocalMaxPeakFinder: This method finds gaussian blobs exceeding a specific mass (integrated brightness of Gaussian kernel). Unlike the above approaches, it does not pre-process the data with a Laplacian of Gaussians to enhance spots. In addition, it uses an iterative refinement to identify center of mass for the blobs following the Gaussian integration, which improves the ability of this method to find atypically shaped spots with off-center mass. It also adds a series of filtering steps, including: total integrated brightness, removal of (spurious) close peaks resulting from flat blobs via a gray_dilation, The ability to specify return of the top_n peaks, a percentile filter similar to that used by BlobDetector, and (unused by our users) a bandpass pre-processing step to remove low and high frequency noise.

Redundancies in the current implementation

starfish's three spot finding methods each expose a peak_local_max call (or similar, in case of trackpy) over an optionally pre-filtered image (bandpass, laplacian of gaussians, difference of gaussians, determinant of hessians), enable a divergent set of parameter optimizations (sigma, min_intensity) and expose a variety of partially-overlapping filtering options over the results. Originally, these methods were included because:

BlobFinder was the first spot caller to be added. starfish needed to call spots!
A group wanted to be able to tune the threshold for a Laplacian of Gaussian blob finder similar to what was proposed by Arjun Raj (added LocalMaxPeakFinder).
These approaches needed to be done in 3d, which was not supported by BlobFinder at the time (added TrackpyLocalMaxPeakFinder).

Interaction of spot finding and decoding

Starfish aims to support a variety of assays that have different data characteristics. One important task shared by some assays is to match spots across tiles acquired at different times (round) or fluorescence wavelengths (channel) Some meaningful characteristics that influence the difficulty of this task are:

The distance between spots, or "crowding". Closer spots are harder to distinguish, and it can be necessary to leverage the sparse appearance of spos across rounds and channels to distinguish two partially overlapping spots (in some (round, channel) combinations. Often only one spot will be present, while when both are present, spots will not be distinguishable.
Alignment of the spots across tiles. Different assays have different amounts of drift of the sample during image acquisition, resulting in systematic (but not necessarily linear) shifts in the locations of the spots across tiles. In these cases, it is necessary to account for this with some combination of registration (whole sample shifting), nearest-neighbors search (local shifting), or blurring (to spread spots into larger areas). In particularly crowded cases (SeqFISH) it may be necessary to seed local search from different (round, channel) combinations, whereas some approaches are so well aligned that they do not even attempt to call spots in all tiles -- they simply measure the intensity at the location of spots identified in one tile in all subsequent tiles, saving significant computation time.

Proposed changes to Spot Detection

[ ] Identify the set of useful spot filters (note: not image filters) across the three spot finding approaches that are requested by our users, implement them in BlobFinder, then remove TrackpyLocalMaxPeakFinder which is otherwise redundant.
[ ] Add automated threshold finding to BlobFinder and remove LocalMaxPeakFinder, which is otherwise redundant.
[ ] Add ability to call spots in all rounds and build traces using KNN local search seeded from a single round (StarMAP use case)
[ ] (Stretch) Enable ability to seed traces from 3. in multiple rounds, select spots that are consistent across seeds (SeqFISH use case).

Implications for starfish's object model

The IntensityTable was built for the original ISS spot detection use case where a blob was detected in a maximum projection and the resulting blob centroid was measured across all (channel, round) tiles. As a result, it is designed to store intensities of a fixed (z, y, x) position. The use case where a local search around a blob from a seed tile is used to find points that are very close by in other (channel, round) combinations makes this object model imprecise, and for these assays the IntensityTable is a lossy structure when compared to a simple list of spots and their characteristics. When visualizing spots this is useful to demonstrate improperly registered data (the spot will be slightly offset in non-anchor rounds) but complicates effectively visualizing the data (same reason as above: spots will be "off"). In addition, there is still some need to retain the SpotAttributes object to store the results of decoding before they are merged with the NN approach.

Pixel-based ROI detection

Starfish currently implements a Pixel ROI detection routine that is designed to identify spots. It begins by reshaping the ImageStack into "pixel traces", which are (round, channel) arrays for each (z, y, x) pixel in the ImageStack. It decodes these pixel traces by comparing each trace to the codebook, which contains expected intensities for the same (round, channel) combinations. The data are then reshaped back into 3D (z, y, x) euclidean space, and pixels that decoded to the same code and exceed a specified minimum intensity are connected into ROIs. ROIs are then filtered based on their pixel count.

Clarifying ROI vs. Spot detection

Pixel-based ROI detection was initially posed as a spot detection approach. However, our implementation does not detect spots, so this is confusing. Conversations with @kevinyamauchi revealed that his excitement around pixel-based analyses was in its potential use as a filter that could convert images that exist in (channel, round, z, y, x) space to images that exist in (gene, z, y, x) space. These converted images could then be subjected to a spot caller, thus incorporating information both about the pixel intensities and the (useful) prior that intensities in our experiments should form spots. This filter would provide clearer peaks in the case of overlapping spots, as the overlap location would have moderate association with both codes, whereas the non-overlapping portion would strongly peak for each code. This would not solve problems where two molecules of the same transcript type overlap in space.

Gene-space decoding filter

Transforming coded (channel, round) combinations into (gene,) space is easy to accomplish by decoding each pixel. In the case of MERFISH, the data is first subjected to a 1-sigma Gaussian blur to spread information and make decoding less sensitive to local noise. This filter could be implemented in multiple ways which have very different memory considerations, depending on the codebook structure.

Option 1: "All genes" encode the distance/similarity of each pixel to each code. A one-hot encoded codebook of {"c": 4, "r": 4} with 16 gene targets takes up the same amount of memory in this encoding. However, a {"c": 2, "r" 8} codebook with 1000 targets (A MERFISH codebook) takes up 1000 / 16 times more memory, which is not practical.

Option 2: "Max genes" encode the maximum similarity across all codes, and store a second label_image that contains the ID of the most similar gene. This always requires only 2 tiles of memory, but loses the ability to compute over the probabilities across genes.

Proposed Changes

[ ] Implement both all-gene and max-gene outputs as options for a single decoding filter. Emit a warning in the all-gene case when forming the filter with a codebook that is > 2x the length of the pixel trace (would more than double memory consumption).
[ ] Refactor the existing spot detector based on connecting similarly decoding components to be an ROI detector that leverages the filters
[ ] Create a new spot detector that pre-filters and runs BlobDetector on the filtered image. [#TODO this requires understanding whether this is simply a filter, or a filter and the BlobDetector]

Learnings for visualization

It would be useful to enable users to construct codebooks that don't reflect true targets, in order to better visualize aberrant codes. Example: lipofuscion could be represented as a code that is "1" in each channel and round. IntensityTables can be decoded multiple times, and the resulting gene annotations can be used to produce separate marker layers. The lipofuscion layer could be used to create black spots that mask those spots across (round, channel) to make more clear which spots are unexplained by the current model.

[ ] Test napari + multiple marker sets in starfish.display.stack

Open Questions to be answered

[x] Is the acentric spot finding ability of trackpy important? Is skimage.feature.peak_local_max flexible enough to support the same blob types?
[ ] Is trackpy faster in 3d than the blog_dog method used in BlobFinder? If so, why?
[ ] Could we implement the per-gene filter such that it can be interleaved with the BlobFinder? This would enable us to compute over images that have been converted to gene space, which could be useful.
[ ] Is the current trade-off in IntensityTable between accuracy and data density acceptable? Do we want to adjust how that object is structured?
[ ] Better understand the work from labs attempting to solve the joint decoding / spot calling problem. This approach feels like a more rigorous solution to calling spots in crowded space (needed for SeqFISH).
[ ] Is there still a requirement for a SpotFinderResult class as proposed in #908? In this new world, the "max-gene" filter would produce an additional result, although the "all gene" one would not require this. How would this extra information be used in downstream starfish components?
[ ] Ensure LoG filter is only being run once in the LocalMaxPeakFinder threshold finding method (only peak_local_max should be necessary to call multiple times (but even that could be optimized!)
[ ] Is there a problem if we decode the an ImageStack in (round, channel, z, y, x) space into (gene, z, y, x) space? One could imagine there is a simple solution where channel is used as a proxy for gene and a (4, 4, z, y, x) array (like ISS) combined with a (codes=16, r=4, c=4) codebook would produce a (1, 16, z, y, x) decoded ImageStack.
[ ] Which proposed changes should be made in starfish, and which should be proposed to skimage?

spacetx / starfish

Unifying Spot Finding #1005