ambrosejcarr closed this issue 5 years ago
This may also address some open concerns in #797 that are related to passing state around in SpotFinder objects.
I don't fully understand the "max genes" approach for the gene-space decoding filter.
I suspect we only need to store the N nearest neighbors for each pixel for gene-space decoding. The remaining genes are probably inconsequential.
This WIP PR goes into the two options in more detail: https://github.com/spacetx/starfish/pull/1060
If you read that and are still confused, let me know.
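The "N nearest" idea from the comment above could be sketched as follows; the shapes, the Euclidean metric, and all variable names are assumptions for illustration, not starfish code:

```python
import numpy as np

# Hypothetical shapes: 1000 codes with 16 (round, channel) features each,
# and 50 pixel traces with the same 16 features.
rng = np.random.default_rng(0)
codes = rng.random((1000, 16))
pixel_traces = rng.random((50, 16))

# Euclidean distance from every pixel trace to every code.
distances = np.linalg.norm(
    pixel_traces[:, None, :] - codes[None, :, :], axis=-1
)  # shape (50, 1000)

# Keep only the N nearest codes per pixel; the remaining genes are
# assumed inconsequential, so everything else can be discarded.
N = 3
nearest = np.argpartition(distances, N, axis=1)[:, :N]  # shape (50, 3)
```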
SpotFinding in starfish
This issue pulls together a series of investigations into how to support spot-finding and decoding approaches used by SpaceTx labs. I've chronicled some proposed changes and open questions. It is a work in progress that would greatly benefit from user and developer feedback.
Science cc: @kevinyamauchi @berl @sofroniewn @gapartel. Tech cc: @ttung @shanaxel42 @kne42.
Addresses issues brought up in #380, #874, #373, #708, #499, #525, #759, and #908.
Definitions
Spot: In the context of image-based transcriptomics, a spot is an ellipsoidal blob of light produced by a fluorophore that has been intentionally and specifically bound to an RNA molecule to visualize that molecule's spatial location.
Blob finding: blob finding is the process of detecting spots in an n-dimensional image. Blob finding is accomplished by integrating the intensity of an image over a Gaussian kernel, returning the coordinates of bright blobs on dark backgrounds. The shape and sign of the kernel preferentially detect ellipsoid blobs equal in width to the kernel.
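A minimal sketch of this definition using scikit-image's Laplacian of Gaussian detector; the synthetic image and parameter values are illustrative only:

```python
import numpy as np
from skimage.feature import blob_log

# Synthetic image: a single bright, roughly Gaussian spot (sigma ~3)
# on a dark background.
yy, xx = np.mgrid[:50, :50]
image = np.exp(-((yy - 25) ** 2 + (xx - 25) ** 2) / (2 * 3.0 ** 2))

# Laplacian of Gaussian detection: the image is filtered at several
# kernel widths and local maxima above `threshold` are kept. The kernel
# width that maximizes the response estimates the blob's width.
blobs = blob_log(image, min_sigma=1, max_sigma=6, num_sigma=6, threshold=0.1)
# Each row of `blobs` is (y, x, sigma).
```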
Region of interest (ROI) detection: ROI detection is a more general approach that does not require the region to have a specific shape. In image-based transcriptomics there are two uses of ROI detection: instance-based segmentation of cells, and spots. Spot ROI detection is currently accomplished in starfish by connecting adjacent pixels together when (1) they decode to the same gene and (2) they meet a minimum intensity criterion.
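The spot ROI procedure just described can be sketched with connected-component labeling. The arrays, thresholds, and loop structure here are hypothetical stand-ins, not starfish's implementation:

```python
import numpy as np
from scipy import ndimage

# Hypothetical decoded outputs for a tiny 2D tile: the per-pixel
# decoded gene id (0 = no gene) and the pixel intensity.
gene_ids = np.array([
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [3, 0, 0, 0],
])
intensity = np.array([
    [0.9, 0.8, 0.1, 0.7],
    [0.9, 0.7, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.8],
    [0.2, 0.1, 0.1, 0.1],
])

min_intensity = 0.5  # criterion (2): minimum intensity
min_area = 2         # final filter on ROI pixel count
rois = []
for gene in np.unique(gene_ids[gene_ids > 0]):
    # criterion (1): adjacent pixels that decode to the same gene
    mask = (gene_ids == gene) & (intensity >= min_intensity)
    labels, n_components = ndimage.label(mask)
    for label in range(1, n_components + 1):
        coords = np.argwhere(labels == label)
        if len(coords) >= min_area:
            rois.append((gene, coords))
```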
Blob finding
At present, blob finding is carried out by three methods in starfish:

- `BlobDetector`: enables users to employ a Laplacian of Gaussians filter that enhances spots (bright circular objects on dark backgrounds) matching the standard deviation of the Gaussian filter, and uses the skimage `peak_local_max` approach on the Laplacian-filtered image to identify the resulting spots that are brighter than a specified threshold value. It attempts to find the standard deviation that best fits a given spot, selecting from a range of values passed by the user. It also lets users approximate the Laplacian-filtered image with a Difference of Gaussians or Determinant of Hessians approximation to save time.
- `LocalMaxPeakFinder`: applies a Laplacian filter with a specific standard deviation to an image, testing numerous different thresholds. Like `BlobDetector` above, it uses `peak_local_max` to identify spots. The method selects an appropriate threshold by identifying a stable regime where modifying the threshold does not change the number of spots detected.
- `TrackpyLocalMaxPeakFinder`: finds Gaussian blobs exceeding a specified mass (the integrated brightness of the Gaussian kernel). Unlike the above approaches, it does not pre-process the data with a Laplacian of Gaussians to enhance spots. It uses iterative refinement to locate each blob's center of mass after the Gaussian integration, which improves its ability to find atypically shaped spots with off-center mass. It also adds a series of filtering steps: a total integrated brightness filter, removal of spurious close peaks resulting from flat blobs via a `gray_dilation`, the ability to return only the `top_n` peaks, a percentile filter similar to that used by `BlobDetector`, and (unused by our users) a bandpass pre-processing step to remove low- and high-frequency noise.

Redundancies in the current implementation
starfish's three spot finding methods each expose a `peak_local_max` call (or similar, in the case of `trackpy`) over an optionally pre-filtered image (bandpass, Laplacian of Gaussians, Difference of Gaussians, Determinant of Hessians), enable a divergent set of parameter optimizations (`sigma`, `min_intensity`), and expose a variety of partially-overlapping filtering options over the results. Originally, these methods were included because:

- `BlobFinder` was the first spot caller to be added. starfish needed to call spots!
- Automated threshold selection was needed (added `LocalMaxPeakFinder`).
- `trackpy` appeared to be faster than `BlobFinder` at the time (added `TrackpyLocalMaxPeakFinder`).

Interaction of spot finding and decoding
Starfish aims to support a variety of assays that have different data characteristics. One important task shared by some assays is to match spots across tiles acquired at different times (`round`) or fluorescence wavelengths (`channel`). Some meaningful characteristics that influence the difficulty of this task are:

- Whether spots are expected to appear in multiple `(round, channel)` combinations. Often only one spot will be present; when both are present, the spots will not be distinguishable.
- How well the tiles are registered. Poorly registered data require searching for spots across `(round, channel)` combinations, whereas some approaches are so well aligned that they do not even attempt to call spots in all tiles -- they simply measure the intensity at the location of spots identified in one tile in all subsequent tiles, saving significant computation time.

Proposed changes to Spot Detection
- Add the useful filtering options from `TrackpyLocalMaxPeakFinder` to `BlobFinder`, then remove `TrackpyLocalMaxPeakFinder`, which is otherwise redundant.
- Add the automated threshold selection from `LocalMaxPeakFinder` to `BlobFinder` and remove `LocalMaxPeakFinder`, which is otherwise redundant.

Implications for starfish's object model
The IntensityTable was built for the original ISS spot detection use case, where a blob was detected in a maximum projection and the resulting blob centroid was measured across all `(channel, round)` tiles. As a result, it is designed to store intensities of a fixed `(z, y, x)` position. The use case where a local search around a blob from a seed tile is used to find points that are very close by in other `(channel, round)` combinations makes this object model imprecise, and for these assays the `IntensityTable` is a lossy structure compared to a simple list of spots and their characteristics. When visualizing spots, this imprecision is useful for demonstrating improperly registered data (the spot will be slightly offset in non-anchor rounds) but complicates effectively visualizing the data (for the same reason: spots will be "off"). In addition, there is still some need to retain the `SpotAttributes` object to store the results of decoding before they are merged with the nearest-neighbor approach.

Pixel-based ROI detection
Starfish currently implements a pixel ROI detection routine that is designed to identify spots. It begins by reshaping the `ImageStack` into "pixel traces", which are `(round, channel)` arrays for each `(z, y, x)` pixel in the `ImageStack`. It decodes these pixel traces by comparing each trace to the `codebook`, which contains expected intensities for the same `(round, channel)` combinations. The data are then reshaped back into 3D `(z, y, x)` euclidean space, and pixels that decoded to the same code and exceed a specified minimum intensity are connected into ROIs. ROIs are then filtered based on their pixel count.

Clarifying ROI vs. Spot detection
Pixel-based ROI detection was initially posed as a spot detection approach. However, our implementation does not detect spots, so this is confusing. Conversations with @kevinyamauchi revealed that his excitement around pixel-based analyses lay in their potential use as a filter that converts images from `(channel, round, z, y, x)` space to `(gene, z, y, x)` space. These converted images could then be subjected to a spot caller, thus incorporating information both about the pixel intensities and the (useful) prior that intensities in our experiments should form spots. This filter would provide clearer peaks in the case of overlapping spots: the overlap location would have moderate association with both codes, whereas the non-overlapping portions would peak strongly for their respective codes. It would not solve problems where two molecules of the same transcript type overlap in space.

Gene-space decoding filter
Transforming coded `(channel, round)` combinations into `(gene,)` space is easy to accomplish by decoding each pixel. In the case of MERFISH, the data are first subjected to a 1-sigma Gaussian blur to spread information and make decoding less sensitive to local noise. This filter could be implemented in multiple ways, which have very different memory considerations depending on the codebook structure.

Option 1: "All genes". Encode the distance/similarity of each pixel to each code. A one-hot encoded codebook of `{"c": 4, "r": 4}` with 16 gene targets takes up the same amount of memory in this encoding. However, a `{"c": 2, "r": 8}` codebook with 1000 targets (a MERFISH codebook) takes up 1000 / 16 times more memory, which is not practical.

Option 2: "Max genes". Encode the maximum similarity across all codes, and store a second label_image that contains the ID of the most similar gene. This always requires only 2 tiles of memory, but loses the ability to compute over the probabilities across genes.
Proposed Changes
- Implement the gene-space decoding filter, and run `BlobDetector` on the filtered image. [#TODO this requires understanding whether this is simply a filter, or a filter plus the `BlobDetector`]

Learnings for visualization
It would be useful to enable users to construct codebooks that don't reflect true targets, in order to better visualize aberrant codes. Example: lipofuscin could be represented as a code that is "1" in each channel and round. IntensityTables can be decoded multiple times, and the resulting gene annotations can be used to produce separate marker layers. The lipofuscin layer could be used to create black spots that mask those spots across `(round, channel)` in `starfish.display.stack`, making clearer which spots are unexplained by the current model.
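A toy sketch of such an aberrant code in plain NumPy (not starfish's `Codebook` API): a "lipofuscin" code that is 1 in every `(round, channel)` wins the decode for a spot that fluoresces everywhere, so those spots can be masked in the display.

```python
import numpy as np

n_rounds, n_channels = 4, 4

# One-hot gene codes, plus a synthetic "lipofuscin" code that is 1 in
# every (round, channel). It is not a real target -- only a decoding
# target used to flag autofluorescent junk for masking.
gene_codes = np.eye(n_rounds * n_channels)
lipofuscin = np.ones((1, n_rounds * n_channels))
codebook = np.vstack([gene_codes, lipofuscin])
names = [f"gene_{i}" for i in range(n_rounds * n_channels)] + ["lipofuscin"]

# A spot that fluoresces in every tile is closest (by cosine
# similarity) to the all-ones code, so it decodes to "lipofuscin".
trace = np.ones(n_rounds * n_channels)
norm = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
best = names[int(np.argmax(norm @ (trace / np.linalg.norm(trace))))]
# best == "lipofuscin"
```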
Open Questions to be answered
- Is `skimage.feature.peak_local_max` flexible enough to support the same blob types?
- Is `trackpy` faster in 3d than the `blob_dog` method used in `BlobFinder`? If so, why?
- Could `BlobFinder` be generalized to run over arbitrary axes? This would enable us to compute over images that have been converted to gene space, which could be useful.
- Is the trade-off the `IntensityTable` makes between accuracy and data density acceptable? Do we want to adjust how that object is structured?
- Do we need a `SpotFinderResult` class as proposed in #908? In this new world, the "max-gene" filter would produce an additional result, although the "all gene" one would not require this. How would this extra information be used in downstream starfish components?
- Can the `LocalMaxPeakFinder` threshold finding method be sped up? Only `peak_local_max` should need to be called multiple times (and even that could be optimized!).
- How should data be transformed from `(round, channel, z, y, x)` space into `(gene, z, y, x)` space? One could imagine there is a simple solution where `channel` is used as a proxy for `gene`, and a `(4, 4, z, y, x)` array (like ISS) combined with a `(codes=16, r=4, c=4)` codebook would produce a `(1, 16, z, y, x)` decoded `ImageStack`.
- Should any of this functionality live upstream in `skimage`?
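The `(round, channel, z, y, x)` to `(gene, z, y, x)` transformation asked about above can be sketched as a tensor contraction; the shapes are hypothetical and plain NumPy stands in for `ImageStack`:

```python
import numpy as np

# Hypothetical ISS-like stack: 4 rounds x 4 channels over a 2D tile.
r, c, y, x = 4, 4, 32, 32
rng = np.random.default_rng(2)
stack = rng.random((r, c, y, x))

# One-hot (codes=16, r=4, c=4) codebook: code k is "on" in exactly
# one (round, channel) pair.
codebook = np.eye(16).reshape(16, r, c)

# Decoding as a tensor contraction over (r, c) turns the
# (4, 4, y, x) stack into a (16, y, x) gene-space image.
gene_space = np.tensordot(codebook, stack, axes=([1, 2], [0, 1]))
# With one-hot codes, gene_space[k] is exactly stack[k // 4, k % 4]:
# each (round, channel) tile acts as a proxy for a gene.
```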