spacetx / starfish

starfish: unified pipelines for image-based transcriptomics
https://spacetx-starfish.readthedocs.io/en/latest/
MIT License

Updated seqFISH Decoding Method: CheckAll Decoder #1978

Closed nickeener closed 1 year ago

nickeener commented 2 years ago

This is an update to my previous PR for a new seqFISH decoding method. We discovered some problems with the method's false positive rate and have spent the past several months improving it to its current state; we think it is now fit to be published.

So recapping the previous PR: we found that starFISH's seqFISH decoding (PerRoundMaxChannel with the NEAREST_NEIGHBOR or EXACT_MATCH TraceBuildingStrategy) was finding far fewer targets in our seqFISH datasets than we were expecting from previous results, and the reason for this is the limited search space within which starFISH looks for barcodes. For both TraceBuildingStrategies, spots in different rounds are only connected into barcodes if they are spatially close to a spot in an arbitrary anchor round, which forgoes the inherent error correction available from comparing the barcodes found using each round as an anchor. Additionally, the EXACT_MATCH TraceBuildingStrategy requires spots in different rounds to be in the exact same pixel location to be connected into barcodes, while NEAREST_NEIGHBOR requires that they be nearest neighbors; both are extremely sensitive to spot drift between rounds and require precise registration that isn't always possible. The PerRoundMaxChannel decoder also lacks the ability to find error-corrected barcodes (in a codebook where all codes are separated by a Hamming distance of at least 2, each code can still be uniquely identified by a barcode that is missing one round).
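
For reference, the existing starFISH decoding being compared against looks roughly like this (parameter values are illustrative):

```python
from starfish.spots import DecodeSpots
from starfish.types import TraceBuildingStrategies

# Existing starFISH decoding discussed above: spots are linked across
# rounds relative to a single anchor round, then decoded per round by
# max channel. `spots` is a SpotFindingResults, `codebook` a Codebook.
decoder = DecodeSpots.PerRoundMaxChannel(
    codebook=codebook,
    anchor_round=0,                # all other rounds are matched to this one
    search_radius=3,               # illustrative value, in pixels
    trace_building_strategy=TraceBuildingStrategies.NEAREST_NEIGHBOR,
)
decoded = decoder.run(spots=spots)
```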

To address this, I attempted to write a seqFISH decoding method that emulates the one used by the lab at Caltech where seqFISH originated. In this method, the spots in all rounds are separately used as anchors to search for spots within a search radius of the anchor spots. So for each spot, all the possible barcodes that could be constructed using the spots in other rounds within the search radius are considered, which is a much larger search space than when only considering exact matches or nearest neighbors. Each anchor spot is then assigned a "best" barcode out of its total possible set (it "checks all" of the possible barcodes) based on the spatial variance and intensities of the spots that make it up and whether it has a match in the codebook, though the order of those steps varies between the two algorithms described below.
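
A minimal sketch of the candidate-building idea, assuming spot coordinates and channel labels are grouped per round (this is illustrative, not the PR's actual code):

```python
from itertools import product
from scipy.spatial import cKDTree

# Sketch of "check all" candidate building (full barcodes only).
# spots_by_round: round -> (n, 2) array of spot coordinates.
# channels_by_round: round -> matching array of channel labels.
def candidate_barcodes(anchor_round, spots_by_round, channels_by_round, radius):
    trees = {r: cKDTree(xy) for r, xy in spots_by_round.items()}
    barcodes = []
    for i, xy in enumerate(spots_by_round[anchor_round]):
        per_round_channels = []
        for r in sorted(spots_by_round):
            if r == anchor_round:
                # The anchor spot fixes this round's channel.
                per_round_channels.append([channels_by_round[r][i]])
                continue
            # All spots in round r within `radius` of the anchor spot.
            idx = trees[r].query_ball_point(xy, r=radius)
            if not idx:
                break  # no candidate in this round; no full barcode here
            per_round_channels.append([channels_by_round[r][j] for j in idx])
        else:
            # Cartesian product over rounds = every barcode this anchor
            # spot could take part in ("checks all" of them).
            barcodes.extend(product(*per_round_channels))
    return barcodes
```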

This larger search space requires a few extra tricks to keep false positive calls down. One is a filter, which the Caltech group calls the seed filter, that removes any barcode that was not chosen as best for a minimum number of the spots/rounds that make it up (usually one less than the total round number). So if we had a 5-round barcode and that barcode was chosen as the best choice for only 3 of the spots that make it up, it would not pass the filter. I've also introduced a parameter I call strictness, which sets the maximum number of possible barcodes a given spot is allowed to have before it is dropped as ambiguous; barcodes in spot-dense regions tend to be difficult to call accurately, so it is useful to be able to restrict this when necessary. Another trick I found helpful was to run decoding in multiple nested stages that start off with stricter parameters, which are loosened as decoding progresses. The loosened parameters are the search radius (starts at radius 0 and increments up to the user-specified value), the number of rounds that can be omitted from a barcode to make a match (full codes are sought first, then partials), and the maximum allowed score (based on the spatial variance and intensity of the spots). Between stages, the spots found in decoded barcodes that pass all filters are removed from the original spot set before the remaining spots are decoded. This lets you call high-confidence barcodes first and remove their spots, making adjacent barcodes easier to call. Running the whole decoding process repeatedly became quite time consuming, so I have added multiprocessing capability, using Python's standard multiprocessing library, to let users speed things up.
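
The seed filter is simple to state in code. A schematic version (the names and data layout here are mine, not the PR's actual code):

```python
from collections import Counter

# Schematic seed filter.
# best_per_spot: spot id -> the barcode (a tuple) chosen as best for it.
# barcode_spots: barcode -> the spot ids that make it up.
def seed_filter(best_per_spot, barcode_spots, n_rounds):
    votes = Counter(best_per_spot.values())
    # Keep a barcode only if it was chosen as "best" by at least
    # n_rounds - 1 of its own spots.
    return {bc: ids for bc, ids in barcode_spots.items()
            if votes[bc] >= n_rounds - 1}
```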

In addition to the filters and incremental decoding, I also take advantage of two slightly different decoding algorithms with different precision/recall properties. In one, which I call "filter-first", once all the possible barcodes for each spot are assembled, the "best" barcode for each spot is chosen using a scoring function based on the spatial variance of the spots and their intensities; the chosen barcodes are then matched against the codebook, and those with no match are dropped. The other, called "decode-first", instead matches all the possible barcodes for each spot against the codebook first, and if there are multiple matches the best is chosen by the distance score. The "filter-first" method tends to be more accurate but returns fewer decoded targets (high precision/low recall), while the "decode-first" method finds more mRNA targets at a cost to accuracy (lower precision/higher recall). In keeping with the incremental decoding strategy described earlier, the first passes of decoding use the high-accuracy "filter-first" method, followed by passes using the lower-accuracy "decode-first" version.
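
For concreteness, a plausible shape for the scoring function, combining spatial variance and intensity so that lower is better (this exact formula is an assumption on my part, not the PR's):

```python
import numpy as np

# Hypothetical score for one candidate barcode: tight spatial clustering
# and bright (scaled) intensities score well; lower scores are better.
# coords: (n_rounds, 2) spot positions; intensities: scaled intensities.
def barcode_score(coords, intensities):
    spatial_var = np.var(coords, axis=0).mean()   # spread across rounds
    return spatial_var / (np.mean(intensities) + 1e-9)
```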

Below is a description of the required inputs and a step by step explanation of the algorithm.

Inputs:

  * The spots found by one of starFISH's spot finding methods, along with the codebook
  * The search radius (in pixels) within which spots can be connected into barcodes
  * error_rounds: the number of rounds that may be dropped from a barcode while still making a unique codebook match
  * mode: an accuracy preset, "high", "med", or "low" (described below)
  * The number of processes to use for multiprocessing

Steps:

  1. For each spot in each round, calculate a scaled intensity value by dividing the spot's intensity by the L2 norm of the intensities of all spots found in the same channel and within a 100-pixel-radius neighborhood.
  2. For each spot in each round, find all neighboring spots in different rounds within the search radius.
  3. From each spot and its neighbors, build all possible combinations of spots that form barcodes, using their channel labels.
  4. Choose a "best" barcode for each spot using the scoring function and drop barcodes that don't have codebook matches (this is for the filter-first method and is switched for decode-first).
  5. Drop barcodes that were not found as best for most of the spots that make it up (usually one less than the round total).
  6. Choose between overlapping barcodes (barcodes that share spots) using their scores.
  7. Remove spots found in passing barcodes from the original spot set. (Steps 2-7 are repeated, first using the filter-first method for each incremental radius and then using the decode-first method for each radius; a schematic of this loop follows the list.)
  8. One final decoding pass is run if the error_rounds parameter is > 0, allowing partial barcodes. It uses the filter-first method and strict parameters, since the partial barcodes it finds are more error prone than full barcodes.
  9. Turn the overall barcode set into a DecodedIntensityTable object and return it.
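
Schematically, the incremental schedule in steps 2-8 looks like this (pseudocode-level Python, not the merged implementation; `decode` stands in for one pass of steps 2-6):

```python
# Pseudocode-level outline of the incremental schedule. `decode` returns
# the passing barcodes plus the spots they consumed.
results = []
for method in ("filter-first", "decode-first"):    # high precision first
    for radius in range(0, max_radius + 1):        # widen the search stepwise
        passing, used = decode(spots, codebook, radius=radius, method=method)
        results.extend(passing)
        spots = spots - used                       # step 7: drop used spots

if error_rounds > 0:
    # Step 8: one final pass allowing partial (error-corrected) barcodes,
    # run with strict parameters since partial matches are more error prone.
    passing, used = decode(spots, codebook, radius=max_radius,
                           method="filter-first", allow_partial=True)
    results.extend(passing)
```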

One final detail worth explaining here is the mode parameter. It can take three values, "high", "med", or "low", and collapses a number of parameter choices into presets: the seed filter value at the end of each decoding stage, the maximum allowed barcode score, and the strictness values for the filter-first and decode-first methods. "high" corresponds to high accuracy (lower recall), "low" corresponds to lower accuracy (higher recall), and "med" is a balance between the two.
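
In code, the presets can be pictured as a small lookup table. The structure below matches the description; the particular values are placeholders, not the decoder's actual numbers:

```python
# Placeholder preset table: the shape matches the description above, but
# these values are illustrative, not the decoder's actual ones.
MODE_PRESETS = {
    #        seed filter, max score, strictness (filter-first, decode-first)
    "high": {"seed": 4, "max_score": 1.0, "strict_ff": 50,  "strict_df": 5},
    "med":  {"seed": 3, "max_score": 2.0, "strict_ff": 100, "strict_df": 10},
    "low":  {"seed": 2, "max_score": 4.0, "strict_ff": 200, "strict_df": 20},
}
```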

To evaluate the results of our decodings and compare them with starFISH methods, we used two different QC measures. The first is correlation with orthogonal RNA counts (either smFISH or RNAseq); the other is the proportion of false positive calls, measured by placing "blank" codes in the codebook that adhere to its Hamming distance rules but don't correspond to any true mRNA. This false positive metric was missing from my previous PR, and when we ran it on the previous method our false positive values were quite high. Depending on the dataset and on how the false positive rate is calculated, high false positive rates can be a problem with seqFISH data, but the mode parameter lets you easily adjust the decoding parameters to get the best results.
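
For concreteness, one plausible convention for turning blank-code calls into an FP rate, normalizing by how many blank versus real codes the codebook contains (the figures below may use a different normalization):

```python
# One plausible blank-based FP estimate: compare calls per blank code to
# calls per real code, so codebooks with few blanks aren't undercounted.
# This exact normalization is an assumption, not necessarily the one used.
def fp_rate(n_blank_calls, n_blank_codes, n_real_calls, n_real_codes):
    calls_per_blank = n_blank_calls / n_blank_codes
    calls_per_real = n_real_calls / n_real_codes
    return calls_per_blank / (calls_per_blank + calls_per_real)
```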

Below are some figures showing results from 3 different seqFISH datasets, comparing the starFISH PerRoundMaxChannel + NEAREST_NEIGHBOR method with each mode choice for the CheckAll decoder. Each red/blue bar pair along the x-axis represents a single cell, FOV, or image chunk (we don't have segmented cells yet for one dataset, so it is broken into equal-sized chunks). The height of the blue bar is the total number of on-target (real) barcodes found in that cell/FOV/chunk, the red bar height is the off-target (blank) count, and the black bar shows the median on-target count. Also printed on each graph is the overall false positive (FP) rate and the Pearson correlation with smFISH or RNAseq where available. For each dataset, the same set of spots was used for every run.

Intron seqFISH

[results figure]

RNA SPOTs seqFISH

[results figure]

Mouse Atlas seqFISH

[results figure]

From these, you can see that this updated decoder is capable of finding more targets than starFISH, with lower false positive rates and higher smFISH/RNAseq correlations. The one area where the starFISH methods have us beat is speed. Being more thorough is costly: while starFISH can decode a set of spots in 2-3 minutes on my system, this new decoder can take up to 1.5 hours even when using 16 simultaneous processes. That time is for the intron seqFISH dataset, which is quite large (5 x 12 x 29 x 2048 x 2048), so run times are more reasonable for smaller datasets.
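
The parallelization is the standard-library kind mentioned above. Schematically (`decode_chunk` is a hypothetical stand-in for one decoding pass over a subset of anchor spots, not the PR's actual worker):

```python
from multiprocessing import Pool

# Schematic use of Python's standard multiprocessing library.
def decode_chunk(chunk):
    return [spot for spot in chunk]  # placeholder for real decoding work

def parallel_decode(anchor_spots, n_processes=16):
    # Deal anchor spots round-robin into one chunk per process.
    chunks = [anchor_spots[i::n_processes] for i in range(n_processes)]
    with Pool(processes=n_processes) as pool:
        per_chunk = pool.map(decode_chunk, chunks)
    return [bc for chunk in per_chunk for bc in chunk]
```

(As usual with multiprocessing, the worker has to be importable, so this belongs at module level and should be driven from an `if __name__ == "__main__":` entry point.)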

Running the "make all" tests on my current repo throws a mypy error, but in code that I haven't touched (like the ImageStack class code), so hopefully we can work together to get those kinks worked out and add this to starFISH so others can use it.

berl commented 2 years ago

@nickeener this is really cool to see- thanks for the contribution! I'll take a look at the test/CI situation and see if I can help

nickeener commented 2 years ago

@berl Any updates on this? It appears that it is erroring out trying to download the sphinx-bootstrap-theme dependency on Python 3.9 but is otherwise passing all the initial tests.

berl commented 2 years ago

@nickeener yeah, this has been fixed in sphinx-bootstrap-theme==0.8.1, so that should get you past the install tests.

Add this line to the REQUIREMENTS.txt file: sphinx-bootstrap-theme==0.8.1, then run make pin-all-requirements, which will propagate the new requirement into the REQUIREMENTS-*.txt files. This results in a successful make install-dev on my mac, but I haven't done any other testing past this. If you commit back to this, we'll get the CI tests to run.

nickeener commented 2 years ago

@berl Running into this error after I add the sphinx-bootstrap-theme==0.8.1 line to REQUIREMENTS.txt and then run make pin-all-requirements:

```
make starfish/REQUIREMENTS-STRICT.txt requirements/REQUIREMENTS-CI.txt requirements/REQUIREMENTS-NAPARI-CI.txt
make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'
[ ! -e .REQUIREMENTS.txt-env ] || exit 1
make[1]: [Makefile:103: starfish/REQUIREMENTS-STRICT.txt] Error 1
make[1]: Leaving directory '/home/nick/projects/spacetx/decoder/starfish'
make: [Makefile:98: pin-all-requirements] Error 2
```

Any advice on this?

berl commented 2 years ago

@nickeener did you run make pin-all-requirements from inside the starfish directory? my output doesn't show

make[1]: Entering directory '/home/nick/projects/spacetx/decoder/starfish'

but the error is in line 103 where it checks for a virtual environment. can you create a virtual environment with python=3.9 (using conda or venv ) and activate it before running make pin-all-requirements?

Also, what's your OS and conda version if applicable?

nickeener commented 2 years ago

@berl Yes I am in the top level starfish directory when I run make pin-all-requirements. I've tried creating a fresh python3.9 conda env and then installing everything with make install-dev but I get the same exact error.

My OS is Ubuntu 20.04.4 and my conda version is conda 4.13.0.

Would you be able to share the new correctly built requirements files with me so I could just replace the old versions with them?

ttung commented 2 years ago

@nickeener I suspect you previously tried to repin requirements, and there are stale files left over from that attempt. Can you remove $STARFISH/.REQUIREMENTS.txt-env (rm -rf /home/nick/projects/spacetx/decoder/starfish/.REQUIREMENTS.txt-env) and try again?

nickeener commented 2 years ago

@ttung That was exactly what was wrong, thanks! I was able to run make pin-all-requirements and get the updated requirements files, but now it is throwing several mypy errors, all in files that I haven't changed.

ttung commented 2 years ago

I've cloned your changes into #1984. I have it mostly working.

nickeener commented 2 years ago

@ttung I've pushed a fix here that should solve the test error in #1984.

nickeener commented 2 years ago

@ttung Hey, just re-pinging this: the last commit I made here should fix the error message in #1984.

nickeener commented 2 years ago

@ttung @berl Re-pinging again. We would really like to get this merged soon, as we are currently preparing a manuscript that includes this project, so it would be nice to be able to say it is already integrated into starFISH.

ttung commented 2 years ago

@nickeener I cherry picked most of https://github.com/spacetx/starfish/pull/1978/commits/8752555eeaa4b4d8a064b18adc1b30dc331826d8 into my branch and fired off a build.

nickeener commented 2 years ago

@ttung @berl Looks like it's passing all the tests now! Let me know if there is anything else you'd like from me before merging.

berl commented 2 years ago

@ttung thanks for your help with this! My review was pretty basic but after the edits above, I'm inclined to merge to add this functionality to starfish.