spacetx / starfish

starfish: unified pipelines for image-based transcriptomics
https://spacetx-starfish.readthedocs.io/en/latest/
MIT License

New seqFISH Decoding Method #1964

Closed ttung closed 2 years ago

ttung commented 2 years ago

This PR adds a new spot-based decoding method, the CheckAll decoder, based on the method described here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/). It is capable of detecting several times more true targets from seqFISH image data than current spot-based methods in starFISH (such as PerRoundMaxChannel), because barcodes are not restricted to exactly matching spots or only nearest-neighbor spots, and because it assembles candidate barcodes from spots in every round instead of a single arbitrary anchor round. It can also take advantage of error-correction rounds in the codebook, which current starFISH methods do not consider.

Summary of algorithm:

Inputs:

- spots - starFISH SpotFindingResults object
- codebook - starFISH Codebook object
- filter_rounds - Number of rounds that a barcode must be identified in to pass filters
- error_rounds - Number of error-correction rounds built into the codebook (i.e., the number of rounds that can be dropped from a barcode while still uniquely matching a single target in the codebook)

1. For each spot in each round, find all neighbors in other rounds that are within the search radius
2. For each spot in each round, build all possible full length barcodes based on the channel labels of the spot's 
neighbors and itself
3. Drop barcodes that don't have a matching target in the codebook
4. Choose the "best" barcode for each spot from among its possible target-matching barcodes by calculating, for each
candidate, the sum of variances of the spatial coordinates of its constituent spots, and choosing the barcode with the
minimum value (if there is a tie, all are dropped as ambiguous). Each spot is assigned a "best" barcode in this way.
    sum( var(x1,x2,...,xn), var(y1,y2,...,yn), var(z1,z2,...,zn) ), where n = # spots in barcode
5. Only keep barcodes/targets that were found as "best" in a certain number of the rounds (determined by filter_rounds
parameter)
6. If a specific spot is used in more than one of the remaining barcodes, the barcode with the higher spatial variance
between its spots is dropped (ensuring each spot is used only once)
(End here if number of error_rounds = 0)
7. Remove all spots used in decoded targets that passed the previous filtering steps from the original set of spots
8. Rerun steps 2-5 for barcodes that use fewer than the full set of rounds for codebook matching (the number of rounds
that can be dropped is determined by the error_rounds parameter)
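The spot-clustering score from step 4 can be sketched in a few lines. This is a toy illustration with numpy only, not the PR's implementation; the function name and coordinates are made up:

```python
import numpy as np

def spatial_variance(coords):
    """Sum of per-axis variances of the spots making up a candidate barcode:
    sum(var(x), var(y), var(z)). This is the score minimized in step 4."""
    return float(np.var(coords, axis=0).sum())

# Two hypothetical candidate barcodes for the same spot, each a stack of
# (x, y, z) coordinates, one spot per round. Tighter clusters score lower.
candidates = {
    "tight": np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]),
    "loose": np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]),
}

# Step 4's selection rule: keep the minimum-score candidate.
best = min(candidates, key=lambda k: spatial_variance(candidates[k]))
```

The "distance" reported for each decoded barcode (see below) is this same sum-of-variances value.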

Tests of the CheckAll decoder vs starFISH's PerRoundMaxChannel method (with the nearest-neighbor trace building strategy) show improved performance with the CheckAll decoder. All the following tests used seqFISH image data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046268/.

[Figure: total decoded transcripts per cell (left) and smFISH correlation (center; right, zoomed to 0-1) vs. search radius for each method]

Note PRMC NN = PerRoundMaxChannel Nearest Neighbor

The x-axis for each of the above images marks the value for the search radius parameter used in either decoding method (the distance that spots can be from a reference spot and be allowed to form a potential barcode). It is marked in increments of increasing symmetric neighborhood size (in 3D). The left figure shows the total number of decoded transcripts that are assigned to a cell for each method (note: for the CheckAll decoder this includes partial barcodes (codes that did not use all rounds in decoding) which the PerRoundMaxChannel method does not consider). Depending on the search radius, there is as much as a 442% increase in the total number of decoded barcodes for the CheckAll decoder vs PerRoundMaxChannel.

To assess the accuracy of each decoding method, I used orthologous smFISH data that was available from the same samples for several dozen of the same genes probed in the seqFISH experiment. Using this data, I calculated the Pearson correlation coefficient between the smFISH data and the results of decoding the seqFISH data with each method (note: because the targets in this dataset were introns (see paper), the values correlated were the calculated burst frequencies for each gene (how often/fast transcription cycles on/off) rather than counts). The results are shown in the center figure above, with the right-hand figure showing the same data zoomed out to a 0-1 range. The starFISH PerRoundMaxChannel method does achieve higher accuracy on this test, but the difference is not significant and comes at the cost of detecting far fewer barcodes. (Note: missing values at the lower end of the x-axis are due to not having enough results to calculate the burst frequency of the transcripts.)
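For reference, the correlation metric itself is just a standard Pearson r between per-gene values from the two modalities. A minimal sketch with fabricated numbers (the real comparison used per-gene burst frequencies, not these values):

```python
import numpy as np

# Hypothetical per-gene burst frequencies derived from the orthologous
# smFISH data and from one of the seqFISH decoding methods.
smfish = np.array([1.2, 0.8, 2.5, 0.3, 1.9])
seqfish = np.array([1.1, 0.9, 2.3, 0.4, 2.0])

# Pearson correlation coefficient between the two measurements.
r = np.corrcoef(smfish, seqfish)[0, 1]
```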

Unlike current starFISH methods, the CheckAll decoder is capable of taking advantage of error-correction rounds built into the codebook. As an example, say an experiment is designed with a codebook that has 5 rounds, but the codes are designed such that any 4 of those rounds are enough to uniquely match a barcode to a target. The additional round is considered an error-correction round: you may be able to uniquely identify a barcode as a specific target with only 4 rounds, but if you can also match the fifth round you can be extra confident that the spot combination making up the barcode is correct. This method is based on a previous pull request made by a colleague of mine (https://github.com/ctcisar/starfish/pull/1).
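The property that makes a round an error-correction round can be checked directly: every partial code, with some number of rounds dropped, must still map to exactly one target. A small sketch, assuming a toy codebook of per-round channel labels (the function and codes are illustrative, not starfish API):

```python
from itertools import combinations

def unique_with_dropped_rounds(codebook, n_drop=1):
    """True if every partial code (any n_drop rounds removed) still maps to
    exactly one target, i.e. the codebook supports n_drop correction rounds."""
    n_rounds = len(next(iter(codebook.values())))
    for keep in combinations(range(n_rounds), n_rounds - n_drop):
        partial = {}
        for target, code in codebook.items():
            key = tuple(code[r] for r in keep)
            partial.setdefault(key, set()).add(target)
        if any(len(targets) > 1 for targets in partial.values()):
            return False
    return True

# Toy 5-round codebook (channel index per round); these codes differ in
# every round, so any 4 rounds still uniquely identify each target.
codebook = {
    "geneA": (0, 1, 2, 3, 0),
    "geneB": (1, 2, 3, 0, 1),
    "geneC": (2, 3, 0, 1, 2),
}

# A codebook whose codes collide once the last round is dropped has no
# error-correction round.
ambiguous = {
    "geneA": (0, 1, 2, 3, 0),
    "geneD": (0, 1, 2, 3, 1),
}
```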

[Figure: decoded barcode counts and smFISH correlation vs. search radius, with CheckAll results split into full-round (error-corrected) and partial barcodes]

The above figures show results similar to the first figure, except the CheckAll results have been split between barcodes made using spots from all rounds (error correction) and those with only a partial match (no correction). Even without considering error correction, the CheckAll decoder detects as many as 181% more barcodes than the PerRoundMaxChannel method. The smFISH correlations are as expected, with error-corrected barcodes achieving a higher correlation score with the smFISH data than those that were not corrected. Whether a barcode in the final DecodedIntensityTable uses an error-correction round can be read from the new "rounds_used" field, which records the number of rounds used to build each barcode in the table. This allows easy separation of the data into higher- and lower-confidence calls. Additionally, the distance field of the DecodedIntensityTable is no longer based on the intensity of the spots in each barcode; it is now the sum of variances of the spatial coordinates of the spots in the barcode. This can also be used as a filter, as barcodes made of more tightly clustered spots may be more likely to be true targets.
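A sketch of the kind of post-hoc filtering these two fields enable, using plain numpy arrays as a stand-in for the per-barcode columns of a DecodedIntensityTable (the values and the 0.2 threshold are made up):

```python
import numpy as np

# Stand-ins for the per-barcode fields: rounds_used (number of rounds that
# contributed to each barcode) and distance (the sum-of-variances score).
rounds_used = np.array([5, 4, 5, 4, 5])
distance = np.array([0.02, 0.50, 0.10, 0.03, 0.90])

# Keep only high-confidence calls: full-round barcodes whose spots are
# tightly clustered in space.
high_conf = (rounds_used == 5) & (distance < 0.2)
```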

[Figure: runtime in minutes vs. search radius for the CheckAll decoder (16 threads) and PerRoundMaxChannel nearest-neighbor]

The major downside of the CheckAll decoder is its speed. This is no surprise: it searches the entire possible barcode space for every spot in every round instead of just the nearest neighbors of spots in a single round, and the possible barcode space grows quickly as the search radius increases, which can significantly increase run times. To address this, I've added the ability to multi-thread the program and run multiple chunks simultaneously in parallel using the Python module ray, though even with this added parallelization, runtimes for CheckAll are much higher than for PerRoundMaxChannel. The above figure shows runtime in minutes for the CheckAll decoder (using 16 threads) vs PerRoundMaxChannel with nearest neighbors. (Note: the seqFISH dataset used here is among the larger available, at 5 rounds, 12 channels, and over 10,000 barcodes in the codebook, so for most other seqFISH datasets I expect runtimes will be considerably lower than shown here; unfortunately I did not have access to another suitable seqFISH dataset to test on.) Ongoing work is being done to optimize the method and bring runtimes down. I was unable to figure out how to correctly add ray to the requirements file, so that will still need to be done.
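The chunking pattern is straightforward: split the spots into chunks, decode each chunk in a worker, and concatenate the results. The PR uses ray for this; the sketch below swaps in the stdlib's ThreadPoolExecutor purely for portability, and `decode_chunk` is a placeholder, not the real per-chunk decoding routine:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def decode_chunk(spot_ids):
    # Placeholder for the per-chunk work; the real method builds and scores
    # candidate barcodes for the spots in this chunk.
    return [s * 2 for s in spot_ids]

def decode_parallel(spot_ids, n_workers=4):
    # Split the spots into roughly equal chunks, one per worker.
    chunks = np.array_split(np.asarray(spot_ids), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so results concatenate cleanly.
        results = pool.map(decode_chunk, chunks)
    return [r for chunk in results for r in chunk]
```

With ray the shape is the same, except `decode_chunk` becomes a remote task and results are gathered with `ray.get`, which lets chunks run in separate processes rather than threads.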

nickeener commented 2 years ago

@ttung I fixed the problem with the linting and pushed the changes to my original PR but I'm not sure how to transfer those changes to this new PR you've made.

nickeener commented 2 years ago

After merging changes from #1963, it seems to pass all tests