reillytilbury / coppafish

Python version of iss software
MIT License
4 stars 2 forks source link

Documentation Updates #288

Closed reillytilbury closed 1 month ago

reillytilbury commented 1 month ago

I have been putting off documentation updates for way too long! Given that I will be leaving soon, this really needs to happen soon. Increasing the amount of information we have about the algorithms behind filter, stitch, register, call spots and coppafish tools will be absolutely crucial for when I am no longer part of cortexlab.

reillytilbury commented 1 month ago

Call Spots Documentation

The Call Spots Page is a mode of gene calling which runs quickly on a small set of local maxima ($\approx$ 50, 000 per tile) of the anchor image. Initially, this was our final mode of gene calling, but has since been superseded by OMP. That being said, the call spots section is still a crucial part of the pipeline as it estimates several important parameters used in the OMP section.

Some of the most important exported parameters of this section are:

Algorithm Flow Chart

call_spots_colours

Algorithm Breakdown

The inputs to the algorithm are:

0: Preprocessing

We begin by transforming the raw spot colours via the following function: $$F{s,r,c} \mapsto P{t(s), r, c}F{s,r,c} - \beta{s,c}.$$ In the formula above:

%TODO: Add an image of a spot before anything, after scaling, then after removing background

1: Initial Gene Assignment

The purpose of this section is to provide some gene assignments that will facilitate further calculations. To see why this is necessary, for many important variables we hope to calculate (eg: tile-independent bled codes for each gene $E{g,r,c}$ or the tile-dependent bled codes $D{g,t,r,c}$), we need samples of the genes to work with. This requires an initial assignment $g_s$ for each spot $s$. The simplest way to do this would be to:

  1. Create an initial bled code $\tilde{\mathbf{K}_g }= \mathbf{C_g B}$ for each gene $g$ by matrix multiplying the code matrix $\mathbf{C_g}$ for each gene $g$ with the bleed matrix $\mathbf{B}$. (The code matrix $\mathbf{Cg}$ is a binary matrix of shape $(n{\text{rounds}} \times n{\text{dyes}})$ such that $C{grd} = 1$ if gene $g$ has dye $d$ in round $r$ and 0 otherwise)
  2. Compute a similarity matrix $\mathbf{Y}$ of shape $(n{\text{spots}} \times n{\text{genes}})$ of the cosine angle of each spot $s$ to each gene $g$ defined by $$\mathbf{Y_{sg}} = \mathbf{F_s \cdot \tilde{Kg}} = \sum{rc}F{src}\tilde{K{grc}}.$$
  3. Define a gene assignment vector $\mathbf{G}$ of length $n_{\text{spots}}$ defined by $$Gs = \text{argmax} _g Y{sg}.$$

Unfortunately, this method does not work very well! The main reason for this is that the initial bled codes $\mathbf{\tilde{K}_g}$ are very bad estimates of the true bled codes $\mathbf{K_g}$. Step 1 above assumes that each gene's expected code is just a copy of the expected dye in that round. In equations, this is saying

$$\mathbf{Kgr} = \mathbf{B{d(g, r)}}.$$

However, due to random fluctuations in the concentrations of bridge probes for each gene in each round some genes appear systematically brighter/dimmer in some rounds! A much better model for the bled codes is $$\mathbf{Kgr} = \lambda{gr} \mathbf{B_{d(g, r)}},$$

where $\lambda_{gr}$ is called the gene efficiency of gene $g$ in round $r$.

Since we have no prior estimate of $\lambda{gr}$, we need a method which will normalise each spot colour $F{src}$ in each round, thereby removing any systematic scaling between rounds before proceeding in a way similar to steps 2 and 3.

reillytilbury commented 1 month ago

Probabilistic Gene Assignments using FVM

We are going to define a probability distribution of this spot $s$ belonging to dye $d$ in round $r$. To that end, fix a spot $s$ and round $r$ and let $\mathbf{Fr}$ be this spot’s L2-normalized fluorescence vector (length $n{\text{c}}$). Also let $\mathbf{Bd}$ be the L2-normalized fluorescence vector (length $n{\text{c}}$) of dye $d$.

We'd like to assign a probability that any unit vector belongs to dye $d$. The simplest non-uniform distribution defined on the sphere is the Fisher Von Mises distribution, which is just the restriction of a multivariate isotropic normal distribution to the unit sphere. This distribution has 2 parameters:

In our case $\boldsymbol{\mu}$ = $\mathbf{B_d}$ and $\kappa$ is a parameter that we leave arbitrary for the moment. Then the probability (density) of observing a fluorescence vector $\mathbf{f}$ given that that spot comes from dye $d$ is

$$\mathbb{P}[\mathbf{F_r} = \mathbf{f_r} \mid D = d] = K \exp(\kappa\mathbf{f_r} \cdot \mathbf{B_d}),$$

where $K$ is a normalization constant we don’t need to worry about.

Write $\mathbf{F} = ( \mathbf{F1}^T, \cdots, \mathbf{F{nr}}^T)$ for the $n{\text{r}}n_{\text{c}}$-dimensional spot code with each round L2-normalized and $\mathbf{bg} = (\mathbf{B{d(g, 1)}}^T, \cdots, \mathbf{B_{d(g, nr)}}^T)$ for gene $g$’s $n{\text{r}}n_{\text{c}}$-dimensional bled code with each round L2-normalized. Assume that the rounds are statistically independent of each other. Then

$$ \mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] = \prod_r K \exp(\kappa \mathbf{fr} \cdot b{g,r}) = \tilde{K} \exp \left( \kappa \sum_r \mathbf{fr} \cdot \mathbf{B{d(g, r)}}\right) = \tilde{K} \exp(\kappa \mathbf{F \cdot b_g}) $$

By Bayes' Theorem:

$$ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \dfrac{\mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] \mathbb{P}[G = g]}{ \mathbb{P}[\mathbf{F} = \mathbf{f}]} $$

Assuming a uniform prior on the genes

$$ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \frac{\exp(\kappa \mathbf{b}_g \cdot \mathbf{f})}{\sum_g \exp(\kappa\mathbf{b}_g \cdot \mathbf{f} )} =\text{softmax}(\mathbf{\kappa U f})_g, $$

where $\mathbf{U}$ is a matrix of shape $(n_g, n_r n_c)$ where each row $U_g = \mathbf{b_g}$. This shows that the value of $\kappa$ has no effect on the gene ordering, but just on the spread of probabilities among genes. A value of 0 yields a uniform distribution of probabilities between all genes, while a very large value of $\kappa$ is approximately 1 for the gene with the maximum dot product and 0 for all others.

reillytilbury commented 1 month ago

2: Bleed Matrix Calculation

Now that we have some preliminary gene assignments, we can go about estimating the bleed matrix more accurately. Set some probability threshold $\alpha$ (by default, $\alpha = 0.9$). We define the following sets:

\mathcal{S} = \{ s : p(s) \geq \alpha \},
G_{d, r} = \{ g : d(g,r) = d \},
C_{d, r} = \{ \mathbf{F_{sr}} : s \in \mathcal{S}, \ g_s \in C_{d, r} \}

where:

By taking the union of $C_{d,r}$ across rounds, we end up with a set of reliable colour vector estimates for dye $d$:

\mathcal{C}_d = \bigcup_r C_{d, r}

Let $\mathbf{C}$ be the $(n_{\text{good spots}}, n_c)$ matrix form of the set $\mathcal{C}_d$. This just means each row of $\mathbf{C}$ corresponds to a good spot and each column corresponds to a channel.

We'd like to find a representative colour of the matrix $\mathbf{C}$. One way to do this would be to take the mean along rows of the matrix $\mathbf{C}$. Another, which we favour, is to compute the first singular vectors of $\mathbf{C}$. These are 2 vectors:

such that $||\boldsymbol{\eta}|| = ||\boldsymbol{\omega}|| = 1$ and

$$ C_{s, c} \ \approx \ \lambda \ \eta_s \ \omega_c $$

is as close as possible for all good spots $s$ and channels $c$ in the least-squares sense.

This method is preferable to the mean in that it nicely decomposes $\mathbf{C}$ into 2 components:

  1. A brightness for each spot $\eta_s$.
  2. A spot color $\boldsymbol{\omega}$ that each row is approximately a multiple of.

We then set the bleed matrix for dye $d$ to $\boldsymbol{\omega}$, ie: $\mathbf{B_d} = \boldsymbol{\omega}$.

reillytilbury commented 1 month ago

3: Bayes Mean for Bled Code Estimation

The purpose of this section is to compute an average $(n_r, nc)$ bled code for each gene $g$ that captures the relative brightness changes between rounds for each gene. We will make extensive use of the initial gene assignments, which are used to gather spots for each gene, and the bleed matrix we have computed, which forms the prior estimate of any fluorescence vector estimate. We will estimate tile-dependent bled codes, $\mathbf{D{gtrc}}$ and tile-independent bled codes, $\mathbf{E_{grc}}$. By comparing these, we will be able to correct for tile by tile variations in brightness.

Our method of estimating $\mathbf{E_{g,r}}$ (and $\mathbf{D_{g,t, r}}$):

bayes_mean_single The Bayes Mean biases the sample mean towards a prior vector. This is useful when the number of samples is small and we expect the points to have mean parallel to the prior vector.

Fix a gene $g$ and round $r$ and let $\mathbf{F_{1,r}}, \ldots, \mathbf{F_{n,r}}$ be the round $r$ fluorescence vectors of spots assigned to gene $g$ with high probability.

We'd like to find a representative vector for this data which captures the mean length, but some genes have very few spots so an average is noisy. We therefore bias our mean towards a scaled version of $\mathbf{B}_{d(g,r)}$.

To begin, assume the mean vector $\overline{\mathbf{F}} = \frac{1}{n} \sum_{i} \mathbf{F}_{i,r}$ is normally distributed and impose a normal prior on the space of possible means:

$$ \overline{\mathbf{F}} \sim \mathcal{N}(\boldsymbol{\mu}, I_{n_c}) $$

$$ \boldsymbol{\mu} \sim \mathcal{N}(\mathbf{B}_{d(g,r)}, \Sigma) $$

where

$$ \Sigma = \text{Diag}\left(\frac{1}{\alpha^2}, \frac{1}{\beta^2}, \ldots, \frac{1}{\beta^2}\right), $$

in the orthonormal basis

\mathbf{v}_1 = \mathbf{B}_{d(g,r)},
\mathbf{v}_2, \ldots, \mathbf{v}_n  \text{ orthogonal to } \mathbf{v}_1 .

We are going to let $\alpha << \beta$, which makes the set of probable means $\mathbf{\mu}$ an elongated ellipse along the axis spanned by $\mathbf{B}_{d(g,r) }$.

Define $\mathbf{Q} =\boldsymbol{\Sigma}^{-1}$ and $\mathbf{b} = \mathbf{B_{d(g,r)}}$. Since the normal is a conjugate prior, we know that the posterior $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is normal. Thus finding its mean is equivalent to finding its mode. The log-density of $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is given by

\begin{aligned}
l(\boldsymbol{\mu}) &= \log P(\boldsymbol{\mu}| \overline{\mathbf{F}} = \mathbf{f}) \\ \\
       &= \log P(\boldsymbol{\mu}) + \log P(\overline{\mathbf{F}} = \mathbf{f} | \boldsymbol{\mu}) + C \\ \\
       &= -\frac{1}{2} (\boldsymbol{\mu} - \mathbf{b})^T Q (\boldsymbol{\mu} - \mathbf{b}) - \frac{1}{2} (\boldsymbol{\mu} - \mathbf{f})^T (\boldsymbol{\mu} - \mathbf{f}) + C
\end{aligned}

This has derivative

$$ \frac{\partial l}{\partial \boldsymbol{\mu}} = -Q (\boldsymbol{\mu} - \boldsymbol{b}) - (\boldsymbol{\mu} - \mathbf{f}) $$

Setting this to $\mathbf{0}$, rearranging for $\boldsymbol{\mu}$ and using the fact that

Q \mathbf{v} = 
\begin{cases}
\alpha^2 \mathbf{v} & \text{if } \mathbf{v} = \lambda\mathbf{b} \\ \\
\beta^2 \mathbf{v} & \text{otherwise}
\end{cases}

we get

\begin{aligned}
\mathbf{\hat{\mu}} &= (Q + I)^{-1}(Q \mathbf{b} + \mathbf{f}) \\ \\ 
    &= (Q + I)^{-1}(\alpha^2 \mathbf{b} + \mathbf{f}) \\ \\
    &= (Q + I)^{-1}(\alpha^2 \mathbf{b} + (\mathbf{f} \cdot \mathbf{b})\mathbf{b} + \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b}) \\ \\
    &= (Q + I)^{-1}((\alpha^2 + \mathbf{f} \cdot \mathbf{b})\mathbf{b} + \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b}) \\ \\
&= \dfrac{(\alpha^2 + \mathbf{f} \cdot \mathbf{b})}{1 + \alpha^2} \mathbf{b} +
 \dfrac{1}{1+\beta^2} \bigg( \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b} \bigg)
\end{aligned}

Plugging in $\mathbf{f} = \frac{1}{n}\sumi \mathbf{F{i, r}}$ yields our estimate $\mathbf{\hat{\mu}}$

Bayes Mean Decreasing $\beta$ increases the component of the Bayes Mean $\boldsymbol{\hat{\mu}}$ perpendicular to the prior vector. The values of $\alpha^2$ and $\beta^2$ should be thought of as the number of spots needed to change the scale and direction respectively of the prior vector.

reillytilbury commented 1 month ago

4: Generation of Target Bled Codes $\mathbf{K_g}$

The purpose of this section is to try and remove systematic brightness differences between rounds and channels and to ensure that the dyes are well separated.

To begin, fix a round $r$ and channel $c$ and let $d_{max}(c)$ be the dye which is most intense in channel c. We define a target value $Td$ for each dye $d$ in its maximal channel $c{max}(d)$. Now let $G_{r,d}$ be the set of genes with dye $d$ in round $r$, and define the loss function

L(V_{r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g}} \  \bigg( V_{r, c} \ E_{g, r, c} - T_{d_{max}(c)} \bigg)^2,

where $N_g$ is the number of high probability spots assigned to gene $g$. There is no reason this has to be a square root, though if it is not, too much influence is given to the most frequent genes. Minimise this loss to obtain

V_{r, c} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc} T_{d_{max}(c)} } { \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc}^2 },

which is our optimal value!

scales The gene intensities for each round and channel plotted in cyan, and their scaled versions plotted in red, showing how they have been recentred around the target values.

scales_im (1) The target scale matrix shows that most of its job is boosting channel 15 in this case, but the amount it boosts these values is highly variable between rounds.

Now define the target bled codes

K_{g,r,c} = E_{g,r,c}V_{r,c},

which we will use instead of $E_{g,r,c}$ from here onwards.

reillytilbury commented 1 month ago

5: Regression of $\mathbf{D_{g,t}}$ against $\mathbf{K_g}$

The purpose of this section is to remove brightness differences between tiles, and improve the round and channel normalisation we found in the previous step. We do this by finding a scale factor $A_{t, r, c}$ such that

A_{t,r,c} D_{g, t, r, c} \approx K_{g, r, c},

where $\mathbf{D_{g, t,}}$ is the tile-dependent bled code for gene $g$ in tile $t$ defined in step 3 and $\mathbf{K_g}$ is the target bled code for gene $g$ defined in step 4.

Our method works in a similar way to step 4: fix a tile $t$, round $r$ and channel $c$ and as above, let $G_{r,d}$ be the genes with dye $d$ in round $r$. Define the loss

L(A_{t,r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g,t}} \  \bigg( A_{t,r, c} \ D_{g, t r, c} - K_{g, r, c} \bigg)^2,

where $N{g, t}$ is the number of high probability spots of gene $g$ in tile $t$. Remember that $K{g, r, c} = E{g, r, c}V{r, c},$ so writing this in full yields

L(A_{t,r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g,t}} \  \bigg( A_{t,r, c} \ D_{g, t r, c} - E_{g, r, c}V_{r, c} \bigg)^2.

This means that if $D{g,t,r,c} \approx E{g,r,c}$ for all genes $g$ then $A{t,r,c} \approx V{r,c}$. This means that $\mathbf{A}$ is correcting for tile differences. Then why does it have indices for $r$ and $c$? Because the way that the brightness varies between tiles may be completely independent for different round-channel pairs. This is addressed further in the diagnostics. Minimising this loss yields:

A_{t,r,c} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_{g, t}} \ K_{g,r,c}  D_{g, t r, c}} { \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_{g, t}}  D_{g, t r, c}^2 }.

homogeneous scale regression We can use the view_homogeneous_scale_regression function to view the regression for a single tile. Note that the regressions seem to have a high $r^2$ value and the slopes are significantly different even within channels.

different scales

reillytilbury commented 1 month ago

6 and 7: Application of Scales, Computation of Final Scores and Bleed Matrix

The purpose of this step is to bring all the components together and compute the final scores and bleed matrix.

Now that we have the homogeneous scale $A{t,r,c}$, we multiply it by the initial scale factor $P{t, r, c}$ to get our final colour normalisation factor

$$ A{t, r, c} \mapsto A{t, r, c} P_{t, r, c}.$$

This is important as all our calculations have been done on preprocessed spot colours which have already been multiplied by $\mathbf{P}$. We apply this scale to all of our spot colours $F$ by pointwise multiplication.

Next, we compute the final probabilities and dot product scores. We might ask at this point whether we should use:

  1. The tile-independent bled codes $E_{g,r,c}$,
  2. the tile-dependent bled codes $D_{g,t,r,c}$ or,
  3. the target bled codes $K_{g, r, c}$?

The answer is definitely the target bled codes $K_{g, r, c}$! We calculated $\mathbf{E}$ and $\mathbf{D}$ as important summary statistics. These act as representative samples for each gene in the case of $\mathbf{E}$ and for each gene and tile in the case of $\mathbf{D}$.

While it may seem more accurate to have a different gene code for each tile, the estimates in $\mathbf{D}$ are noisy due to small numbers of samples. Remember also that $\mathbf{A}$ was calculated specifically to maximise similarity of the tile-dependent codes $\mathbf{D}$ with the tile-independent target codes $\mathbf{K}$, meaning that multiplying spot colours by $\mathbf{A}$ has the dual effect of

  1. homogenising them across tiles and,
  2. bringing them all close to the target codes $\mathbf{K}$.

With that in mind, we compute:

  1. Final gene probabilities using the scaled spots $\mathbf{AF}$ and comparing against the target codes $\mathbf{K}$
  2. Final Dot Products using the scaled spots $\mathbf{AF}$ and comparing against the target codes $\mathbf{K}$. These would not have been accurate in step 1 as we had no model of how each gene varied in brightness between rounds, but now this is something we have accounted for in $\mathbf{K}$.
  3. The Final Bleed Matrix using the same method as discussed in step 2, but with updated gene probabilities.
reillytilbury commented 1 month ago

Call Spots Documentation

The Call Reference Spots section of the pipeline is a method of gene calling which runs quickly on a small set of spots ($\approx$ 50, 000 per tile) of the anchor image. Initially, this was our final mode of gene calling, but has since been superseded by OMP, which differs from Call Spots in that it runs on several more spots. That being said, the call spots section is still a crucial part of the pipeline as it estimates several important parameters used in the OMP section.

Some of the most important exported parameters of this section are:

Algorithm Flow Chart

call_spots_colours

Algorithm Breakdown

The inputs to the algorithm are:

0: Preprocessing

The purpose of this step is to convert our raw pixels from integers between -15,000 and 50,000 to floating points numbers. We want to minimise the influence of variable brightness between different tiles, rounds and channels, and get rid of background fluorescence as much as possible.

We transform the raw spot colours via the following function: $$F{s,r,c} \mapsto P{t(s), r, c}F{s,r,c} - \beta{s,c}.$$ In the formula above:

%TODO: Add an image of a spot before anything, after scaling, then after removing background

1: Initial Gene Assignment

The purpose of this step is to provide some gene assignments that will facilitate further calculations. This is necessary as there are many important variables (eg: tile-independent free bled codes for each gene $E{g,r,c}$ or the tile-dependent free bled codes $D{g,t,r,c}$) which cannot be calculated without sample spots of these genes.

The Naive Approach

The simplest way to do gene assignments would be:

  1. Create an initial bled code $\tilde{\mathbf{K}}_g = \mathbf{C_g B}$ for each gene $g$ by matrix multiplying the code matrix $\mathbf{C_g}$ for each gene $g$ with the bleed matrix $\mathbf{B}$.
  2. Compute a similarity matrix $\mathbf{Y}$ of shape $(n{\text{spots}} \times n{\text{genes}})$ of the cosine angle of each spot $s$ to each gene $g$ defined by
    \mathbf{Y_{sg}}  = \mathbf{F_s \cdot \tilde{K}_g} = \sum_{r,c}F_{s,r,c}\tilde{K}_{g,r,c}. 
  3. Define a gene assignment vector $\mathbf{G}$ of length $n_{\text{spots}}$ defined by $$Gs = \text{argmax} _g Y{sg}.$$

Unfortunately, the naive approach does not work well at all! The main reason for this is that the initial bled codes $\tilde{\mathbf{K}}_g$ are bad estimates of the true bled codes $\mathbf{K_g}$. To see why, note that step 1 above assumes that each gene's expected code is just a copy of the expected dye in that round. In equations, this is saying

\tilde{\mathbf{K}}_{g,r} = \mathbf{B_{d(g, r)}}.

However, due to random fluctuations in the concentrations of bridge probes for each gene in each round some genes appear systematically brighter/dimmer in some rounds! A much better model for the bled codes is

\tilde{\mathbf{K}}_{g, r} = \lambda_{gr} \mathbf{B_{d(g, r)}},

where $\lambda_{gr}$ is called the gene efficiency of gene $g$ in round $r$.

Since we have no prior estimate of $\lambda{gr}$, we need a method which will normalise each spot colour $F{src}$ in each round, thereby removing any systematic scaling between rounds before proceeding in a way similar to steps 2 and 3.

Probabilistic Gene Assignments using FVM

We are going to define a probability distribution of this spot $s$ belonging to dye $d$ in round $r$ that gets around the issues mentioned in the naive approach. Fix a spot $s$ and round $r$ and let $\mathbf{Fr}$ be this spot’s L2-normalized fluorescence vector (length $n{\text{c}}$). Let $\mathbf{Bd}$ be the L2-normalized fluorescence vector (length $n{\text{c}}$) of dye $d$.

We'd like to assign a probability that any unit vector belongs to dye $d$. The simplest non-uniform distribution defined on the sphere is the Fisher Von Mises distribution, which is just the restriction of a multivariate isotropic normal distribution to the unit sphere. This distribution has 2 parameters:

In our case $\boldsymbol{\mu}$ = $\mathbf{B_d}$ and $\kappa$ is a parameter that we leave arbitrary for the moment. Then the probability (density) of observing a fluorescence vector $\mathbf{f_r}$ given that that spot comes from dye $d$ is

$$\mathbb{P}[\mathbf{F_r} = \mathbf{f_r} \mid D = d] = M \exp(\kappa\mathbf{f_r} \cdot \mathbf{B_d}),$$

where $M$ is a normalization constant we don’t need to worry about.

Write $\mathbf{F} = ( \mathbf{F1}^T, \cdots, \mathbf{F{nr}}^T)$ for the $n{\text{r}}n_{\text{c}}$-dimensional spot code with each round L2-normalized and $\mathbf{bg} = (\mathbf{B{d(g, 1)}}^T, \cdots, \mathbf{B_{d(g, nr)}}^T)$ for gene $g$’s $n{\text{r}}n_{\text{c}}$-dimensional bled code with each round L2-normalized. Assume that the rounds are statistically independent of each other. Then

$$ \mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] = \prod_r M \exp(\kappa \mathbf{fr} \cdot \mathbf{b{g,r}}) = \tilde{M} \exp \left( \kappa \sum_r \mathbf{fr} \cdot \mathbf{B{d(g, r)}}\right) = \tilde{M} \exp(\kappa \mathbf{F \cdot b_g}) $$

By Bayes' Theorem:

$$ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \dfrac{\mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] \mathbb{P}[G = g]}{ \mathbb{P}[\mathbf{F} = \mathbf{f}]} $$

Assuming a uniform prior on the genes

$$ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \frac{\exp(\kappa \mathbf{b}_g \cdot \mathbf{f})}{\sum_g \exp(\kappa\mathbf{b}_g \cdot \mathbf{f} )} =\text{softmax}(\mathbf{\kappa U f})_g, $$

where $\mathbf{U}$ is a matrix of shape $(n_g, n_r n_c)$ where each row $U_g = \mathbf{b_g}$. This shows that the value of $\kappa$ has no effect on the gene ordering, but just on the spread of probabilities among genes. A value of 0 yields a uniform distribution of probabilities between all genes, while a very large value of $\kappa$ is approximately 1 for the gene with the maximum dot product and 0 for all others.

2: Bleed Matrix Calculation

The purpose of this step is to compute an updated estimate of the bleed matrix. To get a more accurate estimate of the colour spectrum for each dye $d$ we will find high probability spots for genes $g$ containing dye $d$ and then extract the rounds in which dye $d$ is supposed to be present. We will then use singular-value decomposition to get a representative vector for the set of samples.

Set some probability threshold $\alpha$ (by default, $\alpha = 0.9$). We define the following sets:

\mathcal{S} = \{ s : p(s) \geq \alpha \},
G_{d, r} = \{ g : d(g,r) = d \},
J_{d, r} = \{ \mathbf{F_{sr}} : s \in \mathcal{S}, \ g_s \in C_{d, r} \}

where:

By taking the union of $J_{d,r}$ across rounds, we end up with a set of reliable colour vector estimates for dye $d$:

\mathcal{J}_d = \bigcup_r J_{d, r}

Let $\mathbf{J}$ be the $(n_{\text{good spots}}, n_c)$ matrix form of the set $\mathcal{J}_d$. This just means each row of $\mathbf{J}$ corresponds to a good spot and each column corresponds to a channel.

We'd like to find a representative colour of the matrix $\mathbf{J}$. One way to do this would be to take the mean along rows of the matrix $\mathbf{J}$. Another, which we favour, is to compute the first singular vectors of $\mathbf{J}$. These are 2 vectors:

such that $||\boldsymbol{\eta}|| = ||\boldsymbol{\omega}|| = 1$ and

$$ J_{s, c} \ \approx \ \lambda \ \eta_s \ \omega_c $$

is as close as possible for all good spots $s$ and channels $c$ in the least-squares sense.

This method is preferable in that it assumes that rows of $\mathbf{J}$ are scaled versions of the same vector, and tries to find the best vector fitting this model. We get a nice decomposition of $\mathbf{J}$ into 2 components:

  1. A brightness for each spot $\eta_s$.
  2. A spot color $\boldsymbol{\omega}$ that each row is approximately a multiple of.

We then set the bleed matrix for dye $d$ to $\boldsymbol{\omega}$, ie: $\mathbf{B_d} = \boldsymbol{\omega}$, which is a normalised fluorescence vector for dye $d$.

3: Bayes Mean for Bled Code Estimation

Terminology:

The purpose of this step is to compute normalised free bled codes $\mathbf{bg}$ for each gene $g$. We will make extensive use of the initial gene assignments and the bleed matrix we have computed, which forms the prior estimate of any fluorescence vector estimate. We will estimate tile-dependent free bled codes, $\mathbf{D{gtrc}}$ and tile-independent free bled codes, $\mathbf{E_{grc}}$. By comparing these, we will be able to correct for tile by tile variations in brightness in step 5.

Our method of estimating $\mathbf{E_{g,r}}$ (and $\mathbf{D_{g,t, r}}$ just on a smaller set of sample spots):

bayes_mean_single The Bayes Mean biases the sample mean towards a prior vector. This is useful when the number of samples is small and we expect the points to have mean parallel to the prior vector.

The Parallel Biased Bayes Mean

Fix a gene $g$ and round $r$ and let $\mathbf{F_{1,r}}, \ldots, \mathbf{F_{n,r}}$ be the round $r$ fluorescence vectors of spots assigned to gene $g$ with high probability.

We'd like to find a representative vector for this data which captures the mean length, but some genes have very few spots so an average is noisy. We therefore bias our mean towards a scaled version of $\mathbf{B}_{d(g,r)}$.

To begin, assume the mean vector $\overline{\mathbf{F}} = \frac{1}{n} \sum_{i} \mathbf{F}_{i,r}$ is normally distributed and impose a normal prior on the space of possible means:

$$ \overline{\mathbf{F}} \sim \mathcal{N}(\boldsymbol{\mu}, I_{n_c}) $$

$$ \boldsymbol{\mu} \sim \mathcal{N}(\mathbf{B}_{d(g,r)}, \Sigma) $$

where

$$ \Sigma = \text{Diag}\left(\frac{1}{\alpha^2}, \frac{1}{\beta^2}, \ldots, \frac{1}{\beta^2}\right), $$

in the orthonormal basis

\mathbf{v}_1 = \mathbf{B}_{d(g,r)},
\mathbf{v}_2, \ldots, \mathbf{v}_n  \text{ orthogonal to } \mathbf{v}_1.

The parameters $\alpha$ and $\beta$, being inverse variances, are concentration parameters. We are going to let $\alpha << \beta$, which means the set of probable means is very unconcentrated along the line $\lambda \mathbf{B}_{d(g,r) }$, but very concentrated perpendicular to this. Another way of saying this is that our set of probable means under the prior distribution on $\mathbf{\mu}$ will be an elongated ellipsoid along the axis spanned by $\mathbf{B}_{d(g,r) }$.

Define $\mathbf{Q} =\boldsymbol{\Sigma}^{-1}$ and $\mathbf{b} = \mathbf{B_{d(g,r)}}$. Since the normal is a conjugate prior when the data is normal, we know that the posterior $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is normal. Thus finding its mean is equivalent to finding its mode. To find its mode we will find the zero of the derivative of its log-density. The log-density of $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is given by

\begin{aligned}
l(\boldsymbol{\mu}) &= \log P(\boldsymbol{\mu}| \overline{\mathbf{F}} = \mathbf{f}) \\ \\
       &= \log P(\boldsymbol{\mu}) + \log P(\overline{\mathbf{F}} = \mathbf{f} | \boldsymbol{\mu}) + C \\ \\
       &= -\frac{1}{2} (\boldsymbol{\mu} - \mathbf{b})^T Q (\boldsymbol{\mu} - \mathbf{b}) - \frac{1}{2} (\boldsymbol{\mu} - \mathbf{f})^T (\boldsymbol{\mu} - \mathbf{f}) + C
\end{aligned}

This has derivative

$$ \frac{\partial l}{\partial \boldsymbol{\mu}} = -Q (\boldsymbol{\mu} - \boldsymbol{b}) - (\boldsymbol{\mu} - \mathbf{f}) $$

Setting this to $\mathbf{0}$, rearranging for $\boldsymbol{\mu}$ and using the fact that

Q \mathbf{v} = 
\begin{cases}
\alpha^2 \mathbf{v} & \text{if } \mathbf{v} = \lambda\mathbf{b} \\ \\
\beta^2 \mathbf{v} & \text{otherwise}
\end{cases}

we get

\begin{aligned}
\mathbf{\hat{\mu}} &= (Q + I)^{-1}(Q \mathbf{b} + \mathbf{f}) \\ \\ 
    &= (Q + I)^{-1}(\alpha^2 \mathbf{b} + \mathbf{f}) \\ \\
    &= (Q + I)^{-1}(\alpha^2 \mathbf{b} + (\mathbf{f} \cdot \mathbf{b})\mathbf{b} + \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b}) \\ \\
    &= (Q + I)^{-1}((\alpha^2 + \mathbf{f} \cdot \mathbf{b})\mathbf{b} + \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b}) \\ \\
&= \dfrac{(\alpha^2 + \mathbf{f} \cdot \mathbf{b})}{1 + \alpha^2} \mathbf{b} +
 \dfrac{1}{1+\beta^2} \bigg( \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b} \bigg)
\end{aligned}

Plugging in $\mathbf{f} = \frac{1}{n}\sumi \mathbf{F{i, r}}$ yields our estimate $\mathbf{\hat{\mu}}$

Bayes Mean Decreasing $\beta$ increases the component of the Bayes Mean $\boldsymbol{\hat{\mu}}$ perpendicular to the prior vector. The values of $\alpha^2$ and $\beta^2$ should be thought of as the number of spots needed to change the scale and direction respectively of the prior vector.

4: Generation of Target Bled Codes $\mathbf{K_g}$

The purpose of this step is to try and remove systematic brightness differences between rounds and channels and ensure that the dyes are well separated.

To begin, fix a round $r$ and channel $c$ and let $d_{max}(c)$ be the dye which is most intense in channel $c$. We define a target value $Td$ for each dye $d$ in its maximal channel $c{max}(d)$. Now let $G_{r,d}$ be the set of genes with dye $d$ in round $r$, and define the loss function

L(V_{r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g}} \  \bigg( V_{r, c} \ E_{g, r, c} - T_{d_{max}(c)} \bigg)^2,

where $N_g$ is the number of high probability spots assigned to gene $g$. There is no reason this has to be a square root, though if it is not, too much influence is given to the most frequent genes. Minimise this loss to obtain

V_{r, c} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc} T_{d_{max}(c)} } { \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc}^2 },

which is our optimal value!

scales The gene intensities for each round and channel plotted in cyan, and their scaled versions plotted in red, showing how they have been recentred around the target values.

scales_im (1) The target scale matrix shows that most of its job is boosting channel 15 in this case, but the amount it boosts these values is highly variable between rounds.

Now define the constrained bled codes, which we will just call bled codes

K_{g,r,c} = E_{g,r,c}V_{r,c}.

We will use these instead of $E_{g,r,c}$ from here onwards.

5: Regression of $\mathbf{D_{g,t}}$ against $\mathbf{K_g}$

The purpose of this step is to remove brightness differences between tiles, and improve the round and channel normalisation we found in the previous step. We do this by finding a scale factor $A_{t, r, c}$ such that

A_{t,r,c} D_{g, t, r, c} \approx K_{g, r, c},

where $\mathbf{D_{g, t,}}$ is the tile-dependent bled code for gene $g$ in tile $t$ defined in step 3 and $\mathbf{K_g}$ is the target bled code for gene $g$ defined in step 4.

Our method works in a similar way to step 4: fix a tile $t$, round $r$ and channel $c$ and as above, let $G_{r,d}$ be the genes with dye $d$ in round $r$. Define the loss

L(A_{t,r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g,t}} \  \bigg( A_{t,r, c} \ D_{g, t r, c} - K_{g, r, c} \bigg)^2,

where $N{g, t}$ is the number of high probability spots of gene $g$ in tile $t$. Remember that $K{g, r, c} = E{g, r, c}V{r, c},$ so writing this in full yields

L(A_{t,r, c}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g,t}} \  \bigg( A_{t,r, c} \ D_{g, t r, c} - E_{g, r, c}V_{r, c} \bigg)^2.

This means that if $D{g,t,r,c} \approx E{g,r,c}$ for all genes $g$ then $A{t,r,c} \approx V{r,c}$. This means that $\mathbf{A}$ is correcting for tile differences. Then why does it have indices for $r$ and $c$? Because the way that the brightness varies between tiles may be completely independent for different round-channel pairs. This is addressed further in the diagnostics. Minimising this loss yields:

A_{t,r,c} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_{g, t}} \ K_{g,r,c}  D_{g, t r, c}} { \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_{g, t}}  D_{g, t r, c}^2 }.

homogeneous scale regression We can use the view_homogeneous_scale_regression function to view the regression for a single tile. Note that the regressions seem to have a high $r^2$ value and the slopes are significantly different even within channels.

different scales

6 and 7: Application of Scales, Computation of Final Scores and Bleed Matrix

The purpose of this step is to bring all the components together and compute the final scores and bleed matrix.

Now that we have the homogeneous scale $A{t,r,c}$, we multiply it by the initial scale factor $P{t, r, c}$ to get our final colour normalisation factor

$$ A{t, r, c} \mapsto A{t, r, c} P_{t, r, c}.$$

This is important as all our calculations have been done on preprocessed spot colours which have already been multiplied by $\mathbf{P}$. We apply this scale to all of our spot colours $F$ by pointwise multiplication.

Next, we compute the final probabilities and dot product scores. We might ask at this point whether we should use:

  1. The tile-independent free bled codes $E_{g,r,c}$,
  2. the tile-dependent free bled codes $D_{g,t,r,c}$ or,
  3. the constrained bled codes $K_{g, r, c}$?

The best answer is definitely the target bled codes $K_{g, r, c}$! We calculated $\mathbf{E}$ and $\mathbf{D}$ as important summary statistics which act as representative samples for each gene in the case of $\mathbf{E}$, and for each gene and tile in the case of $\mathbf{D}$. These were only computed to facilitate the computations of $\mathbf{A}$ and $\mathbf{K}$.

While it may seem more accurate to have a different gene code for each tile, the estimates $\mathbf{D_{g, t}}$ are noisy due to small numbers of samples. The colour normalisation factor $\mathbf{A}$ was calculated specifically to maximise similarity of the tile-dependent free codes $\mathbf{D}$ with the tile-independent constrained codes $\mathbf{K}$, meaning that multiplying spot colours by $\mathbf{A}$ has the dual effect of

  1. homogenising them across tiles and,
  2. bringing them all close to the constrained codes $\mathbf{K}$.

With that in mind, we compute:

  1. Final gene probabilities using the scaled spots $\mathbf{AF}$ and comparing against the constrained codes $\mathbf{K}$
  2. Final Dot Products using the scaled spots $\mathbf{AF}$ and comparing against the constrained codes $\mathbf{K}$. These would not have been accurate in step 1 as we had no model of how each gene varied in brightness between rounds, but now this is something we have accounted for in $\mathbf{K}$.
  3. The Final Bleed Matrix using the same method as discussed in step 2, but with updated gene probabilities.