Automate unsupervised machine learning cell type identification using both protein expressions and cell spatial neighborhood information for multiplexed in situ imaging data. No training dataset with cell type labels is required.
Apache License 2.0
27
stars
9
forks
source link
Cell assignment bug in second and later rounds #25
In the second and later rounds of cell assignment, if the cell types that will be assigned in this round belong to more than one type in the previous round, the code will wrongly compare the cells' previous types, thus causing errors in cell assignment.
For example, the following cell type table (replacement of prior_marker_info.csv) results in the error. If the BarA cell type is removed from the cell type table, the error does not occur.
Cell
Lineage Level
Foo
1_0_1
Bar
1_0_2
FooA
2_1_3
FooB
2_1_4
BarA
2_2_5
Explanation
This bug happens in the FindCellsToCheck function that finds all unassigned cells at the beginning of every round.
current_cell_type_assignment[, previous_level_round] is a list with a length of all cells, and previous_level_round is a list with a length of previous cell types. Using == to compare these two lists will return a boolean list representing the corresponding relationship between every element in the two lists, rather than the inclusion relationship between each cell's previous type and the intended types (the previous_level_type). If the length of previous_level_round equals 1 (there is only one cell type being assigned in the current round, like in the default prior_marker_info.csv file in the repo), the check works as expected. But if the length is longer than 1, R will try to "broadcast" previous_level_type to match current_cell_type_assignment[, previous_level_round], and get the elements corresponding equality as comparison result.
This can be demonstrated by a simple example:
c1 <- c(1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 1)
c2 <- c(1, 2)
print(c1 == c2) # you may get a warning here
# TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
c3 <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)
print(c1 == c3) # same as c1 == c2
When attempting to assign cells with multiple previous types in one round, it will effectively return TRUE with a random probability of 1/N, where N is the number of previous cell types being assigned in the current round. This comparison results in much fewer cells being assigned in the round than expected.
Solution
To fix this bug, the == operator should be changed to %in%, which correctly compares the cell types to previous_level_type.
The examples given in the code all have the assignment rounds as "serialized" rounds, with each round only assigning a single cell type. The above bug only occurs if multiple cell types are assigned within a single round.
The code appears to run fine when assigning multiple cell types within a single round, but I would like to inquire if there are any issues that you can foresee when doing so from a statistical standpoint.
Cell assignment bug in second and later rounds
Problem
In the second and later rounds of cell assignment, if the cell types that will be assigned in this round belong to more than one type in the previous round, the code will wrongly compare the cells' previous types, thus causing errors in cell assignment.
For example, the following cell type table (replacement of
prior_marker_info.csv
) results in the error. If theBarA
cell type is removed from the cell type table, the error does not occur.Explanation
This bug happens in the
FindCellsToCheck
function that finds all unassigned cells at the beginning of every round.current_cell_type_assignment[, previous_level_round]
is a list with a length of all cells, andprevious_level_round
is a list with a length of previous cell types. Using==
to compare these two lists will return a boolean list representing the corresponding relationship between every element in the two lists, rather than the inclusion relationship between each cell's previous type and the intended types (theprevious_level_type
). If the length ofprevious_level_round
equals 1 (there is only one cell type being assigned in the current round, like in the defaultprior_marker_info.csv
file in the repo), the check works as expected. But if the length is longer than 1, R will try to "broadcast"previous_level_type
to matchcurrent_cell_type_assignment[, previous_level_round]
, and get the elements corresponding equality as comparison result.This can be demonstrated by a simple example:
When attempting to assign cells with multiple previous types in one round, it will effectively return
TRUE
with a random probability of 1/N, where N is the number of previous cell types being assigned in the current round. This comparison results in much fewer cells being assigned in the round than expected.Solution
To fix this bug, the
==
operator should be changed to%in%
, which correctly compares the cell types toprevious_level_type
.The effect of
%in%
can also be validated by this exampleQuestion
The examples given in the code all have the assignment rounds as "serialized" rounds, with each round only assigning a single cell type. The above bug only occurs if multiple cell types are assigned within a single round.
The code appears to run fine when assigning multiple cell types within a single round, but I would like to inquire if there are any issues that you can foresee when doing so from a statistical standpoint.