plevritis-lab / CELESTA

Automate unsupervised machine learning cell type identification using both protein expressions and cell spatial neighborhood information for multiplexed in situ imaging data. No training dataset with cell type labels is required.
Apache License 2.0
27 stars 9 forks source link

Cell assignment bug in second and later rounds #25

Open Sz-Jason-Chen opened 2 months ago

Sz-Jason-Chen commented 2 months ago

Cell assignment bug in second and later rounds

Problem

In the second and later rounds of cell assignment, if the cell types that will be assigned in this round belong to more than one type in the previous round, the code will wrongly compare the cells' previous types, thus causing errors in cell assignment.

For example, the following cell type table (replacement of prior_marker_info.csv) results in the error. If the BarA cell type is removed from the cell type table, the error does not occur.

Cell Lineage Level
Foo 1_0_1
Bar 1_0_2
FooA 2_1_3
FooB 2_1_4
BarA 2_2_5

Explanation

This bug happens in the FindCellsToCheck function that finds all unassigned cells at the beginning of every round.

FindCellsToCheck <- function(current_cell_type_assignment, lineage_info, cell_ID, round) 
{
  if (round == 1) {
    unassigned_cells <- cell_ID[which(current_cell_type_assignment[, round] == 0)]
  } else {
    previous_level_type <- unique(lineage_infoPrevious_cell_type[which(lineage_info$Round == round)])
    previous_level_round <- unique(lineage_info$Round[which
      (lineage_info$Cell_type_number == previous_level_type)])
    unassigned_cells <- cell_ID[which( current_cell_type_assignment[, round] == 0 &
      (current_cell_type_assignment[, previous_level_round] == previous_level_type))]
  }
  return(unassigned_cells)
}

current_cell_type_assignment[, previous_level_round] is a list with a length of all cells, and previous_level_round is a list with a length of previous cell types. Using == to compare these two lists will return a boolean list representing the corresponding relationship between every element in the two lists, rather than the inclusion relationship between each cell's previous type and the intended types (the previous_level_type). If the length of previous_level_round equals 1 (there is only one cell type being assigned in the current round, like in the default prior_marker_info.csv file in the repo), the check works as expected. But if the length is longer than 1, R will try to "broadcast" previous_level_type to match current_cell_type_assignment[, previous_level_round], and get the elements corresponding equality as comparison result.

This can be demonstrated by a simple example:

c1 <- c(1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 1)
c2 <- c(1, 2)
print(c1 == c2)  # you may get a warning here
# TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
c3 <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)
print(c1 == c3)  # same as c1 == c2

When attempting to assign cells with multiple previous types in one round, it will effectively return TRUE with a random probability of 1/N, where N is the number of previous cell types being assigned in the current round. This comparison results in much fewer cells being assigned in the round than expected.

Solution

To fix this bug, the == operator should be changed to %in%, which correctly compares the cell types to previous_level_type.

unassigned_cells <- cell_ID[which( current_cell_type_assignment[, round] == 0 &
      (current_cell_type_assignment[, previous_level_round] %in% previous_level_type))]

The effect of %in% can also be validated by this example

c1 <- c(1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 1)
c2 <- c(1, 2)
print(c1 %in% c2)
# TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE

Question

The examples given in the code all have the assignment rounds as "serialized" rounds, with each round only assigning a single cell type. The above bug only occurs if multiple cell types are assigned within a single round.

The code appears to run fine when assigning multiple cell types within a single round, but I would like to inquire if there are any issues that you can foresee when doing so from a statistical standpoint.