labsyspharm / quantification

Quantification module for mcmicro
https://github.com/labsyspharm/mcmicro

[Major bug] Mismatch of cells when quantifying multiple masks #28

Closed ArtemSokolov closed 3 years ago

ArtemSokolov commented 3 years ago

Consider exemplar-001 processed with:

nextflow run labsyspharm/mcmicro --in /path/to/exemplar-001 --stop-at segmentation --s3seg-opts '--segmentCytoplasm segmentCytoplasm --cytoDilation 3 --cytoMethod ring'

This generates separate segmentation masks for cells, nuclei and cytoplasm. Next, we consider three separate approaches to quantifying these masks: the nuclei mask alone (written to nuclOnly.csv), the cytoplasm mask alone (written to cytoOnly.csv), and both masks together in a single feature table (written to both.csv).

Because each segmentation mask contains a different number of cells, nuclOnly.csv and cytoOnly.csv contain 9,753 and 9,744 cells, respectively. This raises the question of how cells that are missing from one mask are handled when the two masks are combined in both.csv.

For example, here's a slice of nuclOnly.csv:

 CellID X_centroid Y_centroid CD357_nucl
  <int>      <dbl>      <dbl>      <dbl>
    118       927.       719.      1221.
    119      1289.       717.       951.
    120      1452.       721.      2030.
    121      1205.       718.       875.

Cell 119 is not captured by the cytoplasm mask, and the corresponding slice of cytoOnly.csv is:

 CellID X_centroid Y_centroid CD357_cyto
  <int>      <dbl>      <dbl>      <dbl>
    118       926.       720.      1156.
    120      1452.       719.      1547.
    121      1206.       721.       914.

So far, so good. Other than cell 119, there is a direct correspondence between the two feature tables when the tables are written to separate files. Furthermore, the correlation between CD357 expression in nuclOnly.csv and cytoOnly.csv is 0.826.

However, when the two masks are quantified together into a single feature table, CellIDs become mismatched. The corresponding slice in both.csv looks as follows:

 CellID X_centroid Y_centroid CD357_nucl CD357_cyto
  <int>      <dbl>      <dbl>      <dbl>      <dbl>
    118       927.       719.      1221.      1156.
    119      1289.       717.       951.      1547.     <-- Compare CD357_cyto to CellID 120 above
    120      1452.       721.      2030.       914.
    121      1205.       718.       875.      1334.

The value of CD357_cyto for CellID 119 was taken from cell 120 in cytoOnly.csv instead of being marked as NA. Every subsequent row is shifted in the same way, and all NA values end up congregated at the bottom of the table.
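The shift can be reproduced on the four-row slice alone. The following is a minimal Python sketch, not the module's actual code: `combine_by_position` and `combine_by_cell_id` are hypothetical helpers, and the assumption is that the bug amounts to pairing rows by position rather than by CellID. Because only the slice is used, the last row comes out as missing instead of picking up the 1334 seen in the full both.csv.

```python
# CD357 values copied from the nuclOnly.csv and cytoOnly.csv slices above.
nucl = {118: 1221.0, 119: 951.0, 120: 2030.0, 121: 875.0}
cyto = {118: 1156.0, 120: 1547.0, 121: 914.0}  # cell 119 is absent

def combine_by_position(primary, secondary):
    """Hypothetical reconstruction of the bug: pair rows by order, pad the tail."""
    values = list(secondary.values())
    padded = values + [None] * (len(primary) - len(values))
    return {cid: (p, s) for (cid, p), s in zip(primary.items(), padded)}

def combine_by_cell_id(primary, secondary):
    """Look up each CellID explicitly; cells missing from a mask become None."""
    return {cid: (p, secondary.get(cid)) for cid, p in primary.items()}

by_position = combine_by_position(nucl, cyto)
by_cell_id = combine_by_cell_id(nucl, cyto)

for cid in nucl:
    print(cid, by_position[cid], by_cell_id[cid])
```

With positional pairing, CellID 119 receives the cytoplasm value of cell 120 and the missing value lands on the last row. Keying on CellID leaves 119 marked as missing and keeps cells 120 and 121 aligned with cytoOnly.csv.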

Because of this "shift" in the cell identities, the correlation between CD357_nucl and CD357_cyto in both.csv is 0.045, which is a substantial drop from 0.826 when the masks were written to separate files.
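The correlation drop is an indirect symptom. A more direct check compares each value in the combined table against the per-mask table for the same CellID. The sketch below does this for the CD357_cyto column of the slices shown above; `find_mismatched_cells` and its `tol` parameter are hypothetical names, and the tolerance is only there because the printed values are rounded.

```python
import math

# CD357_cyto values copied from the both.csv and cytoOnly.csv slices above.
both_cyto = {118: 1156.0, 119: 1547.0, 120: 914.0, 121: 1334.0}
cyto_only = {118: 1156.0, 120: 1547.0, 121: 914.0}

def find_mismatched_cells(combined, reference, tol=0.5):
    """Return CellIDs whose combined value disagrees with the per-mask table."""
    mismatched = []
    for cell_id, value in combined.items():
        expected = reference.get(cell_id)
        if expected is None:
            # Cell absent from this mask: the combined value should be missing.
            if value is not None:
                mismatched.append(cell_id)
        elif value is None or not math.isclose(value, expected, abs_tol=tol):
            mismatched.append(cell_id)
    return mismatched

mismatched = find_mismatched_cells(both_cyto, cyto_only)
print(mismatched)  # [119, 120, 121]
```

Only cell 118, which precedes the first missing cell, passes the check. Every row from 119 onward is flagged.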

Conclusion: Cells are mismatched when multiple masks are quantified and results combined into a single file. This further motivates the implementation of #26 as a top priority.

ArtemSokolov commented 3 years ago

Addressed in #29