labsyspharm / quantification

Quantification module for mcmicro
https://github.com/labsyspharm/mcmicro
Attempted fix for Issues 26 and 28 by writing CSVs to different files #29

Closed JoshuaHess12 closed 3 years ago

JoshuaHess12 commented 3 years ago

Adjusted code to write separate CSVs for each input mask rather than concatenating quantification output into a single CSV file.

ArtemSokolov commented 3 years ago

Thank you, @JoshuaHess12. I will test this today.

ArtemSokolov commented 3 years ago

This definitely addresses #26. However, #28 still seems to be an issue.

When doing --masks cytoRingMask.tif by itself, cell 2171 doesn't get quantified because it has zero area (which is correct):

  CellID X_centroid Y_centroid  FDX1 CD357  CD1D
    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>
 1   2170      1003.      1301. 2247. 1483. 1064.
 2   2172      1570.      1304. 3153. 1322. 1337.
 3   2173       675.      1304. 3661. 1151. 1253.

However, when quantifying multiple masks with --masks nucleiRingMask.tif cytoRingMask.tif, cells 2172 onward appear to be shifted up, which causes a mismatch between the expression columns and the cell position:

   CellID X_centroid Y_centroid  FDX1 CD357  CD1D
    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>
 1   2170      1001.      1301. 2247. 1483. 1064.
 2   2171      1040.      1299. 3153. 1322. 1337.  <-- The expression of FDX1, CD357 and CD1D is from Cell 2172 above
 3   2172      1570.      1303. 3661. 1151. 1253.  <-- The expression of FDX1, CD357 and CD1D is from Cell 2173 above
 4   2173       676.      1304. 2779. 1468. 1096.  <-- etc.

It seems that there is "cross-talk" between masks, where the cell position is taken from nucleiRingMask, while the expression is taken from cytoRingMask. Ideally, each mask should be quantified in isolation, without any merging or concatenation against other masks.
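To make the suspected mismatch concrete, here is a small sketch that pairs CellIDs and expression values by row position, using the FDX1 values from the tables above. It assumes the IDs really are attached positionally; the variable names are illustrative and do not come from the quantification code.

```python
# CellIDs as they appear in the multi-mask output (every nucleus is present).
ids_from_first_mask = [2170, 2171, 2172, 2173]

# FDX1 means measured on cytoRingMask, keyed by their true CellID.
# Cell 2171 is absent because its cytoplasm ring has zero area.
cyto_fdx1 = {2170: 2247.09, 2172: 3153.31, 2173: 3660.94, 2174: 2779.16}

# Positional pairing: row i of the ID column meets row i of the expression table.
paired = list(zip(ids_from_first_mask, cyto_fdx1.values()))

# Rows whose expression value does not belong to the CellID they were given.
mislabeled = [(cid, value) for cid, value in paired if cyto_fdx1.get(cid) != value]

for cid, value in mislabeled:
    print(f"CellID {cid} carries FDX1={value}, which belongs to another cell")
```

Every row after the missing cell ends up with a neighbor's expression values, which matches the shift seen in the second table.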

Steps to reproduce:

  1. Ensure Nextflow and Docker are installed
  2. Download the exemplar: nextflow run labsyspharm/mcmicro/exemplar.nf --name exemplar-001 --path .
  3. Generate segmentation masks: nextflow run labsyspharm/mcmicro --in ./exemplar-001 --stop-at segmentation --s3seg-opts '--segmentCytoplasm segmentCytoplasm --cytoDilation 3 --cytoMethod ring'
  4. Quantify cytoRingMask only:
    cd exemplar-001/
    mkdir cytoOnly
    python CommandSingleCellExtraction.py \
    --image registration/exemplar-001.ome.tif \
    --masks segmentation/unmicst-exemplar-001/cytoRingMask.tif \
    --channel_names markers.csv \
    --output cytoOnly
  5. Quantify both masks:
    mkdir both
    python CommandSingleCellExtraction.py \
    --image registration/exemplar-001.ome.tif \
    --masks segmentation/unmicst-exemplar-001/nucleiRingMask.tif segmentation/unmicst-exemplar-001/cytoRingMask.tif \
    --channel_names markers.csv \
    --output both
  6. Compare the expression of markers for cells 2172 and 2173:
    
    $ sed -n -e 1p -e '2171,2174p' cytoOnly/exemplar-001_cytoRingMask.csv | cut -d ',' -f 1,11-15 | \
    sed "s/,/\t/g" | sed 's/\(\.[0-9][0-9]\)[0-9]*/\1/g'

    CellID  FDX1     CD357    CD1D     X_centroid  Y_centroid
    2170    2247.09  1482.88  1064.37  1003.11     1301.05
    2172    3153.31  1322.06  1337.24  1569.82     1303.80
    2173    3660.94  1150.97  1252.94  675.15      1304.17
    2174    2779.16  1468.5   1096.03  815.46      1301.94

    $ sed -n -e 1p -e '2171,2174p' both/exemplar-001_cytoRingMask.csv | cut -d ',' -f 1,11-15 | \
    sed "s/,/\t/g" | sed 's/\(\.[0-9][0-9]\)[0-9]*/\1/g'

    CellID  FDX1     CD357    CD1D     X_centroid  Y_centroid
    2170    2247.09  1482.88  1064.37  1000.95     1301.36
    2171    3153.31  1322.06  1337.24  1040.01     1298.77
    2172    3660.94  1150.97  1252.94  1569.5      1303.27
    2173    2779.16  1468.5   1096.03  675.52      1304.18

ArtemSokolov commented 3 years ago

Following up on the above, the likely culprit is in the following:

Here, IDs are extracted from the first mask: https://github.com/JoshuaHess12/quantification/blob/6c4addabd5888397eb38cbf4a360171b28edede3/SingleCellDataExtraction.py#L133

but then get concatenated to all other tables: https://github.com/JoshuaHess12/quantification/blob/6c4addabd5888397eb38cbf4a360171b28edede3/SingleCellDataExtraction.py#L151

This concatenation assumes that the same set of cells is present in every mask. Unfortunately, this assumption is violated when a cell has zero area (as in the cytoplasm example above). A suggested fix is to fully isolate the processing of a single mask file, including the extraction of Cell IDs. The outer loop can then call the corresponding function with a single mask at a time, which will ensure that no "cross-talk" between masks happens.
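A rough sketch of what that isolation could look like, using plain NumPy on toy arrays rather than the actual code in SingleCellDataExtraction.py. The function name `quantify_mask` and the data are hypothetical; the point is only that the IDs and the measurements come from the same mask inside one call.

```python
import numpy as np

def quantify_mask(mask, image):
    """Quantify one mask in isolation.

    Cell IDs are read from this mask's own labels, so they always line up
    with the intensities measured under the same mask.
    """
    ids = np.unique(mask)
    ids = ids[ids != 0]  # label 0 is background
    means = np.array([image[mask == cell_id].mean() for cell_id in ids])
    return ids, means

# Toy data: cell 2 has a nucleus but no cytoplasm ring (zero area).
nuclei = np.array([[1, 1, 0, 2, 2, 0, 3, 3]])
cyto = np.array([[1, 0, 0, 0, 0, 0, 3, 0]])
image = np.array([[10.0, 10.0, 0.0, 20.0, 20.0, 0.0, 30.0, 30.0]])

# The outer loop handles one mask at a time; nothing is shared between calls.
results = {name: quantify_mask(mask, image)
           for name, mask in {"nuclei": nuclei, "cyto": cyto}.items()}
```

With this structure the cytoplasm result simply skips ID 2 instead of inheriting it from the nuclei mask.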

JoshuaHess12 commented 3 years ago

I think the processing of each mask is already uncoupled in the for loop -- there isn't any crosstalk between the masks with the way this pull request exports the CSVs. The CellIDs are mismatched because regionprops in Python automatically enumerates the CellIDs for us by sweeping from left to right across the image. If there is no cytoplasm object for a cell, then all the other CellIDs for the cytoplasm mask will be shifted up by a value of one in the CellID column of the cytoplasm CSV compared to the nucleus CSV file.

I think one way to fix this would be to do a 1-nearest neighbor assignment from the other CSV files to the nuclei CSV file based on their spatial coordinates. If we assume that the cytoplasm of each cell is always going to be closest to its own nucleus then this may work. We could relabel all other CellID rows in the mismatched CSVs according to the index of their nearest neighbor in the nuclei CSV.

JoshuaHess12 commented 3 years ago

Wait, you may be right @ArtemSokolov . Sorry about that. I will look at this a little more.

ArtemSokolov commented 3 years ago

Thanks for looking into it, @JoshuaHess12

> The CellIDs are mismatched because regionprops in Python automatically enumerates the CellIDs for us by sweeping from left to right across the image.

So, I actually had this concern before also, but I verified with Clarence that regionprops() extracts Cell IDs directly from the mask file, and the upstream segmentation module ensures that Cell IDs match between nucleus and cytoplasm masks, even if some cells are not captured by one of those masks. This is why we see skipped IDs, like in this example.

  CellID X_centroid Y_centroid  FDX1 CD357  CD1D
    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>
 1   2170      1003.      1301. 2247. 1483. 1064.
 2   2172      1570.      1304. 3153. 1322. 1337.
 3   2173       675.      1304. 3661. 1151. 1253.

I think the end goal is just to ensure that the output exemplar-001_cytoRingMask.csv is the same, regardless of whether the user calls the tool with --masks cytoRingMask.tif alone or jointly with --masks nucleiRingMask.tif cytoRingMask.tif.
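One way to check that goal automatically is to compare the two output files row by row. This is a minimal sketch using only the standard library; `same_csv` is a hypothetical helper and not part of the quantification module.

```python
import csv

def same_csv(path_a, path_b):
    """Return True if two CSV files contain identical rows in the same order."""
    with open(path_a, newline="") as file_a, open(path_b, newline="") as file_b:
        return list(csv.reader(file_a)) == list(csv.reader(file_b))
```

With the directories from the reproduction steps above, the check would be `same_csv("cytoOnly/exemplar-001_cytoRingMask.csv", "both/exemplar-001_cytoRingMask.csv")`, which should return True once the masks are quantified independently.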

JoshuaHess12 commented 3 years ago

@ArtemSokolov No problem! I think this makes sense now. I moved the extraction of Cell IDs inside the loop so that it gets executed separately for each mask. Let me know if the latest commit addresses the issue.

ArtemSokolov commented 3 years ago

Works great, @JoshuaHess12! I can confirm that --masks cytoRingMask.tif and --masks nucleiRingMask.tif cytoRingMask.tif produce identical .csv files for the cytoplasm mask.