Rewrite quantification module to allow image preprocessing before quantification

Yu-AnChen commented 3 years ago

Major changes

Separate the morphology and intensity tables
Add an option to quantify only morphology or only intensity
Add an option to run a customized image processing step before quantification
Add an option to reduce I/O at the cost of using more RAM

TODOs

[ ] Add hdf5 reader
[ ] Try out processing in chunks
[ ] Update CLI
[ ] Include imagej rolling ball background subtraction

Yu-AnChen commented 3 years ago

Is this the column order we want, or should the additional properties come after the fixed properties?

In [8]: p = pipeline.Pipeline(mask_paths=mask_paths, mask_props=['moments', 'coords'])

In [9]: p.run()
Quantifying mask <cellRingMask.tif>
Completed. max id: 13678 number of ids: 13678

In [10]: pd.read_csv('_cellRingMask_morphology.csv', nrows=2).columns
Out[10]:
Index(['CellID', 'moments-0-0', 'moments-0-1', 'moments-0-2', 'moments-0-3',
       'moments-1-0', 'moments-1-1', 'moments-1-2', 'moments-1-3',
       'moments-2-0', 'moments-2-1', 'moments-2-2', 'moments-2-3',
       'moments-3-0', 'moments-3-1', 'moments-3-2', 'moments-3-3', 'coords',
       'X_centroid', 'Y_centroid', 'Area', 'MajorAxisLength',
       'MinorAxisLength', 'Eccentricity', 'Solidity', 'Extent', 'Orientation'],
      dtype='object')
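
For reference, the alternative ordering (fixed properties first, additional properties after) could be sketched as below. This is only an illustration working on column names; `FIXED_COLUMNS` and `fixed_first` are hypothetical names, not part of the module, and the fixed list is copied from the output above.

```python
# Fixed morphology columns, copied from the table printed above.
# FIXED_COLUMNS and fixed_first are hypothetical names used for illustration.
FIXED_COLUMNS = [
    "CellID", "X_centroid", "Y_centroid", "Area", "MajorAxisLength",
    "MinorAxisLength", "Eccentricity", "Solidity", "Extent", "Orientation",
]


def fixed_first(columns):
    """Return the column names with fixed properties first and extras after."""
    # Fixed columns keep the order given in FIXED_COLUMNS.
    fixed = [c for c in FIXED_COLUMNS if c in columns]
    # Extra columns keep the order in which they appear in the table.
    extras = [c for c in columns if c not in FIXED_COLUMNS]
    return fixed + extras


columns = ["CellID", "moments-0-0", "coords", "X_centroid", "Y_centroid", "Area"]
reordered = fixed_first(columns)
print(reordered)
```

With a pandas table, the same list could be used to select columns, e.g. `df[fixed_first(list(df.columns))]`.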

Yu-AnChen commented 3 years ago

Also, how do we want to arrange the table when there are multiple intensity properties? Do we still keep the centroid columns at the end of the table?

In [22]: p = Pipeline(mask_paths=mask_paths, img_path=img_path, marker_csv_path=marker_csv_path,
    ...:              skip='morphology', save_RAM=False, intensity_props=['median_intensity'])

In [23]: p.run()
Quantifying channel with mask ['cellRingMask.tif', 'cytoRingMask.tif', 'nucleiRingMask.tif']
100%|██████████████████████████████████████████████████████████████████| 12/12 [00:44<00:00,  3.71s/it] 
Completed. cellRingMask.tif - exemplar-001.ome.tif
max id: 13678 number of ids: 13678

Completed. cytoRingMask.tif - exemplar-001.ome.tif
max id: 13678 number of ids: 12448

Completed. nucleiRingMask.tif - exemplar-001.ome.tif
max id: 13424 number of ids: 13424

In [26]: pd.read_csv('exemplar-001_cellRingMask_intensity.csv', nrows=2).columns
Out[26]:
Index(['CellID', 'DNA_1_mean_intensity', 'DNA_1_median_intensity',
       'AF488_mean_intensity', 'AF488_median_intensity',
       'AF555_mean_intensity', 'AF555_median_intensity',
       'AF647_mean_intensity', 'AF647_median_intensity',
       'DNA_2_mean_intensity', 'DNA_2_median_intensity',
       'A488_background_mean_intensity', 'A488_background_median_intensity',
       'A555_background_mean_intensity', 'A555_background_median_intensity',
       'A647_background_mean_intensity', 'A647_background_median_intensity',
       'DNA_3_mean_intensity', 'DNA_3_median_intensity', 'FDX1_mean_intensity',
       'FDX1_median_intensity', 'CD357_mean_intensity',
       'CD357_median_intensity', 'CD1D_mean_intensity',
       'CD1D_median_intensity', 'X_centroid', 'Y_centroid'],
      dtype='object')

ArtemSokolov commented 3 years ago

I would really prefer to not have suffixes on column names, because it creates a mismatch with markers.csv and leads to a lot of problems in downstream processing. (This was the whole motivation for #26 in the first place.)

I like the idea of creating a subdirectory structure, where we can capture all combinations of background subtraction methods and intensity properties. For example,

# First token in directory name specifies the background subtraction method
original-mean/unmicst-exemplar-001_cell.csv
constant-mean/unmicst-exemplar-001_cell.csv
rollball-mean/unmicst-exemplar-001_cell.csv

# Second token specifies the intensity properties
original-mean/unmicst-exemplar-001_cell.csv
original-median/unmicst-exemplar-001_cell.csv
original-gini/unmicst-exemplar-001_cell.csv

# Suffix on the filename to denote which mask was quantified
original-mean/unmicst-exemplar-001_cell.csv
original-mean/unmicst-exemplar-001_nuclei.csv
original-mean/unmicst-exemplar-001_cyto.csv

# Morphological features get their own directory
morpho/unmicst-exemplar-001_cell.csv
morpho/unmicst-exemplar-001_nuclei.csv
morpho/unmicst-exemplar-001_cyto.csv
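
A minimal sketch of how the paths listed above could be assembled. The function names and the `prefix` parameter are hypothetical; only the naming pattern itself comes from the examples.

```python
from pathlib import PurePosixPath


def intensity_table_path(background, intensity_prop, sample, mask, prefix="unmicst"):
    """Build '<background>-<property>/<prefix>-<sample>_<mask>.csv'."""
    directory = PurePosixPath(f"{background}-{intensity_prop}")
    return directory / f"{prefix}-{sample}_{mask}.csv"


def morphology_table_path(sample, mask, prefix="unmicst"):
    """Morphological features go to the 'morpho' directory."""
    return PurePosixPath("morpho") / f"{prefix}-{sample}_{mask}.csv"


print(intensity_table_path("rollball", "median", "exemplar-001", "nuclei"))
print(morphology_table_path("exemplar-001", "cell"))
```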

As far as column order goes, I don't think it matters. I don't think any of the downstream modules assume a specific order.

Yu-AnChen commented 3 years ago

I see; it is confusing when the column names do not match the marker names.

A note on the example above: the parameters for each image processing step will be captured only in the log file, so the files are not meant to be moved out of their containing folders.

So in the case of a TMA, the structure would look like the following:

# Suffix on the filename to denote which mask was quantified
original-mean/unmicst-Core_1_cell.csv
original-mean/unmicst-Core_1_nuclei.csv
original-mean/unmicst-Core_1_cyto.csv
original-mean/unmicst-Core_2_cell.csv
original-mean/unmicst-Core_2_nuclei.csv
original-mean/unmicst-Core_2_cyto.csv
original-mean/unmicst-Core_N_cell.csv
original-mean/unmicst-Core_N_nuclei.csv
original-mean/unmicst-Core_N_cyto.csv

I recall that the column order was meant to match the histoCAT format. If that is not a constraint, I'll place the additional morphological columns at the end of the table; for the intensity tables, I'll keep pushing the X_centroid and Y_centroid columns to the end.

ArtemSokolov commented 3 years ago

Yes, that TMA example looks correct to me.

@DenisSch can chime in, but I don't think we are constraining the column order to match histoCAT. As far as I know, all downstream tools reference columns by their name instead of index.

Yu-AnChen commented 3 years ago

I looked into the possible outputs of the intensity features in skimage.measure.regionprops, based on the v0.18 source code (the dev branch has more). weighted_centroid is one of them, for example:

In [43]: skimage.measure.regionprops_table(np.eye(3, 3, dtype=int), np.random.random((3, 3)),
    ...:                                   properties=['weighted_centroid', 'mean_intensity'])
Out[43]:
{'weighted_centroid-0': array([0.62728601]),
 'weighted_centroid-1': array([0.62728601]),
 'mean_intensity': array([0.34246916])}

In this case, when one chooses to compute weighted centroids for each channel, we cannot replace the two keys in the resulting dictionary with a single marker name, unless we create X folders for the X keys returned per property, which is probably not ideal. Any ideas or recommendations on this?

As for the mcmicro-specific file structure and naming, the approach I currently have in mind is the following:
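
One way to find which properties return more than one key is to group the keys by property name. The sketch below assumes multi-valued properties are suffixed with `-<index>` (as in `weighted_centroid-0` and `moments-0-1` above) and that property names contain no such numeric suffix themselves; the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches one or more trailing '-<digits>' groups, e.g. '-0' or '-0-1'.
_INDEX_SUFFIX = re.compile(r"(-\d+)+$")


def group_keys_by_property(keys):
    """Map each property name to the list of keys it produced."""
    groups = defaultdict(list)
    for key in keys:
        groups[_INDEX_SUFFIX.sub("", key)].append(key)
    return dict(groups)


def multi_valued(keys):
    """Keep only the properties that produced more than one key."""
    return {p: ks for p, ks in group_keys_by_property(keys).items() if len(ks) > 1}


keys = ["weighted_centroid-0", "weighted_centroid-1", "mean_intensity"]
print(multi_valued(keys))
```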

The module outputs a raw and flat table as a temporary file
In the mcmicro-specific script, re-organize the table and directory structure
Remove the temporary file
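
The re-organization step could start from a plan like the one sketched below, which maps each intensity property to the flat columns it owns and the marker name each column should be renamed to. This assumes flat columns are named `<marker>_<property>`, as in the intensity table shown earlier in this thread; `split_by_property` is a hypothetical helper, not part of the module.

```python
def split_by_property(columns, markers, props):
    """Return {property: {flat_column: marker}} for the columns present."""
    plan = {prop: {} for prop in props}
    for marker in markers:
        for prop in props:
            flat = f"{marker}_{prop}"
            if flat in columns:
                # The per-property table uses the bare marker name,
                # so it matches markers.csv.
                plan[prop][flat] = marker
    return plan


columns = [
    "CellID", "DNA_1_mean_intensity", "DNA_1_median_intensity",
    "AF488_mean_intensity", "AF488_median_intensity",
    "X_centroid", "Y_centroid",
]
plan = split_by_property(columns, ["DNA_1", "AF488"], ["mean_intensity", "median_intensity"])
print(plan["median_intensity"])
```

With pandas, each entry could then be written out as its own table, e.g. `df[["CellID", *mapping]].rename(columns=mapping)` for each `mapping` in `plan.values()`.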

ArtemSokolov commented 3 years ago

I think weighted_centroids would be morphology features, even though they depend on an intensity channel. So, they should go into the morpho/ directory, alongside the standard centroid:

CellID, centroid_x, centroid_y, weighted_centroid_CD45_x, weighted_centroid_CD45_y, etc.
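
A sketch of the renaming this would require for one channel. It relies on scikit-image ordering 2D coordinates as (row, col), so index 0 maps to y and index 1 maps to x; `weighted_centroid_columns` is a hypothetical helper.

```python
# scikit-image reports 2D coordinates as (row, col): index 0 is y, index 1 is x.
AXIS_NAMES = {"0": "y", "1": "x"}


def weighted_centroid_columns(marker):
    """Map regionprops_table keys to marker-specific column names."""
    return {
        f"weighted_centroid-{index}": f"weighted_centroid_{marker}_{axis}"
        for index, axis in AXIS_NAMES.items()
    }


mapping = weighted_centroid_columns("CD45")
print(mapping)
```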

But you are right: we should check to see what possible intensity features may produce more than a single value. It's possible that we may need different strategies for different features.

I don't think it's a problem to create X directories for X intensity features. The expectation is that a user will only select one (maybe two) features, with mean, median and the gini index being the most common.

Having separate scripts for default and mcmicro-specific behaviors makes sense to me. We did the same thing for the ilastik module.

labsyspharm / quantification

Rewrite quantification module to allow image preprocessing before quantification #31