agitter opened 5 years ago
Since many raw images are corrupted, we only had features extracted from 317 plates. Then, for each plate, I randomly sampled 5 images from random DMSO wells and random fields of view, giving 317 * 5 = 1585 rows.
Then we can visualize this 1585 * 4096 matrix.
Following the same procedure, I sampled 1585 non-DMSO images. We also can visualize this 1585 * 4096 matrix.
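The per-plate sampling step can be sketched as follows; the metadata table, plate count, and column names here are illustrative stand-ins, not the actual file layout:

```python
import numpy as np
import pandas as pd

# Illustrative metadata: one row per image, with its plate and treatment.
# (Stand-in for the real index of 317 plates; here 3 plates x 40 images.)
meta = pd.DataFrame({
    "plate": np.repeat([f"P{i:03d}" for i in range(3)], 40),
    "treatment": np.tile(["DMSO"] * 10 + ["drug"] * 30, 3),
})

def sample_per_plate(meta, treatment, n=5, seed=0):
    """Randomly pick n images of the given treatment from every plate."""
    pool = meta[meta["treatment"] == treatment]
    return pool.groupby("plate", group_keys=False).sample(n=n, random_state=seed)

dmso_rows = sample_per_plate(meta, "DMSO", n=5)
# With 317 plates this would yield 317 * 5 = 1585 rows.
```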
I added a color bar on the left to indicate which batch each row comes from.
The mapping is {1 => red, 2 => orange, 3 => yellow, 4 => green, 5 => skyblue, 6 => blue, 7 => purple, 8 => black}.
Also see our batch grouping in the DMSO distribution.
I subsampled features from 5 plates in batch 3 and 5 plates in batch 4 to test batch effect adjustment algorithms. This gave me a (41410, 4096) feature matrix.
Then, I used the `ComBat()` function from the R package `sva` to remove the batch effect in this feature matrix. `sva` is a popular package for gene expression/RNA sequencing analysis, and `ComBat()` uses an empirical Bayes method to adjust for batch effects.
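ComBat's empirical Bayes shrinkage is more involved, but the core per-batch location/scale adjustment it performs can be sketched in Python (a simplified version without the shrinkage step; function and variable names are mine):

```python
import numpy as np

def location_scale_adjust(X, batches):
    """Simplified batch adjustment: standardize each feature overall,
    then remove each batch's mean and scale its variance to 1.
    ComBat additionally shrinks the per-batch estimates with empirical Bayes."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0) + 1e-8
    Z = (X - grand_mean) / grand_std
    out = np.empty_like(Z)
    for b in np.unique(batches):
        idx = batches == b
        mu = Z[idx].mean(axis=0)
        sd = Z[idx].std(axis=0) + 1e-8
        out[idx] = (Z[idx] - mu) / sd
    # Map back to the original feature scale.
    return out * grand_std + grand_mean
```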
I visualized (4096, 4096) heat maps of the feature matrix before and after adjustment. In each heatmap, the top 2048 rows are randomly sampled from batch 3, while the bottom 2048 rows are from batch 4. The two heatmaps use the same sampling indices.
`ComBat()` really removes the batch effects. We can also plot the UMAP before/after normalization, but I am pretty confident that the two batches should have a similar distribution in the 2D space.

We can try the `sva()` function in the `sva` package to do an unsupervised analysis on the same data used for the supervised `ComBat()` analysis run above. Here supervised refers to having the batch identities.
We will need a way to evaluate the three current normalization strategies:
One approach would be to normalize, run UMAP on the feature matrix, and visualize the DMSO images in the new 2D space from UMAP. If the normalization worked well, we might hope to see a single DMSO cluster now instead of the dual clusters we saw for some batches in #5.
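A sketch of that evaluation loop; it uses umap-learn when available and falls back to PCA only so the sketch runs without it, and the feature matrix here is a random stand-in:

```python
import numpy as np

def project_2d(X, seed=0):
    """Project features to 2D, preferring UMAP; PCA is only a fallback
    so this sketch runs without umap-learn installed."""
    try:
        import umap  # umap-learn
        return umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
    except ImportError:
        from sklearn.decomposition import PCA
        return PCA(n_components=2, random_state=seed).fit_transform(X)

# Hypothetical usage: normalize, project, then plot only the DMSO rows
# colored by batch to check whether the dual clusters from #5 merge.
rng = np.random.default_rng(0)
X_norm = rng.normal(size=(200, 32))  # stand-in for normalized features
coords = project_2d(X_norm)
```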
We will also continue thinking about the calibration transfer strategies, possibly including the CNN approach we discussed before.
We could also consider encoding the batch (or plate?) as a feature and using it when predicting drug responses. The batches have clear visual differences but we created them, whereas the plates are direct technical indices.
bc2547babfcbcfe07dbab68cf945eefa4f4e1ebe adds UMAP plots for combatP (parametric ComBat) normalization. Surprisingly, the two batches are separated in the UMAP 2D space after normalization, even though we saw a consistent heat map in https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-444744466.
I have been trying to use `sva()` to get batch labels, instead of setting them artificially. Since `sva()` requires a model matrix and other parameters, I found it hard to set up for our feature matrix. `sva()` is also very slow.
I found a paper, *Removing Batch Effects From Histopathological Images for Enhanced Cancer Diagnosis*, which tried 5 different methods to remove batch effects from histopathological images (mean, rank, ratio, combatP, combatN). It concluded that combatN has the most positive impact on downstream prediction performance. Mean normalization also gives good results.
Therefore, I tried their mean normalization method on our (41410, 4096) feature matrix. The result is better than the combatP method:
CombatN is extremely slow: it ran for 5 hours (on only 10 plates) on Condor and failed to find an adjustment (the result matrix contains only NA values). I will rerun the job.
The standard batch effect correction methods are either too slow or unsatisfactory. The network-based normalization could still possibly work but will be hard to implement and execute.
We could also contact the Broad team again to point out what we perceive as a batch effect and ask for suggestions. Instead of going to the authors directly, we can post to the meta image analysis board that supports CellProfiler, ImageJ, etc.
The imaging forum (image.sc) is very helpful. Someone actually asked questions about the Cell Painting data we are using.
Based on the replies, the Carpenter Lab has noticed the batch effects. For the provided CellProfiler profiling features, they applied per-plate normalization using the z-score method.
The Carpenter Lab has published a very nice paper discussing the cell image processing procedures in their lab. For batch effect detection, it suggests aggregating the negative control features for each plate (the median is recommended) and then visualizing the correlation matrix.

For batch effect removal, it recommends using DMSO (negative control) as the "normalizing population," then applying z-score normalization (similar to the mean method we have used) within each plate, directly to every feature.
Relative normalization. This procedure consists of computing statistics (for example, median and median absolute deviation) in one population of samples, and then centering and scaling the rest with respect to that population. Ideally, features are normalized across an entire screen in which batch effects are absent; however, normalization within plates is generally performed to correct for batch effects (described in 'Batch-effect correction'). When choosing the normalizing population, we suggest the use of control samples (assuming that they are present in sufficient quantity), because the presence of dramatic phenotypes may confound results. This procedure is good practice regardless of the normalization being performed within plates or across the screen. Alternately, all samples on a plate can be used as the normalizing population when negative controls are unavailable, too few, or unsuitable for some reason, and when samples on each plate are expected to not be enriched in dramatic phenotypes.
We recommend applying normalization across all features. Normalization can be applied even if features are not transformed, and it is preferable to remove biases while simultaneously fixing range issues. z-score normalization is the most commonly used procedure in our laboratories. Normalization also aligns the range of different features, thus decreasing the effects of unbalanced scales when computing similarities (described in 'Measuring profile similarity') or applying analysis algorithms (described in 'Downstream analysis'). It is advisable to compare several transformation and normalization methods, because their performance can vary significantly among assays.
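The relative normalization described in the quote, using the median and MAD of a normalizing population, might look like this (the 1.4826 consistency factor and the `eps` guard are my additions):

```python
import numpy as np

def robust_normalize(X, control_mask, eps=1e-8):
    """Center and scale every feature by the median and MAD of the
    normalizing population (e.g. DMSO wells), as the protocol suggests."""
    controls = X[control_mask]
    med = np.median(controls, axis=0)
    # 1.4826 makes the MAD a consistent estimator of the std for Gaussians.
    mad = 1.4826 * np.median(np.abs(controls - med), axis=0)
    return (X - med) / (mad + eps)
```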
Using the recommended batch effect detection method, we get the following correlation heatmap. I used the element-wise median of the DMSO CNN features (4096) in each plate to compute this 317 * 317 correlation matrix. The horizontal and vertical axes are the sorted 317 plate IDs.
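The detection recipe, sketched with hypothetical inputs (a feature matrix, a plate ID per row, and a DMSO mask):

```python
import numpy as np

def plate_correlation(features, plates, is_dmso):
    """For each plate, take the element-wise median of its DMSO feature
    vectors, then correlate the per-plate medians (317x317 in our data)."""
    plate_ids = np.unique(plates)
    medians = np.vstack([
        np.median(features[(plates == p) & is_dmso], axis=0)
        for p in plate_ids
    ])
    return np.corrcoef(medians), plate_ids
```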
I have tried the z-score normalization on all extracted features. The results do not look very good, so I tried some other methods.
One issue I have encountered is that some features have zero variance on the normalizing population but nonzero variance on treated images. I discussed it with Shantanu on the image.sc forum. I believe he knows we are working on the Cell Painting data.
My solution is to use the feature variance over all images instead of only the DMSO images when the latter is 0. If the variance over all images in that plate is also 0, I stop normalizing that feature (set mean = 0, std = 1) and report it. This rule is applied to all three normalizations below (use the superset to compute the std).
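That fallback rule can be sketched as follows (function and variable names are mine):

```python
import numpy as np

def safe_zscore(X, norm_mask):
    """Z-score each feature using the normalizing population (e.g. DMSO).
    If a feature's std is 0 on that population, fall back to the std over
    all images; if that is also 0, leave the feature unnormalized
    (mean=0, std=1) and flag its index."""
    mean = X[norm_mask].mean(axis=0)
    std = X[norm_mask].std(axis=0)
    fallback = std == 0
    std = np.where(fallback, X.std(axis=0), std)  # superset std
    dead = std == 0                               # constant over all images too
    mean = np.where(dead, 0.0, mean)
    std = np.where(dead, 1.0, std)
    return (X - mean) / std, np.flatnonzero(dead)
```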
Method | Description |
---|---|
DMSO Within-plate Normalization | Use in-plate DMSO features to compute mean and sd for each plate |
All-Feature Within-plate Normalization | Use all in-plate features to compute mean and sd for each plate |
DMSO Across-plate Normalization | Use all DMSO features across all plates to compute mean and sd for all plates |
406 Plates Before Normalization |
---|
![]() |
406 Plates After DMSO Within-plate Normalization |
![]() |
406 Plates After All-Feature Within-plate Normalization |
![]() |
406 Plates After DMSO Across-plate Normalization |
![]() |
These next steps make sense. UMAP has been helpful in visualizing the batch effects in the past. We are comfortable sharing these heatmaps in the image.sc forum if it can help us determine the best normalization.
I asked for suggestions here, but Shantanu didn't reply. I guess these questions are too detailed and not related to CellProfiler.
Then I tried to visualize the normalization results using UMAP with some slight visualization changes (before normalization, after DMSO Within-plate Normalization, after All-Feature Within-plate Normalization, after DMSO Across-plate Normalization).

406 Plates Before Normalization |
---|
![]() |
406 Plates After DMSO Within-plate Normalization |
![]() |
406 Plates After All-Feature Within-plate Normalization |
![]() |
406 Plates After DMSO Across-plate Normalization |
![]() |
The DMSO Within-plate Normalization and All-Feature Within-plate Normalization strategies are working. If we go with the DMSO Within-plate Normalization method, we can use this version of the features in the subsequent training.

I agree DMSO Within-plate Normalization looks great. We can proceed with that.
Now that we are happy with the 2D UMAP representations after this normalization, we can spot check individual images to confirm the extracted features and UMAP 2D coordinates make sense with respect to the original images. If we fit a 2D Gaussian to the DMSO images in the 2D space, we can score each treated image by the likelihood it was generated from that Gaussian. Then we can look at 10-100 images with the highest likelihood and 10-100 with the lowest. Perhaps we do this for more than one batch.
One hypothesis is still that the outliers here are those where the chemicals have the greatest effect on the cancer cells. That may be a research question we can explore next. Before trying to systematically validate the outliers, we could look at some that are strong outliers and manually search ChEMBL or other databases to see what evidence exists that we can use to support our predictions.
I was preparing for an admission interview this week. To show the professor that I am capable of designing and implementing visual analytics systems, I designed a visualization for flexibly viewing a million UMAP dots.

The link above should be private. He gave me lots of suggestions about this visualization. The plot is still rough. I can give you a demo in our meeting, and we can discuss whether this plot is helpful.
Very nice! I'm curious to hear the suggestions. Did you do this directly in D3 or does it use some library?
Yes, I am using D3.js and HTML SVG directly. It might not be our main interest, but I got many fun suggestions from my interviewer and Michael Gleicher.
I have tried to fit a 2D Gaussian distribution to all normalized DMSO UMAP values, so that I can automatically detect outliers among the non-DMSO UMAP points.
If I fit the mixture model with `component=1`, then this model predicts all non-DMSO dots have 100% probability of falling in the DMSO cluster.

If I fit the mixture model with `component=2`, then it fits two 2D Gaussian distributions within the DMSO cluster (separated by a y threshold). It does not really help us detect outliers.
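For reference, a sketch of this outlier scoring with scikit-learn's `GaussianMixture`; the 2D coordinates here are synthetic stand-ins for the UMAP values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dmso_2d = rng.normal(0.0, 1.0, size=(500, 2))   # stand-in for DMSO UMAP coords
treated_2d = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
                        rng.normal(12.0, 0.5, size=(10, 2))])  # 10 far-away points

# Fit the DMSO cluster, then score treated points by log-likelihood.
gm = GaussianMixture(n_components=1, random_state=0).fit(dmso_2d)
loglik = gm.score_samples(treated_2d)           # higher = more DMSO-like
outliers = np.argsort(loglik)[:10]              # 10 least likely points
```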
I manually picked outliers from 3 regions: far top (y > 10), far right (x > 10), and the cluster's bottom edge (y < -5), to inspect the raw images. I am fixing a bug in my image inspection code.
Group | Channel 123 | Channel 45 |
---|---|---|
y>10 | ![]() | ![]() |
y<-5 | ![]() | ![]() |
x>10 | ![]() | ![]() |
Even with these outliers, it is difficult for us to manually evaluate whether they are reasonable. The `x > 10` example could be an effective drug that killed all the cancer cells or a technical artifact with no cells in the viewing frame.
Another direction would be to look for positive controls, known effective drugs. If these are reported in the Gigascience paper, we could find the corresponding 2D coordinates and look at the images.
The Broad Institute and Sanger have projects that screen drugs against many types of cancer cell lines. If they have U2OS data, it could also give us positive controls.
Based on #7, we have concerns about over-normalization and removing the signal. We may be able to design more direct tests of the normalization. For instance, if we select some training batches and a different test batch, can we predict DMSO versus non-DMSO on the test batch successfully? We could directly compare results for the original features and normalized features. The normalization is within plate, so it should be fair to normalize even the test batch images.
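A sketch of that test with a logistic regression probe; the features are synthetic stand-ins, and the point is the split that trains on some batches and tests on a held-out one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 batches, DMSO vs non-DMSO shifted in feature space.
n, d = 300, 16
batch = rng.integers(0, 3, size=n)
y = rng.integers(0, 2, size=n)                 # 0 = DMSO, 1 = treated
X = rng.normal(size=(n, d)) + 2.0 * y[:, None]

train, test = batch != 2, batch == 2           # hold out batch 2
clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
acc = clf.score(X[test], y[test])
# Compare acc for raw vs normalized features: good normalization should
# preserve the DMSO/treated signal on the held-out batch.
```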
> The Broad Institute and Sanger have projects that screen drugs against many types of cancer cell lines. If they have U2OS data, it could also give us positive controls.
Both of these include U2OS drug screening data and can serve as positive controls of drugs that kill the cancer cells if we need those. We should also check whether the Gigascience paper discusses these screening datasets and any overlapping chemicals.
The Broad data was originally reported in https://www.nature.com/articles/nature11003#supplementary-information I checked the supplementary Excel file to confirm U2OS is there. It may be more direct to register for their data portal to access the data though: https://portals.broadinstitute.org/ccle/about
The Sanger drug sensitivity data is https://www.cancerrxgene.org/translation/CellLine/909776 They used z-scores to assess sensitivity and resistance. From the plot, it looks like no drugs passed their -2.0 threshold for sensitivity. We could still potentially learn what the most sensitive versus the most resistant chemicals do to the cells.
e10df9bc80588526266e9cc3f3992f6c18e471be adds analysis of these positive control compounds.
Below is their UMAP projection onto the pre-normalized DMSO space:

Below is their UMAP projection onto the normalized DMSO space:
The running jobs to look at the images from the max and min z-scores may still be informative. Otherwise, these results are somewhat confusing. We might expect that the images for the sensitive drugs with the most positive z-scores have fewer cells or show some sign that the cells are being affected or killed. In the 2D plots, however, the yellow and dark blue points are sometimes adjacent and the yellow points are distributed throughout the 2D space.
We have 11 compounds with known effects overlapping with our image dataset. This corresponds to 504 images. I randomly sampled 100 images each from channel 123 and channel 45. Then, I sorted these raw images by their drug sensitivity z-scores.
My first observations were the same as yours, which is not encouraging. However, there may be more subtle biological cues that we both lack the expertise to pick up. For instance, the cell shape could be changing in a consistent way even though I can't see any trends.
Cell number box plot for 1000 randomly sampled compounds:
Continuing the discussion from #5.
We will select two batches that have two DMSO clusters and a subset of plates to explore normalization options. Three normalization ideas are:
For the extracted features, we may also try visualizing the data matrix as a heat map if it is feasible. We would keep the plate ordering and use hierarchical clustering of the features to group similar features.
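Clustering the feature columns while preserving the plate row order could look like this (sizes are toy stand-ins):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))        # rows = images in plate order

# Hierarchically cluster the feature columns so similar features sit
# next to each other, while leaving the row (plate) ordering untouched.
Z = linkage(X.T, method="average", metric="correlation")
col_order = leaves_list(Z)           # permutation of the 12 feature columns
X_heatmap = X[:, col_order]          # matrix to render as a heat map
```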