High-throughput run over all plates (continued)

agitter commented 5 years ago

Continuing the discussion from #5.

We will select two batches that have two DMSO clusters and a subset of plates to explore normalization options. Three normalization ideas are:

Normalize the original images using the mean channel intensities
Extract features with the CNN and treat those as a new data matrix. Explore which standard batch effect correction methods, e.g. for gene expression data, may be applicable.
Train a new Siamese CNN to learn how to normalize the images by constraining images with the same labels to have the same latent representation. We will use the Carpenter lab approach initially if it is applicable.

For the extracted features, we may also try visualizing the data matrix as a heat map if it is feasible. We would keep the plate ordering and use hierarchical clustering of the features to group similar features.

xiaohk commented 5 years ago

Heatmap for DMSO images

Since there are many corrupted raw images, we only had features extracted from 317 plates. Then, for each plate, I randomly sample 5 images, from random DMSO wells and random field of view. It gave me 317 * 5 = 1585 rows.

Then we can visualize this 1585 * 4096 matrix.

heat_map_dmso_5

Heatmap for Non-DMSO images

Following the same procedure, I sampled 1585 non-DMSO images. We also can visualize this 1585 * 4096 matrix.

heat_map_no_dmso_5

Comments

The original feature matrix has a very large range (from 0 to about 30000). Therefore, I linearly scaled the values into [0, 1] for each column.
There are regions in the matrix that are always sparse. For each row, there are many columns sharing the same value.
Note that when we extract features from the raw images, we split 5 channels into one 3-channel image and another 2-channel image. Each image gives us 2048 features, then we merge two vectors to get 4096 features for each 5-channel image. Therefore, we see some repetitive patters for each row.
I added a color bar on the left to indicate which batch is the row comes from.

The mapping is {1 => red, 2 => orange, 3 => yellow, 4 => green, 5 => skyblue, 6 => blue, 7 => purple, 8 => black}.
From the color bar, we can see that batch 2 (orange) and batch 3 (yellow) are picked up by the hierarchical clustering, while others are not so obvious.
Batches 1, 2, 3 (red, orange, yellow) tend to be grouped together, perhaps we can group batches 1, 2, 3 together? See the new plots if I group 1, 2, 3 as red, 4, 6, 8 as blue, 5, 7 as green. Maybe the plate number does not necessarily follows the imaging order?

heat_map_dmso_5_3batch

Also see our batch grouping in the DMSO distribution.

xiaohk commented 5 years ago

I subsampled features of 5 plates in batch 3, and 5 plates in batch 4 to test batch effect adjustment algorithms. It gave me a (41410, 4096) feature matrix.

Then, I used the ComBat() function from the R package sva to remove the batch effect in this feature matrix. sva is a popular package for gene expression/RNA sequencing, and ComBat() uses Bayes method to adjust batch effects.

I visualized the heat map (4096, 4096) for the feature matrix. In each heatmap, top 2048 rows are randomly sampled from the rows in batch 3, while the bottom 2048 rows are from batch 4. Two heatmaps have the same sampling index.

Before adjustment

full size image

After adjustment

full size image

Comments

Each row in the heatmap looks symmetric. This is because we are merging two 2048 features (from channel-123, channel-45).
ComBat() really removes the batch effects. We also can plot the UMAP before/after normalization, but I am pretty confident that two batches should have a similar distribution in the 2D space.
This normalization method greatly modify the feature values (especially rows from batch 3). I am not sure if is the right way to do. One way to validate it is to see if it improves our drug test prediction. I think changing the original pixels makes more sense, because we can check if the images make sense after normalization.

Some other ideas for normalization

Design an optimization problem to find a linear transformation to transform all off-batch feature vectors to minimize the "batch distance". We only use DMSO images in the optimization. It might relate to multi response regression.
There are a lot of research in this problem. One key word is "calibration transfer".

agitter commented 5 years ago

We can try the sva function in the sva package to do an unsupervised analysis on the same data used for the supervised ComBat analysis run above. Here supervised refers to having the batch identities.

We will need a way to evaluate the three current normalization strategies:

ComBat (as above) on the full dataset (or very large subsampled dataset)
sva
Intensity normalization directly on the input images before feature extraction

One approach would be to normalize, run UMAP on the feature matrix, and visualize the DMSO images in the new 2D space from UMAP. If the normalization worked well, we might hope to see a single DMSO cluster now instead of the dual clusters we saw for some batches in #5.

We will also continue thinking about the calibration transfer strategies, possibly including the CNN approach we discussed before.

We could also consider encoding the batch (or plate?) as a feature and use that when predicting drug responses. The batches have clear visual differences but we created them, whereas the plates are directly technical indices.

xiaohk commented 5 years ago

Use UMAP to measure normalization performance

bc2547babfcbcfe07dbab68cf945eefa4f4e1ebe adds UMAP plots for CombatP (Parametric Combat) normalization. Surprisingly, two batches are separated in the UMAP 2D space after normalization, even though we have seen a consistent heat map in https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-444744466.

test

After a successful normalization, we expect to see overlapping DMSO dots from two batches.
CombatP does have some effects, but it does not remove the batch effects
Maybe heat maps overlook the small connection between each image in one batch?

Other normalization

`sva`

I have been trying to use sva to get batch labels, instead of setting them artificially. Since sva requires model matrix and other parameters, I found it hard to set up for our feature matrix. And sva is very slow.

Mean Normalization

I found a paper Removing Batch Effects From Histopathological Images for Enhanced Cancer Diagnosis which tried 5 different methods to remove the batch effect of histopathological images (mean, rank, ratio, combatP, combatN). It concluded that combatN has the most positive impact on later prediction performance. Mean normalization also has a good result.

nihms799862f9

Therefore, I tried to use their mean normalization method on our (41410, 4096) feature matrix.

The result is better than combatP method:

It changes feature values more progressively than combatP
It is interesting that two clusters have "perpendicular" orientations in the 2D space. I am not sure if it means anything
It makes sense in the 2D space that two batches now have the same mean (same cluster center), and the same standard deviation (same spread)

combatN

CombatN is extremely slow. It runs 5 hours (only for 10 plates) on Condor. It failed to find adjustment (result matrix contains only NA values). I will try to rerun the job again.

Other comments

I was worried about direct normalization on extracted features might cause problems, but it seems okay based on the paper.
There is only few research studying image batch effects.
We can look deeper into it and perhaps set up a design like this paper, which has known batches. We can compare network-based normalization to baselines like mean normalization. We probably can get more experiment information from Broad Institute to create a more "official/correct" batch separation.

agitter commented 5 years ago

The standard batch effect correction methods are either too slow or unsatisfactory. The network-based normalization could still possibly work but will be hard to implement and execute.

We could also contact the Broad team again to point out what we perceive as a batch effect and ask for suggestions. Instead of going to the authors directly, we can post to the meta-image analysis board that supports Cell Profiler, ImageJ, etc.

xiaohk commented 5 years ago

The imaging forum (image.sc) is very helpful. Someone actually asked questions about the Cell Painting data we are using.

Based on the replies, Carpenter Lab has noticed the batch effects. For the provided CellProfiler profiling features, they have processed per-plate normalization using z-score method.

Carpenter Lab has published a very nice paper discussing the cell image processing procedure in their lab. For batch effect detection, it suggests to aggregate the negative control feature for each plate (median is recommended), and then visualize the correlation matrix.

For batch effect removal, it recommends us to use DMSO (negative control) as the "normalizing population," then apply z-score normalization (similar to the mean-method we have used) within each plate, directly on every feature.

Relative normalization. This procedure consists of computing statistics (for example, median and median absolute deviation) in one population of samples, and then centering and scaling the rest with respect to that population. Ideally, features are normalized across an entire screen in which batch effects are absent; however, normalization within plates is generally performed to correct for batch effects (described in 'Batch-effect correction'). When choosing the normalizing population, we suggest the use of control samples (assuming that they are present in sufficient quantity), because the presence of dramatic phenotypes may confound results. This procedure is good practice regardless of the normalization being performed within plates or across the screen. Alternately, all samples on a plate can be used as the normalizing population when negative controls are unavailable, too few, or unsuitable for some reason, and when samples on each plate are expected to not be enriched in dramatic phenotypes.

We recommend applying normalization across all features. Normalization can be applied even if features are not transformed, and it is preferable to remove biases while simultaneously fixing range issues. z-score normalization is the most commonly used procedure in our laboratories. Normalization also aligns the range of different features, thus decreasing the effects of unbalanced scales when computing similarities (described in 'Measuring profile similarity') or applying analysis algorithms (described in 'Downstream analysis'). It is advisable to compare several transformation and normalization methods, because their performance can vary significantly among assays.

xiaohk commented 5 years ago

Using the recommended batch effect detection method, we can get the following correlation heatmap. I used the "elemental-wise" median of DMSO CNN feature (4096) in each plate to compute this 317*317 correlation matrix. The horizontal and vertical axises are the sorted 317 plate ID's.

feature_median_cor_317

We can infer batch effects based on the non-uniform distribution of squares.
Depending on how we group the diagonal squares, there can be 6~10 batches. It matches what we have seen in the mean intensity plot (~8 batches).
We can also infer some outliers using this method. These outliers match the outliers on our mean intensity plot.
Compared to our previous detection method, this way is more indirect (based on feature instead of image pixels), but it is more light-weight (no need 5 channels). It is also easier to compute after extracting the features. We can adopt this method in future projects.

Next step:

Visualize the correlation matrix after normalization.
I have redownloaded the corrupted images. We now have the complete 406 plates. I am extracting features from the images (10 jobs running for 2 days already).
Use z-score normalization on all extracted features, and assess its performance.

xiaohk commented 5 years ago

I have tried the z-score normalization on all extracted features. The results do not look very good, so I tried some other methods.

One issue I have encountered is the variance of some features on the normalizing population is 0, but they are not 0 on treated images. I have discussed it with Shantanu on image.sc forum. I believe he knows we are working on the Cellpainting data.

My solution is to use the feature variance of all images instead of DMSO images if the latter variance is 0. If the variance for all images in that plate is 0, I then stop normalizing that feature (mean = 0, std=1) and report it. The idea is applied to all three normalization below (use super set to compute std).

Method	Description
DMSO Within-plate Normalization	Use in-plate DMSO features to compute mean and sd for each plate
All-Feature Within-plate Normalization	Use all in-plate features to compute mean and sd for each plate
DMSO Across-plate Normalization	Use all DMSO features across all plates to compute mean and sd for all plates

406 Plates Before Normalization

406 Plates After DMSO Within-plate Normalization

406 Plates After All-Feature Within-plate Normalization

406 Plates After DMSO Across-plate Normalization

Comments

It becomes harder to interpret the heatmap.
I expect the perfect heatmap to have light light yellow diagonal, and random dark off-diagonal entries.
The correlation between different plates significantly drops after all normalization methods, but the "square patches" are still very obvious.
It seems "All-Feature Within-plate Normalization" gives the best result (not perfect), but this is not the one suggested on the paper.

Next Steps

Maybe ask Shantanu if "All-Feature Within-plate Normalization" makes sense?
Use alternative visualization to assess batch effects (UMAP)
Use "All-Feature Within-plate" normalized features to predict assay activity with LR

agitter commented 5 years ago

These next steps make sense. UMAP has been helpful in visualizing the batch effects in the past. We are comfortable sharing these heatmaps in the image.sc forum if it can help us determine the best normalization.

xiaohk commented 5 years ago

I asked for suggestions here, but Shantanu didn't reply. I guess these questions are too detailed, and not related to CellProfilers.

Then I tried to revitalize the normalization results using UMAP with some slight visualization changes.

Workflow

Plot the mean intensity distribution by sorted plate number of 406 plates
Manually decide batches based on the plot (we get 10 batches now, used to have 8 batches)
Compute UMAP 2D values for all the four versions of features (before normalization, After DMSO Within-plate Normalization, After All-Feature Within-plate Normalization, After DMSO Across-plate Normalization)
For each version of features, collect all DMSO 2D values (grouped by batch)
For each version of features, randomly sample 5 plates from each batch
For each version of features, randomly plot 10 plots of 40 treated wells from those 5 plates with DMSO values from step 4
For each version of features, organize all plots for 10 batches in one image, so we can compare the batch-to-batch similarity

406 Plates Before Normalization

406 Plates After DMSO Within-plate Normalization

406 Plates After All-Feature Within-plate Normalization

406 Plates After DMSO Across-plate Normalization

Comments

It seems both DMSO Within-plate Normalization and All-Feature Within-plate Normalization strategies are working.
UMAP visualization is more intuitive than the correlation heatmap, but UMAP takes much longer time
To make the visualization easier, I sample both wells (~400 wells) and plates (5 plates). I think my samples are representative.
Ideally, if there is no batch effects, then the distribution of DMSO images should be similar across all batches. It is very interesting that the clusters within-batch are also unified after normalization.
Since Carpenter Lab suggested DMSO Within-plate Normalization method, we can use this version of features in the proceeding training.

agitter commented 5 years ago

I agree DMSO Within-plate Normalization looks great. We can proceed with that.

Now that we are happy with the 2D UMAP representations after this normalization, we can spot check individual images to confirm the extracted features and UMAP 2D coordinates make sense with respect to the original images. If we fit a 2D Gaussian to the DMSO images in the 2D space, we can score each treated image by the likelihood it was generated from that Gaussian. Then we can look at 10-100 images with the highest likelihood and 10-100 with the lowest. Perhaps we do this for more than one batch.

One hypothesis is still that the outliers here are those where the chemicals have the greatest effect on the cancer cells. That may be a research question we can explore next. Before trying to systematically validate the outliers, we could look at some that are strong outliers and manually search ChEMBL or other databases to see what evidence exists that we can use to support our predictions.

xiaohk commented 5 years ago

I was preparing for an admission interview this week. To show that professor I am capable of designing and implementing visual analytics systems, I designed a visualization for flexibly viewing a million of UMAP dots.

The link above should be private. He gave us lots of suggestions in terms of this visualization. The plot is still sketchy. I can give you a demo in our meeting, and we can discuss if this plot is helpful.

agitter commented 5 years ago

Very nice! I'm curious to hear the suggestions. Did you do this directly in D3 or does it use some library?

xiaohk commented 5 years ago

UMAP Viewer Visualization

Yes, I am directly using D3.js and HTML SVG. It might not be our main interest, but I got many fun suggestions from my interviewer and Michael Gleicher.

Functions and goals

Sliding a window through various sizes over controlled images to detect batch effects
Find outlier in treated images
A more compact and flexible way to visualize UMAP dots than tons of static plots

Limitations

Very slow rendering when the window size gets large.
I did not implement outlier detection interaction yet (could further slow the rendering)
Scatter plot overdraw issue

Feedbacks

Draw rectangles instead of circles to make rendering faster
Use a lower level framework: D3.js + SVG → D3.js + Canvas → WebGL
Use binned scatter plot (paper) to replace scatter plot (I didn't know there is lots of research on this area)

Gaussian Mixture 2D Model

I have tried to fit a Gaussian 2D distribution on all normalized DMSO UMAP values, then I can automatically detect outliers of non-DMSO UMAP points.

If I fit the mixture model using component=1, then this model predicts all non-DMSO dots have 100% probability of falling in the DMSO cluster.

If I fit the mixture model using component=2, then it fits two 2D Gaussian distributions within the DMSO cluster (departed by a y threshold). It does not really help us detect outliers.

test

test2

I manually picked outliers from 3 regions: far top (y > 10), far right (x > 10), cluster bottom edge area (y < -5) to inspect the raw images. I am fixing a bug in my image inspect code.

xiaohk commented 5 years ago

Group	Channel 123	Channel 45
y>10
y<-5
x>10

agitter commented 5 years ago

Even with these outliers, it ends up being difficult for us to manually evaluate whether they are reasonable or not. The x > 10 example could be an effective drug that killed all the cancer cells or a technical artifact with no cells in the viewing frame.

Another direction would be to look for positive controls, known effective drugs. If these are reported in the Gigascience paper, we could find the corresponding 2D coordinates and look at the images.

The Broad Institute and Sanger have projects that screen drugs against many types of cancer cell lines. If they have U2OS data, it could also give us positive controls.

agitter commented 5 years ago

Based on #7, we have concerns about over-normalization and removing the signal. We may be able to design more direct tests of the normalization. For instance, if we select some training batches and a different test batch, can we predict DMSO versus non-DMSO on the test batch successfully? We could directly compare results for the original features and normalized features. The normalization is within plate, so it should be fair to normalize even the test batch images.

agitter commented 5 years ago

The Broad Institute and Sanger have projects that screen drugs against many types of cancer cell lines. If they have U2OS data, it could also give us positive controls.

Both of these include U2OS drug screening data and can serve as positive controls of drugs that kill the cancer cells if we need those. We should also check whether the Gigascience paper discusses these screening datasets and any overlapping chemicals.

The Broad data was originally reported in https://www.nature.com/articles/nature11003#supplementary-information I checked the supplementary Excel file to confirm U2OS is there. It may be more direct to register for their data portal to access the data though: https://portals.broadinstitute.org/ccle/about

The Sanger drug sensitivity data is https://www.cancerrxgene.org/translation/CellLine/909776 They used z-scores to assess sensitivity and resistance. From the plot, it looks like no drugs passed their -2.0 threshold for sensitivity. We could still potentially learn what the most sensitive versus the most resistant chemicals do to the cells.

xiaohk commented 5 years ago

e10df9bc80588526266e9cc3f3992f6c18e471be adds analysis of these positive control compounds.

There are only 23 compounds tested on U2OS in the CCLE dataset, and I am not sure how to read the drug sensitivity from this dataset
The drug name in Sanger dataset is also more "standard" than CCLE. I have tried both clue.io API and PubChem API to convert compound name to SMILES/InChi. PubChem works better for me.
There are 34 compounds that are not in PubChem. I can find 18 of them in the Harvard LINCS database. They are all called "Small Molecule".
There are 196 compounds tested on U2OS in the Sanger dataset. If I use canonical SMILES to match compounds with our compounds, there are 7 overlapping drugs. If I used InChi to match, we can get 11 overlapping compounds.
These 11 compounds are distributed in 84 wells in our dataset. There are 458 images in total.

Below is their UMAP project on the pre-normalized DMSO space:

pre_norm_overlap

Below is their UMAP project on the normalized DMSO space:

post_norm_overlap

We would expect high sensitivity and low sensitivity compounds to have their own clusters, but they are distributed "uniformly" here.
Based on the Z score, majority of overlapping compounds do not exceed the thresholds (either sensitive or resistant). The min is -0.9554, and the max is 1.9195. That might be the reason we have a uniform distribution here.

agitter commented 5 years ago

The running jobs to look at the images from the max and min z-scores may still be informative. Otherwise, these results are somewhat confusing. We might expect that the images for the sensitive drugs with the most positive z-scores have fewer cells or show some sign that the cells are being affected or killed. In the 2D plots, however, the yellow and dark blue points are sometimes adjacent and the yellow points are distributed throughout the 2D space.

xiaohk commented 5 years ago

We have 11 compounds with known effects overlapping with our image dataset. This corresponds to 504 images. I randomly sampled 100 images from channel 123 and channel 45 respectively. Then, I sorted these raw images by their drug sensitivity Z scores.

Channel 123

Channel 45

Comment

From the 100 raw images with channel 123, we can see that compounds having low or high Z-scores could have few cells in this dataset. It might tell us cell painting data do not really encode cell biological signals.
The image color distributes uniformly over increasing Z-scores. It also suggests there is no correlation between sensitivity Z-scores and cell images.

agitter commented 5 years ago

My first observations were the same as yours, which is not encouraging. However, there may be more subtle biological cues for which we both lack suitable expertise to pick up. For instance, the cell shape could be changing in a consistent way even though I can't see any trends.

xiaohk commented 5 years ago

Cell number box plot for random 1000 compounds:

test

gitter-lab / pharmaco-image