High-throughput run over all plates

agitter commented 6 years ago

We can discuss whether this is an appropriate time to try a large-scale run over all the images. We could see if our conclusions about how few compounds affect the cells hold once we look at all of the compounds. This would also give us an initial estimate of how long it takes to do any processing on the entire dataset.

xiaohk commented 6 years ago

I tried to extract features from the first 243 plates (out of about 360 plates). It uses up 1.6TB/2TB in Gluster. The extracted features are just 5GB in total.

25 hours → download the first 100 plates
11 hours → 3 + 8 hours to extract the 80 plates
36 hours → download 143 plates
32 hours → 22 + 10 hours to extract the 143 plates

agitter commented 6 years ago

Some of the SQL files were corrupted, even after multiple attempts to transfer the files. We cannot find md5 checksums, so we may need to ask Anne Carpenter for help. We may have sufficient plates to continue our exploratory analysis and can follow up about the corrupted files later.

We believe that UMAP should work on all of the extracted images. Hierarchical clustering may not. There are special techniques for clustering large datasets, and we may want to look for Python implementations of those. AnHai Doan has worked on this problem. I will contact him.

We will also consider extracting features from earlier layers based on what we observed with the T cells.

agitter commented 6 years ago

AnHai and Adel's clustering algorithm is for bit vectors only. We can look at the clustering program matrix from Adel to see if he has identified more general approximate clustering packages.

We could approximate the hierarchical clustering ourselves by doing initial clustering per plate, merging compounds in the same clustering, and then clustering across plates. We prefer to use and off-the-shelf method though.

CHTC has high-memory machines that could be necessary for a full hierarchical clustering. We can contact them to learn what special considerations there may be for using these machines.

We can check whether scikit-learn uses multiple cores during hierarchical clustering. This may be controlled by NumPy or a lower library. We can confirm it works with our small scale tests by scaling from 1, 2, 4 cores.

UMAP likely can handle the full dataset, but if it fails it may be able to learn a stable mapping from the original feature space to the lower dimension space. If so, we could run UMAP on X% of the data and then map the 1-X % into the existing lower dimensional space. Then we could visualize compounds by plate or in smaller groups.

agitter commented 6 years ago

@xiaohk for the UMAP dimension reduction, what distance measure are you using? Should we use cosine similarity there as well to be consistent with the hierarchical clustering decision in #4?

xiaohk commented 6 years ago

Good suggestion! I was using the default euclidean metric. The UMAP package ~~doest~~ does support cosine, and it makes sense to keep the distance measure consistent.

agitter commented 6 years ago

The link above (https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) had different UMAP metrics that are supported. Are those available in the version you have?

xiaohk commented 6 years ago

Oops, sorry. I was trying to say it "does support".

xiaohk commented 6 years ago

I tried to do UMAP feature visualization and hierarchical clustering on all combined features. Each image has 4,068 features and we have 340,304 images from 150 plates. The total feature size is 5 GB.

UMAP

I use cosine metric in UMAP, and it took about 5 hours to reduce the feature dimension. I plotted 50 features in one plot and there are 1,043 plots in total. I combine the DMSO features across all plates and count them as one type.

The result of plots looks weird. In half of the plates, majority points are in the upper cluster while points are located in the lower region in the other half.

umap_24618_2

umap_25568_1

umap_25704_3

umap_25852_7

I generated the features in multiple jobs. Even if the codes are all identical, It might somehow introduced variance in the feature representation? I will try to regenerate features in one job and plot the UMAP again.
All features fall in the DMSO region. If my feature representation is correct, it can be a bad sign. If the results look the same after I regenerate the features, we want to examine the raw images of representative images.
At least we know UMAP scales very well on our dataset.

Hierarchical Clustering

I use cosine metric in the hierarchical clustering. It has not finished yet (took more than 30 hours), but the job has not exceeded the 128GB memory limit either. Based to the results of UMAP, I guess the clustering might also yield "incorrect" results.

It seems tricky to run hierarchical clustering on multi-threads.

agitter commented 6 years ago

It is possible that these DMSO clusters represent technical or "batch" effects and that the signal in the image-based features is mostly driven by those instead of something biological.

We could cluster the 2D UMAP representation into 3 groups and then ask for each plate

What fraction of DMSO images fall into each group?
What fraction of non-DMSO images fall into each group? That may tell us whether the DMSO images give an indication of the "batch".

Hopefully the representative images may show us whether there are obvious factors, such as the number of cells in the image, that dominate the signal and have a stronger effect than the morphology of the cells in the image.

If there are batch effects, we would need to work on a plate normalization. This could involve comparing images only to the DMSO images on the same plate.

xiaohk commented 6 years ago

Rerunning the feature extraction and UMAP gives the same result.

umap_24277_5 umap_25639_1 umap_25858_5

I am writing script to merge selected images so we can see the raw images more easily. It seems in some plates the majority images are black.

agitter commented 6 years ago

It seems in some plates the majority images are black.

This is an important observation. We can see whether the entropy or total intensity (per channel) distributions are also bimodal. That could indicate one of the UMAP clusters is empty images.

Revising the plan from https://github.com/xiaohk/pharmaco-image/issues/5#issuecomment-408903249, now it looks like there are only 2 clusters instead of 3.

xiaohk commented 6 years ago

Here are the gaussian mixture models clusters result:

umap_24311_1

I will try to massage the model to find better clusters.

The shape of distribution is different because I am using a different epoch number for the UMAP extraction. We still want to find 3 clusters.

agitter commented 6 years ago

To keep things simple, we could use 2 clusters at first. The clustering is very robust and we would not have to spend more time trying to perfect the clusters. My main interest here is to see whether there is a cluster bias plate-by-plate and whether random examples from each cluster are obviously different from each other in some way. Those should be attainable even with 2 clusters.

xiaohk commented 6 years ago

1bf4f4a6da8dfe4ffbff0444eea7de034edd2b04 adds the inspection code for 2 clusters.

I randomly sampled 100 non-DMSO shots from 150 plates, and get the following inspection result.

Group 0 Channel 123

g0c123

Group 1 Channel 123

g1c123

Group 0 Channel 45

g0_c45

Group 1 Channel 45

g1c45

Comments

The main difference between group 0 and group 1 is the image intensity. Group 1 seems to have higher intensity on average in all channels, but there are some exceptions in group 0.
Group 1 images have obvious gaps between cells.
Are these differences biological signals?
This doesn't really explain the inconsistency of clustering across plates. The images are sorted by plate numbers, and we can see there seems to be some "consistent" bias in the order of plates. It may relate to the observation that clustering flipped after a certain plates while processing plates in order.
I will do more plate-specific inspections.

agitter commented 6 years ago

Some ideas for systematically exploring the trends noted above:

To check if there is an association between plate number and group (0 or 1), we can plot the distribution of plate numbers for all images in each group. That is, a separate histogram of plate numbers for group 0 images and then group 1 images.
We could also repeat this for well and site numbers.
mean_well_profiles.csv has continuous morphology features for each site. We can to t-tests for each of these features by comparing the distribution of feature values for group 0 images versus the distribution for group 1 images.
Similarly, extracted_features/24278.sqlite has additional extracted features for cells in each site.
Plate number versus number of cells could also reveal biases or trends.
To eliminate any real biological effects, we could also do this for only the subset of sites that were treated with DMSO.

xiaohk commented 6 years ago

Since the changes we have inspected are related to channel intensities, I first explored the relationship with mean_intensity on well-level over plate numbers. The mean_intensity here means the average intensity of all images taken on one well.

Below is the plot of 80 randomly sampled plates (total 244) of five channels. From this sample, we can infer a large variance of intensities over all plates.

mean_intensity_80

The intensity values can be grouped into multiple chunks, and the distribution of chunks are rather consistent over five channels. It matches the sudden changes we have seen in clustering.
There are extremely small variances of wells within some plates in the middle of the plot. All sampled plates have the same number of wells (384).
The majority outliers are above the median but it flips for the last few plates.

xiaohk commented 6 years ago

Below is the cell_number vs randomly 80 plates (not the same 80 plates as in the comment above):

cell_number

Below is the mean_intensity vs all 244 plates:

all_intensity

agitter commented 6 years ago

In the short term, we can send the final figure to Alex and Melissa to see if these intensity biases are common in microscopy and whether there are standard corrections.

There are standard ways to normalize images to have the same mean intensity, but if other attributes of the images are affected by the batch effect then a simple intensity correction will not work.

Ann Carpenter has work on weak supervision that could be relevant. In this dataset, we could constrain the representation of DMSO treated cells to be the same in all instances. That would force the network to learn how to remove the batch effect. We can assess whether it worked by inspecting the UMAP visualization.

There is also a less-related paper on batch normalization for single-cell RNA-seq data: https://www.biorxiv.org/content/early/2018/08/27/237065.1

For expediency, we could start by subsampling the DMSO treated images and and equal number of non-DMSO images.

We can also regenerate a version of the last plot for only the DMSO images.

agitter commented 6 years ago

Alex suggested normalizing images by their nearest DMSO image or by the cell count. Before attempting this normalization, should we look at the ratio of intensity / cell count for all plates to see if that is more homogeneous?

xiaohk commented 6 years ago

Here is the distribution of cell count over 244 plates:

cell_number

The distribution of mean intensity of DMSO over 244 plates:

dmso_intensity

The ratio of intensity / cell count over 244 plates:

mean_intensity_cell_count

The averages of ratios across all plates look more consistent than the intensity plot. The distribution of outliers matches the pattern we have seen in the intensity plot.

xiaohk commented 6 years ago

We have downloaded all the plates. Here are the new bias inspection plots for all plates. As expected, the batch effects still exist. From the plots we can also find some problematic plate.

xiaohk commented 5 years ago

I have checked and re-downloaded the corrupted mega data files. Even though technically we can download 406 plates, there are only 349 entries listed on the provided checksum file. I think we should consider those 349 plates as the "whole dataset". Below are the new bias distribution plots.

Mean Intensity

full size image

Mean Intensity of DMSO

full size image

Comments

It is very common to download corrupted files, and some files seem to prone to packet loss. I repeated check-and-download 4 times to make all metadata files have the same checksum values.
There is still one broken megadata tarball (plate 25575) even it has the correct checksum. The plots above only include 348 plates. Maybe we should report it to the dataset maintainer?
There is no checksum file provided for the raw images, and we don't have a good way to systematically check all plates. The extracted features still miss 32 plates due to zip file incompleteness.

xiaohk commented 5 years ago

Batch Assumption and UMAP Visualization

I have divided the 348 plates into 8 batches based on their batch effect seen in the plots. Then we can assume that within one batch, all DMSO wells from different plates have the same distribution.

We then can use this assumption to improve our UMAP visualization. We used to consider all DMSO as one compound having the same effect (one color) and group all of them to compare with other compounds. Now, I think it is better to do in-batch comparisons.

Group 1	Group 2	Group 3	Group 4	Group 5	Group 6	Group 7	Group 8
# of Plates	14	69	28	62	28	45	30	41

Each plot below includes the first three plates within that batch.

Comments

Clusters in each batch are very different from clusters from the other batches. It was indeed confusing to mix all DMSO wells together.
All UMAP features are generated together. Therefore, we can stack all batches in the 2D space. Can batch effects explain the significant changes of distributions?
Within each batch, there are always two major clusters. I guess one of them includes dim images.

agitter commented 5 years ago

Closed by #9

gitter-lab / pharmaco-image