Closed agitter closed 5 years ago
I tried to extract features from the first 243 plates (out of about 360 plates). It uses up 1.6TB/2TB in Gluster. The extracted features are just 5GB in total.
Some of the SQL files were corrupted, even after multiple attempts to transfer the files. We cannot find md5 checksums, so we may need to ask Anne Carpenter for help. We may have sufficient plates to continue our exploratory analysis and can follow up about the corrupted files later.
We believe that UMAP should work on all of the extracted images. Hierarchical clustering may not. There are special techniques for clustering large datasets, and we may want to look for Python implementations of those. AnHai Doan has worked on this problem. I will contact him.
We will also consider extracting features from earlier layers based on what we observed with the T cells.
AnHai and Adel's clustering algorithm is for bit vectors only. We can look at the clustering program matrix from Adel to see if he has identified more general approximate clustering packages.
We could approximate the hierarchical clustering ourselves by doing initial clustering per plate, merging compounds in the same clustering, and then clustering across plates. We prefer to use and off-the-shelf method though.
CHTC has high-memory machines that could be necessary for a full hierarchical clustering. We can contact them to learn what special considerations there may be for using these machines.
We can check whether scikit-learn uses multiple cores during hierarchical clustering. This may be controlled by NumPy or a lower library. We can confirm it works with our small scale tests by scaling from 1, 2, 4 cores.
UMAP likely can handle the full dataset, but if it fails it may be able to learn a stable mapping from the original feature space to the lower dimension space. If so, we could run UMAP on X% of the data and then map the 1-X % into the existing lower dimensional space. Then we could visualize compounds by plate or in smaller groups.
@xiaohk for the UMAP dimension reduction, what distance measure are you using? Should we use cosine similarity there as well to be consistent with the hierarchical clustering decision in #4?
Good suggestion! I was using the default euclidean
metric. The UMAP package doest does support cosine
, and it makes sense to keep the distance measure consistent.
The link above (https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) had different UMAP metrics that are supported. Are those available in the version you have?
Oops, sorry. I was trying to say it "does support".
I tried to do UMAP feature visualization and hierarchical clustering on all combined features. Each image has 4,068 features and we have 340,304 images from 150 plates. The total feature size is 5 GB.
I use cosine
metric in UMAP, and it took about 5 hours to reduce the feature dimension. I plotted 50 features in one plot and there are 1,043 plots in total. I combine the DMSO features across all plates and count them as one type.
The result of plots looks weird. In half of the plates, majority points are in the upper cluster while points are located in the lower region in the other half.
I use cosine
metric in the hierarchical clustering. It has not finished yet (took more than 30 hours), but the job has not exceeded the 128GB memory limit either. Based to the results of UMAP, I guess the clustering might also yield "incorrect" results.
It seems tricky to run hierarchical clustering on multi-threads.
It is possible that these DMSO clusters represent technical or "batch" effects and that the signal in the image-based features is mostly driven by those instead of something biological.
We could cluster the 2D UMAP representation into 3 groups and then ask for each plate
Hopefully the representative images may show us whether there are obvious factors, such as the number of cells in the image, that dominate the signal and have a stronger effect than the morphology of the cells in the image.
If there are batch effects, we would need to work on a plate normalization. This could involve comparing images only to the DMSO images on the same plate.
Rerunning the feature extraction and UMAP gives the same result.
I am writing script to merge selected images so we can see the raw images more easily. It seems in some plates the majority images are black.
It seems in some plates the majority images are black.
This is an important observation. We can see whether the entropy or total intensity (per channel) distributions are also bimodal. That could indicate one of the UMAP clusters is empty images.
Revising the plan from https://github.com/xiaohk/pharmaco-image/issues/5#issuecomment-408903249, now it looks like there are only 2 clusters instead of 3.
Here are the gaussian mixture models clusters result:
I will try to massage the model to find better clusters.
The shape of distribution is different because I am using a different epoch number for the UMAP extraction. We still want to find 3 clusters.
To keep things simple, we could use 2 clusters at first. The clustering is very robust and we would not have to spend more time trying to perfect the clusters. My main interest here is to see whether there is a cluster bias plate-by-plate and whether random examples from each cluster are obviously different from each other in some way. Those should be attainable even with 2 clusters.
1bf4f4a6da8dfe4ffbff0444eea7de034edd2b04 adds the inspection code for 2 clusters.
I randomly sampled 100 non-DMSO shots from 150 plates, and get the following inspection result.
Some ideas for systematically exploring the trends noted above:
mean_well_profiles.csv
has continuous morphology features for each site. We can to t-tests for each of these features by comparing the distribution of feature values for group 0 images versus the distribution for group 1 images.extracted_features/24278.sqlite
has additional extracted features for cells in each site.Since the changes we have inspected are related to channel intensities, I first explored the relationship with mean_intensity
on well-level over plate numbers. The mean_intensity
here means the average intensity of all images taken on one well.
Below is the plot of 80 randomly sampled plates (total 244) of five channels. From this sample, we can infer a large variance of intensities over all plates.
Below is the cell_number
vs randomly 80 plates (not the same 80 plates as in the comment above):
Below is the mean_intensity
vs all 244 plates:
In the short term, we can send the final figure to Alex and Melissa to see if these intensity biases are common in microscopy and whether there are standard corrections.
There are standard ways to normalize images to have the same mean intensity, but if other attributes of the images are affected by the batch effect then a simple intensity correction will not work.
Ann Carpenter has work on weak supervision that could be relevant. In this dataset, we could constrain the representation of DMSO treated cells to be the same in all instances. That would force the network to learn how to remove the batch effect. We can assess whether it worked by inspecting the UMAP visualization.
There is also a less-related paper on batch normalization for single-cell RNA-seq data: https://www.biorxiv.org/content/early/2018/08/27/237065.1
For expediency, we could start by subsampling the DMSO treated images and and equal number of non-DMSO images.
We can also regenerate a version of the last plot for only the DMSO images.
Alex suggested normalizing images by their nearest DMSO image or by the cell count. Before attempting this normalization, should we look at the ratio of intensity / cell count for all plates to see if that is more homogeneous?
Here is the distribution of cell count over 244 plates:
The distribution of mean intensity of DMSO over 244 plates:
The ratio of intensity / cell count over 244 plates:
The averages of ratios across all plates look more consistent than the intensity plot. The distribution of outliers matches the pattern we have seen in the intensity plot.
We have downloaded all the plates. Here are the new bias inspection plots for all plates. As expected, the batch effects still exist. From the plots we can also find some problematic plate.
I have checked and re-downloaded the corrupted mega data files. Even though technically we can download 406 plates, there are only 349 entries listed on the provided checksum
file. I think we should consider those 349 plates as the "whole dataset". Below are the new bias distribution plots.
checksum
values.checksum
. The plots above only include 348 plates. Maybe we should report it to the dataset maintainer?checksum
file provided for the raw images, and we don't have a good way to systematically check all plates. The extracted features still miss 32 plates due to zip file incompleteness.I have divided the 348 plates into 8 batches based on their batch effect seen in the plots. Then we can assume that within one batch, all DMSO wells from different plates have the same distribution.
We then can use this assumption to improve our UMAP visualization. We used to consider all DMSO as one compound having the same effect (one color) and group all of them to compare with other compounds. Now, I think it is better to do in-batch comparisons.
Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 | Group 7 | Group 8 | |
---|---|---|---|---|---|---|---|---|
# of Plates | 14 | 69 | 28 | 62 | 28 | 45 | 30 | 41 |
Each plot below includes the first three plates within that batch.
Closed by #9
We can discuss whether this is an appropriate time to try a large-scale run over all the images. We could see if our conclusions about how few compounds affect the cells hold once we look at all of the compounds. This would also give us an initial estimate of how long it takes to do any processing on the entire dataset.