gitter-lab / pharmaco-image


Accurate Prediction of Biological Assays with High-Throughput Microscopy Images and Convolutional Networks #15

Open agitter opened 4 years ago

agitter commented 4 years ago

https://doi.org/10.1021/acs.jcim.8b00670

Predicting the outcome of biological assays based on high-throughput imaging data is a highly promising task in drug discovery since it can tremendously increase hit rates and suggest novel chemical scaffolds. However, end-to-end learning with convolutional neural networks (CNNs) has not been assessed for the task of biological assay prediction despite the success of these networks at visual recognition. We compared several CNNs trained directly on high-throughput imaging data to a) CNNs trained on cell-centric crops and to b) the current state-of-the-art: fully connected networks trained on precalculated morphological cell features. The comparison was performed on the Cell Painting data set, the largest publicly available data set of microscopic images of cells, with approximately 30,000 compound treatments. We found that CNNs perform significantly better at predicting the outcome of assays than fully connected networks operating on precomputed morphological features of cells. Surprisingly, the best performing method could predict 32% of the 209 biological assays at high predictive performance (AUC > 0.9), indicating that the cell morphology changes contain a large amount of information about compound activities. Our results suggest that many biological assays could be replaced by high-throughput imaging together with convolutional neural networks and that the costly cell segmentation and feature extraction step can be replaced by convolutional neural networks.

This looks very relevant to our project. I believe it is the updated version of the paper from #7. We should see what their main conclusions are to help us decide what to do next.

I don't necessarily agree with the claim in the abstract that the number of assays with AUC > 0.9 is the right measure of success. Still, if we review our existing models, do we know how well they do by this metric?

agitter commented 4 years ago

The senior author also sent me the link to their code https://github.com/ml-jku/hti-cnn

xiaohk commented 4 years ago

Yes, they are the same authors as the paper from #7.

We convert the raw 16bit TIFF files to 8bit to reduce data loading time. In doing so we also remove extreme outlier pixel values by removing the 0.0028% highest pixels prior to this conversion.

The original raw images are actually 12-bit. I converted them to 16-bit in my pipeline; perhaps it would be better to convert them to 8-bit instead.
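If we do switch, something like the sketch below (assuming numpy and tifffile, with a placeholder path) would mimic their preprocessing: clip the brightest 0.0028% of pixel values, then rescale linearly into the 8-bit range.

```python
import numpy as np
import tifffile


def convert_to_8bit(path, clip_percent=0.0028):
    """Load a raw TIFF and convert it to 8-bit, clipping extreme outliers first."""
    img = tifffile.imread(path).astype(np.float64)
    # Clip the top 0.0028% of pixel values to suppress extreme outliers
    upper = np.percentile(img, 100.0 - clip_percent)
    lower = img.min()
    img = np.clip(img, lower, upper)
    # Rescale the remaining intensity range linearly into [0, 255]
    img = (img - lower) / (upper - lower + 1e-12) * 255.0
    return img.astype(np.uint8)
```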

Furthermore, we normalize each image individually to mean zero and a standard deviation of one. This strategy can be viewed as illumination correction.

I didn't do this. It might also help remove batch effects.
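Adding it would be a small change to our pipeline. A minimal numpy sketch of the per-image standardization they describe, assuming we apply it after loading each image:

```python
import numpy as np


def standardize_image(img, eps=1e-8):
    """Standardize a single image to mean 0 and standard deviation 1."""
    img = img.astype(np.float32)
    # Normalizing each image independently also acts as a crude illumination correction
    return (img - img.mean()) / (img.std() + eps)
```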

we did not combine these images to obtain one large image per screen but rather used each view image individually for training and only combined the network outputs by averaging predictions.

I treat each view as a single instance, whereas they treat each well (6 views) as one instance during testing.
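If we want to compare against their evaluation setup, we could aggregate our per-view predictions into per-well predictions after inference. A rough pandas sketch, with hypothetical well IDs and assay score columns:

```python
import pandas as pd

# One row per view image: the well it came from plus the model's per-assay scores
# (well IDs and the "assay_1" column are made up for illustration)
view_preds = pd.DataFrame({
    "well_id": ["A01", "A01", "A01", "B02", "B02", "B02"],
    "assay_1": [0.91, 0.85, 0.88, 0.12, 0.20, 0.15],
})

# Collapse the per-view scores into one prediction per well by averaging
well_preds = view_preds.groupby("well_id").mean()
print(well_preds)
```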

Therefore, the loss for all output units for unlabeled assays for a given sample were masked by multiplying it with zero before performing back-propagation to update the parameters of the network during training.

They need to do this for multi-task learning. I didn't use missing labels at all during training (single-task).
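For reference, their masking amounts to a per-element weighted loss. A minimal PyTorch sketch (the function name and the 0/1 mask convention are my own, not taken from their code):

```python
import torch
import torch.nn.functional as F


def masked_bce_loss(logits, labels, mask):
    """Multi-task BCE where unlabeled (compound, assay) entries contribute zero loss.

    logits, labels, mask: float tensors of shape (batch, n_assays);
    mask is 1 where a label exists and 0 where the assay is unmeasured.
    """
    per_element = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    # Zero out the loss for missing labels before averaging over the observed ones
    per_element = per_element * mask
    return per_element.sum() / mask.sum().clamp(min=1)
```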

We identified each chemical compound in the large bioactivity database ChEMBL. In this way, we obtained labels for this data set as drug activity data for 10,574 compounds across 209 assays (we only included assays for which at least 10 active and 10 inactive measurements exist).

With ExCAPE-DB, our output matrix is actually 27,241 compounds x 212 assays. We have many more overlapping compounds, so arguably our setup is more challenging.


We have 20 out of 209 assays (9.57%) with AUC > 0.9 using random forest with fingerprints, 12 out of 206 assays (5.83%) with AUC > 0.9 using logistic regression with CellProfiler features, and 93 out of 208 assays (44.71%) with AUC > 0.9 using logistic regression with CNN features.
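For anyone re-running this comparison, something like the following sketch reproduces this kind of count (assuming scikit-learn, a label matrix with NaN for missing entries, and skipping assays whose held-out labels contain only one class):

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def count_high_auc(y_true, y_score, threshold=0.9):
    """Count assays whose held-out ROC AUC exceeds the threshold.

    y_true, y_score: arrays of shape (n_compounds, n_assays); NaNs in y_true
    mark missing labels and are skipped, as are assays with only one class.
    """
    aucs = []
    for j in range(y_true.shape[1]):
        labeled = ~np.isnan(y_true[:, j])
        labels = y_true[labeled, j]
        if labeled.sum() == 0 or len(np.unique(labels)) < 2:
            continue  # AUC is undefined without both classes present
        aucs.append(roc_auc_score(labels, y_score[labeled, j]))
    aucs = np.array(aucs)
    # Return (number of assays above threshold, number of evaluable assays)
    return (aucs > threshold).sum(), len(aucs)
```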