Project summary

Here is a list of our experiments and findings:

Use hierarchical clustering on image features extracted using a pre-trained CNN (#4)
- Different distance functions give different results
- We chose cosine distance as our distance function (it is also used in UMAP)
- Hierarchical clustering does not scale well (https://github.com/gitter-lab/pharmaco-image/issues/5#issuecomment-408887512)
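To make the first and last points concrete, here is a minimal sketch (not the project's code) showing how the choice of distance function can change which feature vector counts as "nearest", and how fast the number of pairwise distances grows with the number of images. The 2-D vectors and all function names are made up for illustration; real CNN features would have many more dimensions.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, candidates, dist):
    """Name of the candidate vector closest to `query` under `dist`."""
    return min(candidates, key=lambda name: dist(query, candidates[name]))


def n_pairwise(n):
    """Number of distinct pairs among n items (size of a full distance table)."""
    return n * (n - 1) // 2


# Hypothetical toy "features", chosen so the two distances disagree.
query = (1.0, 0.0)
candidates = {"same_direction": (10.0, 0.0), "close_by": (1.0, 1.0)}

print(nearest(query, candidates, euclidean))        # close_by
print(nearest(query, candidates, cosine_distance))  # same_direction
print(n_pairwise(10_000))                           # 49995000
```

Agglomerative clustering that starts from a full pairwise distance table needs one entry per pair, so the table grows quadratically with the number of images. That is one likely reason the approach becomes impractical on many plates.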

Use t-SNE and UMAP to visualize CNN features (#1, #2)
- Most plots show two clusters: a larger one and a smaller one
- Sometimes DMSO images are in the smaller cluster
- One cluster contains images with few or no cells (https://github.com/gitter-lab/pharmaco-image/issues/5#issuecomment-421025418)
- When plotting many more plates, we find that the cluster distribution of DMSO images is inconsistent across plates. This led to our observation of batch effects (https://github.com/gitter-lab/pharmaco-image/issues/5#issuecomment-408887512)

Batch effects detection (#9)
- Based on box plots of well mean_intensity vs. plate number on DMSO images, we are confident that batch effects exist in this dataset (https://github.com/gitter-lab/pharmaco-image/issues/5#issuecomment-438837856)
- When we divide plates into artificially decided batches and plot a UMAP for each batch, the clusters become consistent within each batch
- Another popular way to detect batch effects is to plot a feature correlation heat map (https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-456881553)
- Use interactive visualization to detect batch effects
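The box-plot check in the first bullet can be approximated numerically. The sketch below is an illustration rather than the analysis actually run: it groups DMSO wells by plate, takes the median of their mean intensity, and flags plates whose median is far from the median across plates. The data values, the function names, and the `tolerance` threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def per_plate_median(wells):
    """wells: iterable of (plate_id, mean_intensity) pairs for DMSO wells."""
    by_plate = defaultdict(list)
    for plate, intensity in wells:
        by_plate[plate].append(intensity)
    return {plate: median(values) for plate, values in by_plate.items()}


def flag_shifted_plates(plate_medians, tolerance):
    """Plates whose median differs from the cross-plate median by > tolerance."""
    overall = median(plate_medians.values())
    return sorted(
        plate for plate, m in plate_medians.items() if abs(m - overall) > tolerance
    )


# Made-up DMSO measurements: plate 3 is visibly brighter than plates 1 and 2.
dmso_wells = [
    (1, 0.20), (1, 0.22), (1, 0.21),
    (2, 0.19), (2, 0.21), (2, 0.20),
    (3, 0.35), (3, 0.37), (3, 0.36),
]

medians = per_plate_median(dmso_wells)
shifted = flag_shifted_plates(medians, tolerance=0.05)
print(medians)
print(shifted)
```

Because DMSO wells are controls, a systematic shift between plates in this summary points to plate-level (batch) variation rather than a compound effect.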

Test compounds with known effects (#9)
- Compounds with different sensitivities are distributed throughout the UMAP plots without an obvious pattern (https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-467925356)
- We couldn't see trends when sorting cell images based on compound sensitivity (https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-470124988)

Use cell images as compound features to predict ExCAPE assay activity (#7)
- We have tried random forest with fingerprint features, logistic regression with CNN features (before and after normalization), random forest with CNN features (after normalization), and logistic regression with CellProfiler features.
- Results for these models are listed below. We need a more rigorous comparison before drawing conclusions from them.
- New direction: we do not need a model that performs well on all assays. A model is useful as long as it predicts well on one assay (#12).
- Implement compound scaffold cross-validation, and try LeNet and VGG on selected assays. These models currently predict only one class.
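As a rough sketch of the idea behind scaffold cross-validation (not the project's implementation), the code below assigns whole scaffold groups to folds, so that compounds sharing a scaffold never end up on both sides of a train/test split. It assumes each compound already has a scaffold label, for example one computed with a cheminformatics toolkit. The compound IDs, scaffold labels, and the greedy balancing rule are all hypothetical.

```python
from collections import defaultdict


def scaffold_folds(scaffold_of, n_folds):
    """Split compounds into folds without splitting any scaffold group.

    scaffold_of: dict mapping compound id -> scaffold label (assumed precomputed).
    Returns a list of n_folds lists of compound ids.
    """
    groups = defaultdict(list)
    for compound, scaffold in scaffold_of.items():
        groups[scaffold].append(compound)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold. Ties are broken by scaffold label and fold index.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[scaffold])
    return folds


# Hypothetical compounds and scaffold labels.
scaffold_of = {
    "c1": "A", "c2": "A", "c3": "A",
    "c4": "B", "c5": "B",
    "c6": "C",
    "c7": "D",
}

folds = scaffold_folds(scaffold_of, n_folds=2)
for i, fold in enumerate(folds):
    print(i, fold)
```

Each fold then serves once as the test set while the remaining folds are used for training. Compared with a random split, this tends to give a more conservative estimate of how well a model generalizes to structurally new compounds.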

| Model | Assays | F1 | Accuracy | Average Precision | AUC | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RF with fingerprint feature | 209 | 13.18% | 90.72% | 31.74% | 71.22% | 30.44% | 11.43% |
| LR with CNN feature (before normalization) | 212-4 | 34.87% | 84.48% | 33.12% | 85.22% | 29.77% | 75.17% |
| LR with CNN feature (after normalization) | 209 | 17.02% | 62.64% | 15.54% | 56.80% | 14.59% | 46.73% |
| RF with CNN feature (after normalization) | 210 | 8.44% | 90.49% | 22.11% | 70.69% | 22.38% | 8.81% |
| LR with CellProfiler mean-well feature | 206 | 24.74% | 81.07% | 24.54% | 69.52% | 20.83% | 42.27% |
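When reading the table, it helps to remember how accuracy and F1 can diverge. The sketch below uses made-up labels, assuming (this is an assumption, not something stated above) that active compounds are a small minority in an assay. A model that predicts only the inactive class still gets high accuracy, while its recall and F1 are zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical assay: 10 active and 90 inactive compounds.
y_true = [1] * 10 + [0] * 90
# A model that predicts only one class (always "inactive").
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

This is why a high accuracy paired with a low recall and F1 should not be read as good performance on its own. Threshold-free metrics such as AUC and average precision are usually more informative in this setting.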
