Open xiaohk opened 5 years ago
I'm adding some initial thoughts, and we can discuss this more in person.
A lot of the effort was spent exploring batch effects. We did not find much explicit guidance on how to detect and correct for batch effects in this type of data. A paper that highlighted the batch effects and offered analyses showing the pros and cons of different correction strategies may not be the most exciting, but it would be valuable. We would need to do a more thorough literature search to make sure there is no highly similar related work already.
All of the work to combine ExCAPE with the Cell Painting dataset is also valuable. Even without the downstream predictive modeling, we could write about how to align these two datasets and make the resource available. Combining them is non-trivial, but the result would still be highly derivative of the existing datasets.
Lastly, we have the assay activity prediction work. This story would be more complete if the LeNet or VGG CNNs worked. It is counterintuitive that the LR with CNN features is worse after normalization. It might take a lot of work and compute time to do final runs for rigorous comparisons.
Here is a list of our experiments and findings:
- `cosine` as our distance function (used in UMAP as well)
- Based on `mean_intensity` vs. plate number on DMSO images, we are confident that batch effects exist in this dataset (https://github.com/gitter-lab/pharmaco-image/issues/5#issuecomment-438837856)
- `DMSO within-plate`, `All feature within-plate`, `DMSO across-plate`
- `DMSO within-plate` and `All feature within-plate` can remove batch effects (https://github.com/gitter-lab/pharmaco-image/issues/9#issuecomment-461027396)
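
To make the within-plate options concrete for discussion, here is a minimal sketch of what per-plate normalization could look like. It is not the code we ran. It assumes that `DMSO within-plate` means z-scoring each feature against the DMSO wells of the same plate, and that `All feature within-plate` means z-scoring against all wells of the same plate. The function name `normalize_within_plate`, the record keys (`plate`, `is_dmso`, `features`), and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_within_plate(wells, reference="dmso"):
    """Z-score features per plate (hypothetical helper, not the project's code).

    wells: list of dicts with keys 'plate', 'is_dmso', 'features' (list of floats).
    reference: 'dmso' uses only the DMSO wells of each plate for mean/std
               (assumed meaning of "DMSO within-plate");
               'all' uses every well of the plate
               (assumed meaning of "All feature within-plate").
    """
    # Group wells by plate so statistics are computed per plate.
    by_plate = defaultdict(list)
    for well in wells:
        by_plate[well["plate"]].append(well)

    # Per-plate mean and standard deviation for every feature column.
    stats = {}
    for plate, plate_wells in by_plate.items():
        if reference == "dmso":
            ref = [w for w in plate_wells if w["is_dmso"]]
        else:
            ref = plate_wells
        if not ref:
            raise ValueError(f"plate {plate!r} has no reference wells")
        n_features = len(ref[0]["features"])
        mus = [mean(w["features"][j] for w in ref) for j in range(n_features)]
        # Fall back to 1.0 when a feature is constant to avoid dividing by zero.
        sds = [pstdev([w["features"][j] for w in ref]) or 1.0
               for j in range(n_features)]
        stats[plate] = (mus, sds)

    # Apply the plate-specific statistics, keeping the input order.
    normalized = []
    for well in wells:
        mus, sds = stats[well["plate"]]
        z = [(x - m) / s for x, m, s in zip(well["features"], mus, sds)]
        normalized.append({**well, "features": z})
    return normalized


# Toy data: plate 2 is plate 1 shifted by +10, mimicking a plate-level offset.
toy_wells = [
    {"plate": 1, "is_dmso": True, "features": [1.0]},
    {"plate": 1, "is_dmso": True, "features": [3.0]},
    {"plate": 1, "is_dmso": False, "features": [5.0]},
    {"plate": 2, "is_dmso": True, "features": [11.0]},
    {"plate": 2, "is_dmso": True, "features": [13.0]},
    {"plate": 2, "is_dmso": False, "features": [15.0]},
]
dmso_normalized = normalize_within_plate(toy_wells, reference="dmso")
all_normalized = normalize_within_plate(toy_wells, reference="all")
```

In this toy case the two compound wells differ by 10 before normalization and get identical values afterwards under either reference set, which is the kind of plate-level offset a within-plate scheme can absorb. An across-plate scheme that pools reference wells from all plates would not remove such an offset by construction.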