Closed xiaohk closed 2 years ago
I noticed that they do include U2OS as one of their four cell types, but it is the least frequently surveyed of the four. The focus on separating biological from technical effects is interesting. I wonder whether that implies they are still struggling with batch effects internally.
The paragraph on how not to use the dataset was also worth reading:
> As the images in RxRx1 are generated by carrying out biological experiments using reagents known as siRNAs, which are designed to target and knockdown a specific gene (more on this in another section), some may be tempted to use this to identify gene-specific morphological changes. DO NOT DO THIS. siRNAs are known to have significant off-target effects which you only have the chance to overcome through a number of computational methods and using multiple siRNAs per gene. As this dataset only includes one siRNA per gene for a random subset of genes, do not attempt to identify gene-specific signal. There are many ways you can convince yourself you have succeeded in this. You will be wrong. The data provided is insufficient for that task, and should thus be used to conduct research focused on alternative problems only. Just for clarity because we know somebody will ignore the warnings above, we’ll state it again more clearly: DO NOT USE THIS DATASET TO TRY TO GET AT GENE-SPECIFIC CHANGES. IT WILL NOT WORK.
I've worked with siRNA data before, so I understand their warnings. I'm curious whether any of that warning also applies to inferring compound-specific effects from Cell Painting, as we have been attempting.
Recursion is going to release a 296 GB Cell Painting dataset as a Kaggle competition. This dataset uses the same Cell Painting protocol (Bray et al., 2016), but with 6 stains (6 channels).
Their experiment is different from U2OS cancer cell drug testing. They have documented the existence of batch effects and framed batch normalization as an open research opportunity.
The dataset is coming out in June. We can keep an eye on it.
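For reference, here is a minimal sketch of one common baseline for the batch-effect problem mentioned above: z-scoring each feature within its own experimental batch. This is only an assumption about what a simple "batch normalization" baseline could look like, not Recursion's method. The helper name `standardize_per_batch` and the toy feature values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_batch(features, batch_ids, eps=1e-8):
    """Z-score every feature column separately within each batch.

    Hypothetical helper for illustration only.
    features: list of rows, each a list of floats (one row per sample).
    batch_ids: list of batch labels, one per row.
    """
    # Group row indices by batch label.
    rows_by_batch = defaultdict(list)
    for i, b in enumerate(batch_ids):
        rows_by_batch[b].append(i)

    n_features = len(features[0])
    out = [[0.0] * n_features for _ in features]

    for rows in rows_by_batch.values():
        for j in range(n_features):
            col = [features[i][j] for i in rows]
            mu = mean(col)
            sigma = pstdev(col)  # population standard deviation
            for i in rows:
                # eps guards against division by zero for constant columns.
                out[i][j] = (features[i][j] - mu) / (sigma + eps)
    return out


# Toy data (made up): batch "B" has the same pattern as "A" plus an offset.
features = [
    [1.0, 2.0], [2.0, 4.0], [3.0, 6.0],   # batch A
    [6.0, 7.0], [7.0, 9.0], [8.0, 11.0],  # batch B
]
batch_ids = ["A", "A", "A", "B", "B", "B"]
corrected = standardize_per_batch(features, batch_ids)
```

After this step the constant offset between the two toy batches is gone. The caveat is that per-batch standardization also removes real biological differences whenever a treatment is confounded with a batch, so it is a starting point rather than a solution.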