Open gwaybio opened 5 years ago
The authors describe two distinct generative adversarial network (GAN) architectures that are both used to simulate single cell RNAseq data. The two architectures are:
- Single-cell GAN (scGAN)
- Conditional single-cell GAN (cscGAN)
They applied their models to two datasets:
The motivation of the approach is to simulate and therefore better study rare cell-types, and thus overcome potential sampling biases present in downstream analyses. The assumption is that generating these cells captures the given cell-type heterogeneity. The paper also nicely demonstrates the real ability of deep learning to generate single cell RNAseq data with high fidelity and that data augmentation can increase biological insights.
In addition to the basic description of the architectures above, the authors also included the following attributes in their models:
- cscGAN
- Library Size Normalization layer
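To make the Library Size Normalization layer named above more concrete, here is a minimal sketch. It assumes the layer rescales each generated cell so that its total expression sums to a fixed library size. The function name `library_size_normalize` and the default target of 20000 are illustrative assumptions, not taken from the authors' code.

```python
def library_size_normalize(cells, target=20000.0):
    """Rescale each cell (a list of non-negative gene values) to sum to `target`.

    Hypothetical helper: the name and the default target are assumptions.
    """
    normalized = []
    for cell in cells:
        total = sum(cell)
        if total == 0:
            # An all-zero cell cannot be rescaled; leave it unchanged.
            normalized.append(list(cell))
            continue
        scale = target / total
        normalized.append([value * scale for value in cell])
    return normalized


if __name__ == "__main__":
    generated = [[1.0, 3.0, 0.0], [2.0, 2.0, 4.0]]
    for cell in library_size_normalize(generated, target=100.0):
        print(cell, sum(cell))
```

In a real generator this step would be a differentiable layer applied to the network output; the plain-Python version only shows the arithmetic.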
The authors apply the architectures to the datasets and, through a series of insightful experiments, show the model's remarkable ability to generate artificial cells from specific target cell-type clusters (visualized through t-SNE plots). The t-SNE plots of cell types generated by Splatter suffered from mode collapse.
Additionally, in a downsampling experiment, the authors show the ability to detect a specific cell-type cluster even when it makes up a very low percentage of cells. They show that when these cells are generated, the cell type can still be detected. This may be a bit of circular logic, however, since the cell-type cluster was already defined a priori.
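The shape of such a downsampling and augmentation experiment can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: `downsample_cluster` and `augment_cluster` are hypothetical helpers, and the "generated" cells are placeholders for what a trained conditional generator would produce.

```python
import random


def downsample_cluster(cells, labels, cluster, keep_fraction, seed=0):
    """Keep only `keep_fraction` of the cells in `cluster`; keep all other cells."""
    rng = random.Random(seed)
    in_cluster = [i for i, lab in enumerate(labels) if lab == cluster]
    # Keep at least one cell, but never more than the cluster contains.
    n_keep = min(len(in_cluster), max(1, round(len(in_cluster) * keep_fraction)))
    kept = set(rng.sample(in_cluster, n_keep))
    idx = [i for i, lab in enumerate(labels) if lab != cluster or i in kept]
    return [cells[i] for i in idx], [labels[i] for i in idx]


def augment_cluster(cells, labels, cluster, generated_cells):
    """Append generated cells, labelled as `cluster`, to the training set."""
    generated_cells = list(generated_cells)
    return cells + generated_cells, labels + [cluster] * len(generated_cells)


if __name__ == "__main__":
    cells = [[float(i)] for i in range(20)]
    labels = ["rare"] * 10 + ["common"] * 10
    small_cells, small_labels = downsample_cluster(cells, labels, "rare", 0.2)
    # Placeholder values standing in for generator output.
    aug_cells, aug_labels = augment_cluster(
        small_cells, small_labels, "rare", [[0.5], [1.5], [2.5]]
    )
    print(small_labels.count("rare"), aug_labels.count("rare"))
```

A classifier trained on the downsampled set and on the augmented set could then be compared on held-out real cells. Because the cluster labels are supplied up front, the evaluation is supervised.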
The ability to generate samples with the provided architectures has many potential biological applications including the study of rare cell types and the ability to improve classification systems. It remains to be seen if additional applications with more biological focus can take advantage of the cscGAN's abilities. There were many metrics considered in model evaluation, but there were no explicit examples of specific gene expression programs that the GAN latent space may be learning. The t-SNE projections may be rescuing some differences between the real and simulated data.
The methods of the paper were very well described 😄, but there is no reference to publicly available source code ☹️ .
Hey Greg (and Greene Lab),
I'm Pierre Machart, one of the authors of the manuscript. Thanks a lot for your interest in our work! This is a good summary. Just in case it wasn't clear, our (non-conditional) scGAN also uses Batch Normalization (just not the conditional version of it). This being said, it also works very well without it... (It is essential in the conditional version, as it is the sole mechanism to condition the generation of cells.) The idea of using augmentation to allow for smaller populations, previously unidentified, to be clustered out separately is very interesting (we intend to cover that in future work). However, the way we evaluated the augmentation is fully supervised indeed. We do provide the cluster information a priori (they are just the outcome of the "classic" Louvain clustering in our case). We just show that the ability of a supervised classifier is enhanced by augmenting smaller populations. I hope that clarifies. There indeed isn't any public code yet but it will definitely happen soon. I will keep you posted about that and our future work (including the exploration of additional applications) if you're interested.
Let me know if you have questions. We are always happy to receive attention and feedback.
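For readers unfamiliar with the conditioning mechanism mentioned above, conditional batch normalization can be sketched as below. This follows the general technique (shared batch statistics, then a class-specific scale and shift) and is not the authors' implementation. The function name and the dictionaries `gamma` and `beta` are illustrative assumptions; in a trained model those parameters would be learned per cluster.

```python
import math


def conditional_batch_norm(batch, labels, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a per-class scale and shift.

    batch:  list of feature vectors (one per cell)
    labels: class (cluster) label for each cell
    gamma, beta: dicts mapping label -> per-feature scale / shift lists
    """
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]
    out = []
    for row, label in zip(batch, labels):
        normalized = [
            (row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)
        ]
        # The class label selects which scale and shift are applied.
        out.append([gamma[label][j] * normalized[j] + beta[label][j] for j in range(d)])
    return out


if __name__ == "__main__":
    batch = [[1.0], [3.0]]
    labels = ["a", "b"]
    gamma = {"a": [1.0], "b": [2.0]}
    beta = {"a": [0.0], "b": [10.0]}
    print(conditional_batch_norm(batch, labels, gamma, beta))
```

Because the label only enters through `gamma` and `beta`, two cells with identical inputs but different labels are mapped to different outputs, which is how the label can steer generation.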
Thanks for the response @pierremac - looking forward to the source code and future iterations of the paper! It seems to me that some pretty nifty latent space manipulations can happen fairly quickly with a well trained model
One of the many things we intend to look into, indeed! We also just came across this very recent manuscript: https://www.biorxiv.org/content/early/2018/07/30/262501 Their aim is not the realistic simulation of cells but it is closely related to our work and you will maybe appreciate that they put a bit more emphasis on latent space manipulations!
Fantastic, thanks for pointing to it. Want to summarize in a new issue? Contributions are welcome :)
Edit: linked with #909
The source code is finally available. Please find it here: https://github.com/imsb-uke/scGAN We'll be very happy to get your feedback!
https://doi.org/10.1101/390153