greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks #906

Open gwaybio opened 5 years ago

gwaybio commented 5 years ago

https://doi.org/10.1101/390153

A fundamental problem in biomedical research is the low number of observations available, mostly due to a lack of available biosamples, prohibitive costs, or ethical reasons. Augmenting few real observations with generated in silico samples could lead to more robust analysis results and a higher reproducibility rate. Here we propose the use of conditional single cell Generative Adversarial Neural Networks (cscGANs) for the realistic generation of single cell RNA-seq data. cscGANs learn non-linear gene-gene dependencies from complex, multi cell type samples and use this information to generate realistic cells of defined types. Augmenting sparse cell populations with cscGAN generated cells improves downstream analyses such as the detection of marker genes, the robustness and reliability of classifiers, the assessment of novel analysis algorithms, and might reduce the number of animal experiments and costs in consequence. cscGANs outperform existing methods for single cell RNA-seq data generation in quality and hold great promise for the realistic generation and augmentation of other biomedical data types.

gwaybio commented 5 years ago

Summary

The authors describe two distinct generative adversarial network (GAN) architectures that are both used to simulate single cell RNAseq data. The two architectures are:

  1. single cell generative adversarial network (scGAN)
    1. MLP with fully connected layers
      1. Generator: 256 -> 512 -> 1024
      2. Critic: 1024 -> 512 -> 256
  2. conditional single cell generative adversarial network (cscGAN)
    1. Difference from scGAN:
      1. Projection-Based Conditioning
      2. Multiple Critic Outputs (one per cell-type)
      3. Use of Conditional Batch Normalization layers
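To make the layer widths listed above concrete, here is a minimal sketch that expands them into dense weight shapes and counts parameters. The latent dimension, the number of genes, and the single scalar critic output are assumptions chosen only for illustration; the summary does not state them, and this is not the authors' code.

```python
def dense_layer_shapes(input_dim, hidden_widths, output_dim):
    """Return (n_in, n_out) weight shapes for a fully connected stack."""
    widths = [input_dim, *hidden_widths, output_dim]
    return list(zip(widths[:-1], widths[1:]))


def count_parameters(shapes):
    """Count weights plus biases for a list of dense layer shapes."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


LATENT_DIM = 128  # assumption: size of the generator's noise input
N_GENES = 2000    # assumption: number of genes per generated cell

# Hidden widths taken from the summary above.
generator_shapes = dense_layer_shapes(LATENT_DIM, [256, 512, 1024], N_GENES)
# Assumption: the non-conditional critic emits one scalar score per cell.
critic_shapes = dense_layer_shapes(N_GENES, [1024, 512, 256], 1)

print(generator_shapes)
print(count_parameters(generator_shapes), count_parameters(critic_shapes))
```

With these assumed dimensions, most of the parameters sit in the layers that touch the gene dimension, so the gene count drives model size far more than the hidden widths do.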

They applied their models to two datasets:

  1. 68,579 Cells (PBMCs) (Illumina NextSeq 500)
  2. 1.3 Million Cells (Mouse Brain) (10x Genomics)

The motivation of the approach is to simulate and therefore better study rare cell-types, and thus overcome potential sampling biases present in downstream analyses. The assumption is that generating these cells captures the given cell-type heterogeneity. The paper also nicely demonstrates the real ability of deep learning to generate single cell RNAseq data with high fidelity and that data augmentation can increase biological insights.

Computational Methods

In addition to the basic description of the architectures above, the authors also included the following attributes in their models:

  1. Wasserstein loss function
  2. AMSGrad optimization
  3. ReLU activation (except last layer of critic)
  4. Batch Normalization in cscGAN
  5. An interesting modification they call a Library Size Normalization layer
    1. The layer rescales input data to have the same total read count per cell
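The rescaling described in the last item can be sketched in a few lines of plain Python. This is only an illustration of the idea (every cell is scaled so its values sum to the same total), not the authors' layer; the function name and the `target_total` parameter are hypothetical.

```python
def library_size_normalize(cells, target_total=20000.0):
    """Rescale each cell (a list of non-negative values) to sum to target_total.

    target_total is a hypothetical default, not a value taken from the paper.
    Cells with a zero total are returned as all zeros to avoid dividing by zero.
    """
    normalized = []
    for cell in cells:
        total = sum(cell)
        if total == 0:
            normalized.append([0.0 for _ in cell])
        else:
            scale = target_total / total
            normalized.append([value * scale for value in cell])
    return normalized


example = library_size_normalize([[1, 1, 2], [10, 30, 60]], target_total=100.0)
print(example)  # [[25.0, 25.0, 50.0], [10.0, 30.0, 60.0]]
```

Relative expression within a cell is preserved; only the per-cell total changes.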

The authors apply the architectures to the datasets and, through a series of insightful experiments, show the model's remarkable ability to generate cells from specific target cell-type clusters (visualized through t-SNE plots). t-SNE plots of cell types generated by Splatter showed mode collapse.

Additionally, in a downsampling experiment, the authors show the ability to detect a specific cell-type cluster even when it makes up a very low percentage of cells. They show that when cells of this type are generated to augment the data, the cell type can still be detected. This may be somewhat circular, however, since the cell-type cluster was already defined a priori.
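The downsampling step of such an experiment might look like the sketch below: one labelled cluster is reduced to a small fraction while all other cells are kept. The function name, the `keep_fraction` parameter, and the toy labels are hypothetical. The paper's actual protocol may differ.

```python
import random


def downsample_cluster(labels, target_label, keep_fraction, seed=0):
    """Return sorted indices of cells to keep.

    All cells outside target_label are kept; within target_label only a
    random keep_fraction of the cells (at least one) is retained.
    """
    rng = random.Random(seed)
    target_idx = [i for i, lab in enumerate(labels) if lab == target_label]
    other_idx = [i for i, lab in enumerate(labels) if lab != target_label]
    n_keep = max(1, round(len(target_idx) * keep_fraction)) if target_idx else 0
    kept_target = rng.sample(target_idx, n_keep)
    return sorted(other_idx + kept_target)


# Toy labels: 100 cells of a "rare" cluster and 900 cells of another cluster.
toy_labels = ["rare"] * 100 + ["common"] * 900
kept = downsample_cluster(toy_labels, "rare", keep_fraction=0.1)
print(len(kept))
```

Because the cluster labels are supplied up front, this setup is supervised, which is the source of the circularity concern raised above.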

Biological Relevance

The ability to generate samples with the provided architectures has many potential biological applications including the study of rare cell types and the ability to improve classification systems. It remains to be seen if additional applications with more biological focus can take advantage of the cscGAN’s abilities. There were many metrics considered in model evaluation, but there were no explicit examples of specific gene expression programs that the GAN latent space may be learning. The t-SNE projections may be rescuing some differences between the real and simulated data.

The methods of the paper were very well described 😄, but there is no reference to publicly available source code ☹️ .

pierremac commented 5 years ago

Hey Greg (and Greene Lab),

I'm Pierre Machart, one of the authors of the manuscript. Thanks a lot for your interest in our work! This is a good summary. Just in case it wasn't clear, our (non-conditional) scGAN also uses Batch Normalization (just not the conditional version of it). That being said, it also works very well without it... (It is essential in the conditional version, as it is the sole mechanism to condition the generation of cells.) The idea of using augmentation to allow smaller, previously unidentified populations to be clustered out separately is very interesting (we intend to cover that in future work). However, the way we evaluated the augmentation is indeed fully supervised. We do provide the cluster information a priori (the clusters are just the outcome of the "classic" Louvain clustering in our case). We just show that the ability of a supervised classifier is enhanced by augmenting smaller populations. I hope that clarifies things. There indeed isn't any public code yet, but it will definitely happen soon. I will keep you posted about that and our future work (including the exploration of additional applications) if you're interested.

Let me know if you have questions. We are always happy to receive attention and feedback.

gwaybio commented 5 years ago

Thanks for the response @pierremac - looking forward to the source code and future iterations of the paper! It seems to me that some pretty nifty latent space manipulations can happen fairly quickly with a well-trained model.

pierremac commented 5 years ago

One of the many things we intend to look into, indeed! We also just came across this very recent manuscript: https://www.biorxiv.org/content/early/2018/07/30/262501 Their aim is not the realistic simulation of cells, but it is closely related to our work, and you may appreciate that they put a bit more emphasis on latent space manipulations!

gwaybio commented 5 years ago

Fantastic, thanks for pointing it out. Want to summarize in a new issue? Contributions are welcome :)

Edit: linked with #909

pierremac commented 5 years ago

The source code is finally available. Please find it here: https://github.com/imsb-uke/scGAN We'll be very happy to get your feedback!