pierremac opened 5 years ago
The authors train a GAN on scRNA-seq data (epidermal, neural, and hematopoietic cells from different experiments) and show that the simulated cells overlap with the real ones in a t-SNE visualization. They then use the latent representation to simulate cellular perturbations (basal to differentiated cells, in their experiments) by interpolating between two points in the latent space and generating the corresponding cells. They perform a sensitivity analysis on the discriminator network and use the result to identify the marker genes that are most relevant to the GAN representation, highlighting that some of the identified genes are already known markers in the literature. They also use the weights of the last layer of the generator network to study the linear dependencies (correlations) expressed in the GAN representation, highlighting some biologically relevant dependencies that are not uncovered by classical expression analysis methods. Finally, they use the features learned in the (single) hidden layer of the critic network as a dimensionality reduction technique and show that, surprisingly, those features appear invariant to batch effects while still preserving interesting biological properties (for instance, different cell types cluster separately in a t-SNE computed on top of those "GAN critic features").
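The perturbation trick above is just linear interpolation between two latent codes, with each intermediate code decoded by the trained generator. A minimal sketch (the function name and step count are mine, not the paper's; only the latent dimension of 100 comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_latent(z_basal, z_diff, n_steps=10):
    """Linearly interpolate between two latent codes.

    Each intermediate code would be fed to the trained generator
    to produce expression profiles along the simulated trajectory.
    """
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1 - a) * z_basal + a * z_diff for a in alphas])

# Toy latent codes of dimension 100 (the paper's latent size)
z_basal = rng.normal(size=100)
z_diff = rng.normal(size=100)
path = interpolate_latent(z_basal, z_diff, n_steps=10)
# path[0] is z_basal, path[-1] is z_diff; the rest lie on the segment
```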
In the paper, they describe a Wasserstein GAN (with a gradient penalty term to enforce the Lipschitz constraint) built from fully connected networks with a single hidden layer (600 neurons for the generator, 200 for the critic) and a latent space of dimension 100. They use Leaky ReLU activations (with a 0.2 slope). Interestingly, they use an additive mixture of a Gaussian and a Poisson distribution for the latent noise. They optimize the GAN with RMSProp and a batch size of 32. However, they also provide a link to a git repo that does not match those specifications (it contains a classic GAN with 2 hidden layers in the generator, trained with a modified version of Adam to stabilize training). I think that implementation is outdated and has not been updated to match this version of their paper.
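To make the architecture concrete, here is a numpy-only sketch of the noise prior and the generator forward pass under those specifications. The sizes (latent 100, hidden 600, LeakyReLU slope 0.2, batch 32) are from the paper; the Poisson rate, weight initialization scale, and gene count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 100  # latent space size from the paper
GEN_HIDDEN = 600  # generator hidden units from the paper

def sample_latent(batch_size, lam=1.0):
    """Additive Gaussian + Poisson latent noise.

    The paper states the prior is an additive mixture of the two;
    the Poisson rate lam=1.0 here is an assumption.
    """
    gauss = rng.normal(size=(batch_size, LATENT_DIM))
    pois = rng.poisson(lam=lam, size=(batch_size, LATENT_DIM))
    return gauss + pois

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def generator_forward(z, params):
    """Single-hidden-layer fully connected generator (sketch)."""
    h = leaky_relu(z @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]  # simulated expression profiles

n_genes = 1000  # hypothetical gene count, not from the paper
params = {
    "W1": rng.normal(scale=0.02, size=(LATENT_DIM, GEN_HIDDEN)),
    "b1": np.zeros(GEN_HIDDEN),
    "W2": rng.normal(scale=0.02, size=(GEN_HIDDEN, n_genes)),
    "b2": np.zeros(n_genes),
}
z = sample_latent(32)  # batch size 32, as in the paper
fake_cells = generator_forward(z, params)  # shape (32, n_genes)
```

The critic would be the mirror image (a 200-unit hidden layer mapping to a scalar score), trained with the WGAN-GP objective under RMSProp; I omit the training loop for brevity.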
I'm a bit critical of some claims in the paper. For instance, they claim in the abstract that "In contrast to many machine-learning approaches, we are able to interpret internal parameters in a biologically meaningful manner". To my understanding, this refers to their sensitivity analysis, which does give some relevant insights, but it is also fairly rough and is not guaranteed to capture the most important features. I also disagree with their claim that the reason their simulated cells don't represent the full variability of the real cells is that the generator uses continuous inputs while the gene expression distributions are discrete. This is probably a minor detail, but it suggests there may be some over-statements in the manuscript. I very much like the part about differentiation. However, it also contains what I think is the main technical weakness of the paper: it is not possible to directly map a cell to a point in the latent space with a GAN. To overcome this limitation, they randomly simulate cells until they find one that is similar enough to the cell they wanted to map (and then use the coordinates of the corresponding latent code as the mapping). Overall, though, it is a very interesting and stimulating paper in my opinion.
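For clarity, the workaround I describe amounts to rejection/random search over latent codes: sample many codes, generate a cell from each, and keep the code whose output is closest to the target. A minimal sketch with a toy linear generator (function names, sample count, and the Euclidean distance metric are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def map_cell_to_latent(cell, generator, latent_dim=100,
                       n_samples=2000, metric=None):
    """Approximate a GAN 'encoder' by random search (sketch).

    A vanilla GAN has no inverse mapping, so we sample latent
    codes and keep the one whose generated cell is closest to
    the target cell.
    """
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)
    best_z, best_d = None, np.inf
    for _ in range(n_samples):
        z = rng.normal(size=latent_dim)
        d = metric(generator(z), cell)
        if d < best_d:
            best_z, best_d = z, d
    return best_z, best_d

# Toy linear "generator" standing in for the trained network
W = rng.normal(size=(100, 50))
generator = lambda z: z @ W
target = generator(rng.normal(size=100))  # a cell we want to map
z_hat, dist = map_cell_to_latent(target, generator)
```

This is cheap to implement but gives no guarantee on approximation quality, which is why I consider it the paper's main technical weakness; an encoder-based model (e.g. a VAE or a bidirectional GAN) would give a direct mapping.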
Thanks for the contribution @pierremac. I edited the original post to use the DOI link https://doi.org/10.1101/262501, which makes it easier for us to add citations with Manubot.
Alright, let me know if some other adjustments are required!
https://doi.org/10.1101/262501