poseidonchan / TAPE

Deep learning-based tissue compositions and cell-type-specific gene expression analysis with tissue-adaptive autoencoder (TAPE)
https://sctape.readthedocs.io/
GNU General Public License v3.0

Accessing signature matrix when adaptive = False? #6

Open fojackson8 opened 1 year ago

fojackson8 commented 1 year ago

Hi, thanks for the nice paper and code.

As I understand it, the decoder function reconstructs the bulk RNA-seq input B from X (the predicted cell fractions). The learnt weights of the decoder function are then taken as the GEPs, represented in the signature matrix S in the following equation:

$$X \cdot S = B$$

If this is the case, we should be able to access these GEPs on the simulated data even when not in adaptive mode. That is, even if we take simulated bulk RNA-seq as input data and reconstruct it, we should be able to get the GEPs, right?
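To make the shapes in that equation concrete, here is a minimal NumPy sketch of the linear mixing model. It is not TAPE's code: the sizes, the random data and the least-squares step are all illustrative assumptions. It only shows that, when the fractions X are known (as they are for simulated bulk), the relation B = X S alone determines S.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (not TAPE's real dimensions).
n_samples, n_celltypes, n_genes = 20, 5, 8

# X: cell-type fractions, one row per simulated bulk sample (rows sum to 1).
X = rng.dirichlet(np.ones(n_celltypes), size=n_samples)

# S: signature matrix, one gene expression profile (GEP) per cell type.
S = rng.gamma(shape=2.0, scale=1.0, size=(n_celltypes, n_genes))

# B: simulated bulk expression under the noise-free mixing model.
B = X @ S

# With X known and more samples than cell types, S follows from least squares.
S_hat, *_ = np.linalg.lstsq(X, B, rcond=None)

print(B.shape, S_hat.shape)
print(np.allclose(S_hat, S))
```

In the autoencoder the decoder weights play the role of S and are learned by gradient descent rather than solved for directly, so the learned matrix is only an approximation.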

There seems to be no option in the codebase to get Sigm when not in adaptive phase.

Happy to be corrected if I've misunderstood, or alternatively if I've missed something in the code?

poseidonchan commented 1 year ago

Hi,

You're right, I did not provide access to it because I don't think it is useful. If we want the signature matrix, why don't we just group the single-cell dataframe by cell type? The learned one is only an approximation of the signature matrix. Alternatively, you can obtain the learned one by modifying the code, for example by calling model.sigmatrix() in the prediction step to obtain S. It should be quick and easy.
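As a rough sketch of the "group by cell type" suggestion, assuming the single-cell reference is a pandas DataFrame with one row per cell, one column per gene, and a label column (the column name `celltype` and the toy values are hypothetical):

```python
import pandas as pd

# Hypothetical toy reference: rows are cells, columns are genes plus a label.
sc = pd.DataFrame(
    {
        "geneA": [1.0, 3.0, 10.0, 14.0],
        "geneB": [0.0, 2.0, 5.0, 7.0],
        "celltype": ["T", "T", "B", "B"],
    }
)

# Average the expression of all cells of each type: one GEP per cell type.
signature = sc.groupby("celltype").mean()

print(signature)
```

The result has one row per cell type and one column per gene, which is the same layout as the (cell types x genes) matrix S discussed above.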

fojackson8 commented 1 year ago

Thanks for the response, that's very helpful. So model.sigmatrix() returns a tensor of shape (5, 16793) on the example data, so I assume these are the 5 underlying cell type GEPs? If I've understood correctly, this gives the same output as sigm when calling predict with adaptive=True.

The reason we might need the signature matrix even when adaptive=False is that it helps to validate the performance/accuracy of the autoencoder model. If the reconstructed GEPs are not similar to the underlying GEPs used as input to the simulated mixing (even when adaptive=False), then perhaps this reduces confidence in the reconstruction process and the inferred cell fractions?
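One way to run that check, sketched here under assumptions: both matrices are NumPy arrays of shape (cell types x genes) with rows in the same cell-type order, and Pearson correlation is an acceptable similarity measure. The helper name `per_celltype_correlation` and the synthetic data are hypothetical. A torch tensor such as the one returned by model.sigmatrix() would first need converting, for example with `.detach().cpu().numpy()`.

```python
import numpy as np

def per_celltype_correlation(learned, reference):
    """Pearson correlation between matching rows of two (cell types x genes) arrays."""
    learned = np.asarray(learned, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if learned.shape != reference.shape:
        raise ValueError("learned and reference must have the same shape")
    # Centre each row, then take the normalised dot product row by row.
    lc = learned - learned.mean(axis=1, keepdims=True)
    rc = reference - reference.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(lc, axis=1) * np.linalg.norm(rc, axis=1)
    return (lc * rc).sum(axis=1) / denom

# Synthetic stand-ins: a "reference" signature and a noisy "learned" copy of it.
rng = np.random.default_rng(0)
reference = rng.gamma(2.0, 1.0, size=(5, 200))
learned = reference + rng.normal(0.0, 0.1, size=reference.shape)

corr = per_celltype_correlation(learned, reference)
print(corr)
```

Low correlation for a cell type would flag a learned GEP that has drifted from the reference. What threshold counts as "low" is a judgment call that this sketch does not settle.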

I think this raises an important related question: say we run this model on new bulk RNA-seq data to get the underlying cell fractions. If the inferred GEPs taken from the decoder model weights do not correspond to any signatures from the single-cell RNA-seq reference, how do we interpret this? Does this make the inferred cell fractions unreliable?

poseidonchan commented 1 year ago

It's a good point. I think this situation would happen if the model is under-fitting. Ideally, if the model is trained well on the simulated datasets, the learned GEPs should be close to the real GEPs from the single-cell dataset. Generally, the big issue in deconvolution is that the chosen reference can be very different from the bulk. We cannot evaluate that distance / batch effect by evaluating how well the model is trained. For batch correction in the deconvolution problem, I suggest you take a look at the solution in CIBERSORTx, which uses B mode / S mode to minimize the distance between reference and bulk before deconvolution.

fojackson8 commented 1 year ago

OK, thanks. Agreed, we cannot evaluate the distance between the reference scRNA-seq and any new bulk RNA-seq we want to deconvolute. For any new bulk RNA-seq we also don't know the "true" proportions in the mixture, so we just have to hope the estimate is accurate. I guess the best way of evaluating how reliable the estimated cell type proportions are in new bulk RNA-seq data is whether or not the reconstructed GEPs correspond to real GEPs?

Question 2: when you apply the trained model to deconvolute real data, you first train the model on simulated data. What are the dimensions of the simulated data which you use to train the model? How many genes and samples, and what is the approximate training time?