jsxlei / SCALE

Single-cell ATAC-seq analysis via Latent feature Extraction
MIT License

Question about gamma in scale_ELBOW #25

Open felix-yuxiang opened 1 year ago

felix-yuxiang commented 1 year ago

Hi, I am trying to apply this to the Pancancer dataset (sequencing-based gene expression) from the ICGC portal, but the results were not good: the data points from donors with different cancers are entangled with each other.

As I went through your code, I noticed

    logpzc = -0.5*torch.sum(gamma*torch.sum(math.log(2*math.pi) + \
                                           torch.log(var_c) + \
                                           torch.exp(logvar_expand)/var_c + \
                                           (mu_expand-mu_c)**2/var_c, dim=1), dim=1)

    # log p(c)
    logpc = torch.sum(gamma*torch.log(pi), 1)

    # log q(z|x) or q entropy    
    qentropy = -0.5*torch.sum(1+logvar+math.log(2*math.pi), 1)

    # log q(c|x)
    logqcx = torch.sum(gamma*torch.log(gamma), 1)

and I know gamma = q(c|x) = p(c|z). Could you show me how to get from here to logpc and logqcx? I would not have used gamma in either case, though.

jsxlei commented 1 year ago

SCALE is designed for sparse scATAC-seq datasets, but it can also run on scRNA-seq data if you add the option "--log_transform" to your command. However, I highly recommend you use SCALEX (https://github.com/jsxlei/SCALEX), an updated version designed especially for heterogeneous scRNA-seq. If neither solves your problem, could you provide more details about your data, e.g. shape, batch, and how you ran the program?

As for gamma: gamma is p(c|z), so logpc is shorthand for the expectation of log p(c) over q(z,c|x), and the same holds for logqcx. E.g. E_{q(z,c|x)}[log p(c)] =

[screenshot: equation derivation]

while E_{q(c|x)}[log q(c|x)] =

[screenshot: equation derivation]
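A quick numeric sanity check of the identity above, using hypothetical toy tensors: the closed-form term SCALE computes, `sum(gamma * log(pi), dim=1)`, should match a Monte Carlo estimate of E_{q(c|x)}[log p(c)] obtained by sampling c from q(c|x).

```python
import torch

torch.manual_seed(0)

# Hypothetical toy values: 4 cells, 3 mixture components.
gamma = torch.softmax(torch.randn(4, 3), dim=1)   # q(c|x); each row sums to 1
pi = torch.tensor([0.5, 0.3, 0.2])                # prior p(c)

# Closed form used in SCALE: E_{q(c|x)}[log p(c)] = sum_c gamma_c * log(pi_c)
logpc = torch.sum(gamma * torch.log(pi), dim=1)

# Monte Carlo estimate of the same expectation: sample c ~ q(c|x),
# then average log pi_c over the samples.
samples = torch.distributions.Categorical(probs=gamma).sample((200_000,))
mc = torch.log(pi)[samples].mean(dim=0)

print(torch.allclose(logpc, mc, atol=1e-2))  # True: the two agree
```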

Hope this helps!

felix-yuxiang commented 1 year ago

Thank you for your reply!

The dataset I use is a subset of the bulk RNA-seq data from the ICGC portal (6 different cancer types). I ran the program with n_centroids=6. When I visualize the data encoded in the latent space, I find no obvious clusters for patients with different cancers, even though a vanilla autoencoder clusters the points nicely.

Here is my experimental setup

    binary = False
    latent = 128
    encode_dim = [2048, 512]
    decode_dim = [512, 2048]
    input_dim = 20502

P.S. I understand the equation now, but I think the two math derivation pictures you posted are identical.

jsxlei commented 1 year ago

I think the reason is that the latent dimension you chose is quite large. Just use the default parameters plus --log_transform; that should work. At the same time, you can also try SCALEX with its default parameters. Let me know whether this solves your problem.

Sorry for posting two identical pictures. The equation for the first term is attached here.

[screenshot: equation for the first term]

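For readers for whom the image does not render, the first term can be transcribed directly from the `logpzc` expression in the code above (with μ, σ² = exp(logvar) the encoder outputs, μ_c, σ_c² the mixture-component parameters, and j indexing latent dimensions):

```latex
\mathbb{E}_{q(z,c|x)}\!\left[\log p(z|c)\right]
= -\frac{1}{2}\sum_{c}\gamma_c \sum_{j}\left(
    \log 2\pi + \log\sigma_{c,j}^2
    + \frac{\sigma_j^2}{\sigma_{c,j}^2}
    + \frac{(\mu_j-\mu_{c,j})^2}{\sigma_{c,j}^2}
  \right)
```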
felix-yuxiang commented 1 year ago

I tried reducing the latent dimension, but no luck; the clustering result still looks pretty bad. Do you suggest I try the default dims in the code (i.e. latent = 10, encode_dim = [1024, 128], decode_dim = [])?

[screenshot: training loss curves]

Another concern is that the kl_loss is negative during training, as shown above. I use MSE loss for the reconstruction loss.

In addition, I think the --log_transform argument is never actually used in the code in __init__.py. I manually log-transformed my data with X = np.log(X+1), but the recon_loss and kl_loss go to inf/nan after a couple of epochs.

jsxlei commented 1 year ago

Why is the recon_loss 0? That does not make sense. I think you should check your data first: SCALE takes a count matrix without any normalization as input. Providing the additional information printed by SCALE would also help with debugging.

felix-yuxiang commented 1 year ago

Thank you for your quick and kind reply!

I will provide more info here. Since I am using bulk sequencing data, I cloned this repo and extracted the few methods that I need.

[screenshot: data preview]

This is a glimpse of my data; all values are raw counts and there are no NaNs.

Here is a detailed version of my code: https://codefile.io/f/n8MiA0yfk2J4wDI6qNmh. Now the recon_loss and kl_loss go to inf/nan quickly. Could you shed some light on how to resolve this issue?

jsxlei commented 1 year ago

I just saw that the max count in your data is 540254 and the smallest is 0 or 2, so the numerical range is huge. You had better apply a log1p transformation to scale the numbers, since SCALE assumes the input is binary. Another thing: make sure the sample-by-feature matrix is oriented correctly, so that each sample, not each feature, is one input.
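A minimal sketch of the log1p scaling suggested above, using a hypothetical toy matrix built from the counts mentioned in this thread:

```python
import numpy as np

# Hypothetical toy matrix standing in for the count data discussed above:
# rows are samples, columns are features (genes).
X = np.array([[540254.0, 0.0, 2.0],
              [12.0, 7.0, 99.0]])

# log1p compresses the huge dynamic range and maps 0 -> 0 safely
# (np.log(X) would give -inf for zero counts, which can drive the
# recon_loss and kl_loss to inf/nan during training).
X_log = np.log1p(X)

print(round(float(X_log.max()), 1))  # 13.2 instead of 540254
print(bool(np.isfinite(X_log).all()))  # True: no -inf/nan introduced
```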
