simslab / scHPF

Single-cell Hierarchical Poisson Factorization
BSD 2-Clause "Simplified" License

Preprocessing of scHPF #14

Open AstreChen opened 4 years ago

AstreChen commented 4 years ago

Thanks for the useful tool, I'm trying to use it to find the correlated functional modules which distinguish the cell clusters.

As for the "ranked_genes" output from scHPF, how are these genes ranked? Are the genes at the top of the same module more correlated with each other? I noticed that the expression distributions of the top genes in the same module vary a lot (see the figure). I don't know why. Maybe it's because I input the gene matrix without any normalization or dropout correction? I just passed the raw UMI counts, without normalization, to scHPF prep. Do normalization or dropout have a big effect on the ranked genes?

It would be great if you could share more guidance on how to prepare the input data and how to use the results to explore the features of cell clusters (e.g., co-expression or something else).

Thanks a lot!

hannaml commented 4 years ago

Using the raw UMI matrix as you described is the correct input for scHPF. The "ranked_genes" file orders genes by their per-factor gene scores, which you can think of as a normalized factor loading.
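To make the ranking idea concrete, here is a minimal sketch of ordering genes by a per-factor score. It assumes you already have a genes-by-factors score matrix loaded as a NumPy array; the function name `rank_genes_by_factor` and the toy values are hypothetical and are not part of the scHPF API.

```python
import numpy as np

def rank_genes_by_factor(gene_scores, gene_names):
    """Return gene names sorted by descending score, one column per factor.

    gene_scores: (n_genes, n_factors) array of scores (hypothetical input).
    gene_names:  sequence of n_genes gene symbols.
    """
    gene_scores = np.asarray(gene_scores, dtype=float)
    gene_names = np.asarray(gene_names)
    # Sorting the negated scores ascending yields descending order per factor.
    order = np.argsort(-gene_scores, axis=0, kind="stable")
    return gene_names[order]  # shape (n_genes, n_factors)

# Toy example: 3 genes, 2 factors (values are made up for illustration).
scores = np.array([[5.0, 0.1],
                   [0.2, 3.0],
                   [1.5, 2.0]])
names = ["FOS", "GENE_B", "GENE_C"]
ranked = rank_genes_by_factor(scores, names)
print(ranked)
```

Each column of `ranked` lists the genes for one factor from highest to lowest score, which is the layout the paragraph above describes for the "ranked_genes" file.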

To my eye, the genes you posted do seem to be fairly co-localized given their difference in average expression level. FOS has much higher mean expression than the other genes shown, and accordingly has lower dropout and is observed in more cells. Thus, even if the more lowly expressed genes were expressed in exactly the same cells as FOS, we would be far less likely to capture and reverse transcribe them. For example, I bet if you downsampled FOS to have the same total # of molecules as RASD1, the expression distributions would look fairly similar. Supporting this, SNAI1, ID1, ADM, and RASD1 all have a similar range of expression, and seem to co-localize. ENKUR is expressed so lowly, and in so few cells, that it is hard to say much about whether it's correlated with anything else.
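The downsampling comparison could be sketched roughly as follows. This is only an illustration on simulated counts: the Poisson rates are invented, the helper `downsample_to_total` is hypothetical, and it draws molecules without replacement using NumPy's `Generator.multivariate_hypergeometric` so the downsampled gene ends up with exactly the target total.

```python
import numpy as np

def downsample_to_total(counts, target_total, rng):
    """Randomly keep `target_total` molecules from a per-cell count vector.

    Molecules are drawn without replacement, so each cell's downsampled
    count never exceeds its original count.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if target_total >= counts.sum():
        return counts.copy()  # nothing to remove
    return rng.multivariate_hypergeometric(counts, int(target_total))

rng = np.random.default_rng(0)
# Simulated stand-ins for a highly and a lowly expressed gene (not real data).
fos = rng.poisson(5.0, size=1000)
rasd1 = rng.poisson(0.3, size=1000)

fos_down = downsample_to_total(fos, rasd1.sum(), rng)

# Compare the fraction of cells with zero observed molecules.
for label, vec in [("FOS", fos), ("FOS downsampled", fos_down), ("RASD1", rasd1)]:
    print(label, "fraction of zero cells:", np.mean(vec == 0))
```

With real data you would replace the simulated vectors with the two genes' rows from the raw UMI matrix and then plot the downsampled distribution next to the lowly expressed gene.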

Another important point is that some genes may have high scores for multiple factors, leading to different patterns of overall expression for genes highly ranked in the same factor. This agrees with our understanding of expression programs, as some genes may be involved in multiple biological processes, while others are more restricted. FOS is actually a great example of this, as it is an immediate early transcription factor that may be induced by stress but also plays essential roles in proliferation and differentiation for some cell types.
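One way to check which genes score highly in more than one factor is to count, for each gene, the number of factors in which it lands in the top N. The sketch below assumes a genes-by-factors score matrix; the function name `factors_per_gene`, the cutoff `top_n`, and the toy numbers are all hypothetical.

```python
import numpy as np

def factors_per_gene(gene_scores, top_n):
    """Count the factors in which each gene ranks within the top `top_n`."""
    gene_scores = np.asarray(gene_scores, dtype=float)
    # Double argsort converts scores to ranks; rank 0 is the highest score
    # within each factor (column).
    order = np.argsort(-gene_scores, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    return (ranks < top_n).sum(axis=1)

# Toy example: 4 genes, 3 factors (values are made up for illustration).
names = ["FOS", "GENE_A", "GENE_B", "GENE_C"]
scores = np.array([[9.0, 8.0, 0.1],
                   [1.0, 0.5, 7.0],
                   [5.0, 0.2, 0.3],
                   [0.1, 6.0, 0.2]])
n_factors_top = factors_per_gene(scores, top_n=2)
for name, n in zip(names, n_factors_top):
    print(name, "is in the top 2 of", n, "factor(s)")
```

Genes with a count above one are candidates for the "involved in multiple programs" case, and their overall expression pattern may not match any single factor.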

AstreChen commented 4 years ago

Thanks a lot for your detailed explanation. I get your point. Downsampling sounds like a great way to deal with problems like dropout. To my eye, if two genes differ hugely in dropout rate, for example FOS and ENKUR, I would guess that ENKUR has lower expression than FOS, that is, they are not similar to each other. However, in my ranked gene list, ENKUR is ranked 5th in the "FOS" factor. Maybe I need to remove these lowly expressed genes first to avoid ambiguous results like this? I ask because I want to use the top genes (which I expect to be co-expressed with each other) to define a functional module, like Fig. 4a in Peter A. Szabo et al., 2019.

Thank you again for your careful explanation.
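The gene-filtering idea raised in this comment could look something like the sketch below, which drops genes detected in too few cells before any factorization. Nothing in this thread confirms that filtering is the recommended fix; the helper `filter_genes_by_detection` and the threshold `min_cells` are hypothetical, and a sensible cutoff would depend on the dataset.

```python
import numpy as np

def filter_genes_by_detection(counts, gene_names, min_cells):
    """Keep genes with a nonzero count in at least `min_cells` cells.

    counts: (n_genes, n_cells) raw UMI count matrix.
    Returns the filtered matrix and the list of retained gene names.
    """
    counts = np.asarray(counts)
    n_detected = (counts > 0).sum(axis=1)  # cells in which each gene is seen
    keep = n_detected >= min_cells
    kept_names = [g for g, k in zip(gene_names, keep) if k]
    return counts[keep], kept_names

# Toy example: 3 genes, 4 cells (values are made up for illustration).
counts = np.array([[3, 0, 5, 2],
                   [0, 0, 1, 0],
                   [1, 1, 0, 0]])
names = ["FOS", "ENKUR", "GENE_C"]
filtered, kept = filter_genes_by_detection(counts, names, min_cells=2)
print(kept)
```

The filtered matrix keeps raw integer counts, so it remains the kind of unnormalized input described earlier in the thread.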