Availability of data used by code?

jamesnemesh commented 3 years ago

Hi! Excellent paper with very thoughtful consideration of how to leverage latent factor analysis to understand how factors map onto biological pathways.

Our lab is very interested in reproducing some of your methodology. The code as given is very helpful in understanding some of the more fine-grained details from the manuscript, but at times the R code is hard to interpret because it reads in RDS-serialized data whose structure other people can't see. For example, if I wanted to better understand assignLoadings.R, having the files in RA_pipeline would make it much easier to step through your code and understand the section of the methods that says "each pathway activity was set as the response variable in a regression setting where the cluster labels function as the predictor". I'm guessing you're actually regressing against the median latent factor scores for the cell label (or similar), but having the data structure you're loading in would let me understand your methods far more completely.

Would it be possible to release some of the data that's loaded in by the scripts, at least in cases where the processed data was generated by you rather than being primary data you downloaded from other labs? (Of course, I'd expect to download the latter myself if I wanted to reproduce that part of the analysis.)

Thanks for your attention.

giovp commented 3 years ago

Hi @jamesnemesh ,

thanks for the interest in the analysis! I doubt that the processed data will be made available, but you can find all the preprocessing here: https://github.com/giovp/latent_factors_autoimmune/tree/master/src/preprocessing (you'll notice it's just standard SingleCellExperiment normalization + clustering).

> I'm guessing you're actually regressing against the median latent factor scores for the cell label (or similar), but having the data structure you're loading in would let me understand your methods far more completely.

yes pretty much. Let me be more specific:

the regression model is the following: https://github.com/giovp/latent_factors_autoimmune/blob/e09adf98afc5f1323bf67457b041acc840e74f23/src/assignLoadings/assignLoadings.R#L101 where y is the pathway activity score, and cluster is the cluster label (handled as categorical internally).
the pathway activity y is not just one latent factor, but the aggregated median of n of them, which we find by clustering the loadings. The first cartoon in figure 2 makes this clear: several factors, since they share correlated weights, are median-aggregated into a single "factor", which we call the pathway activity. The step that clusters the factors into pathway activities was a bit heuristic, see https://github.com/giovp/latent_factors_autoimmune/blob/e09adf98afc5f1323bf67457b041acc840e74f23/src/assignLoadings/assignLoadings.R#L65 and https://github.com/giovp/latent_factors_autoimmune/blob/e09adf98afc5f1323bf67457b041acc840e74f23/src/assignLoadings/utils.R#L51 I think there could be a better way to do it.
We then take the coefficients of the fitted model: https://github.com/giovp/latent_factors_autoimmune/blob/e09adf98afc5f1323bf67457b041acc840e74f23/src/assignLoadings/assignLoadings.R#L102 and plot those in the heatmap.
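
For readers without the RDS files, the three steps above can be sketched on toy data. This is an illustration only, written in Python rather than the repository's R, and it is not the code from assignLoadings.R. It assumes ordinary least squares with treatment (reference-level) coding of the cluster label, which is what R's default handling of a factor predictor would give; whether the linked script does exactly this is an assumption. The function names `pathway_activity` and `cluster_coefficients` and all the numbers are hypothetical.

```python
from statistics import mean, median


def pathway_activity(factor_scores, factor_group):
    """Median-aggregate a group of latent factors into one score per cell.

    factor_scores: dict mapping factor name -> list of per-cell scores
    factor_group:  names of the factors that were clustered together
    """
    columns = [factor_scores[name] for name in factor_group]
    # zip(*columns) yields, for each cell, its scores across the grouped factors
    return [median(cell_values) for cell_values in zip(*columns)]


def cluster_coefficients(y, clusters):
    """Least-squares fit of y on a single categorical predictor.

    With treatment coding the solution has a closed form: the intercept is
    the mean of the reference level, and every other coefficient is that
    level's mean minus the reference mean.
    """
    levels = sorted(set(clusters))  # first level acts as the reference
    means = {
        lvl: mean(v for v, c in zip(y, clusters) if c == lvl) for lvl in levels
    }
    reference = levels[0]
    coefs = {"(Intercept)": means[reference]}
    for lvl in levels[1:]:
        coefs[lvl] = means[lvl] - means[reference]
    return coefs


# Toy data (made up): three correlated factors scored on four cells
toy_scores = {
    "F1": [0.0, 1.0, 2.0, 4.0],
    "F2": [1.0, 1.0, 3.0, 5.0],
    "F3": [2.0, 4.0, 4.0, 9.0],
}
toy_clusters = ["B", "B", "T", "T"]

y = pathway_activity(toy_scores, ["F1", "F2", "F3"])
coefs = cluster_coefficients(y, toy_clusters)
print(y)
print(coefs)
```

In this sketch the per-cluster coefficients are the values that would go into one row of the heatmap, one row per pathway activity.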

I should mention that similar ideas have been explored by https://elifesciences.org/articles/43803 where they also adopted a similar aggregation strategy (although across iterations and not across factors).

Hope this is clear, happy to answer any other question!

Best, Giovanni

jamesnemesh commented 3 years ago

That's super helpful, thank you for getting back to me so quickly! OK, it really was what was described in the paper - using categorical labels as predictors of the pathway activity, which for some reason I thought was "too simple", but makes sense. The clarification is great, and that additional reference is appreciated!

giovp commented 3 years ago

no problem at all, happy to help! indeed it's a very simple approach (maybe too simple?). I'd argue that since it boils down to just a regression on the pathway activity, more powerful ideas revolving around GLMs could be used, e.g. including additional covariates, or other likelihoods, etc.
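
As a rough illustration of the "additional covariates" idea, here is a minimal sketch that adds one numeric covariate next to the cluster dummies and fits the model by least squares. It is Python rather than the repository's R, it is not part of the published analysis, and it keeps a Gaussian likelihood; other likelihoods would need an iterative fit that is not shown. The covariate, the helper names and the toy numbers are all hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    No handling of singular systems: a collinear design will fail here.
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot spot
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def design_matrix(clusters, covariate):
    """Intercept, treatment-coded cluster dummies, then the numeric covariate."""
    levels = sorted(set(clusters))  # first level acts as the reference
    rows = []
    for c, z in zip(clusters, covariate):
        dummies = [1.0 if c == lvl else 0.0 for lvl in levels[1:]]
        rows.append([1.0] + dummies + [z])
    names = ["(Intercept)"] + levels[1:] + ["covariate"]
    return names, rows


def fit_ols(design, y):
    """Least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data (made up), constructed so that y = 1 + 3 * [cluster == "T"] + 2 * z
toy_clusters = ["B", "B", "B", "T", "T", "T"]
toy_covariate = [0.0, 1.0, 2.0, 0.0, 1.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 4.0, 6.0, 10.0]

names, X = design_matrix(toy_clusters, toy_covariate)
beta = fit_ols(X, toy_y)
for name, value in zip(names, beta):
    print(name, round(value, 6))
```

The cluster coefficient is now the cluster effect after adjusting for the covariate, which is the quantity one would plot instead of the unadjusted difference in means.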
