gqi / DAESC

10 stars 1 forks source link

Specifying the design matrix for differential ASE #1

Closed kimkh415 closed 1 year ago

kimkh415 commented 1 year ago

Hello DAESC dev team!

First of all, thank you for developing such flexible framework to test for allelic imbalance.

I am trying to incorporate DAESC into my analysis pipeline that tests for differential ASE across multiple individuals accounting for the disease status (condition), spatial location (cortical layers) and cell type.

To begin with, I applied DAESC on a toy dataset from one gene, two conditions, and two individuals per condition. Here, I fit the baseline model (DAESC-BB) specifying the design matrix x as a binary numeric array denoting the condition.

Now, I want to extend this by also taking into account the spatial information and the interaction terms. Since both condition and spatial location as in cortical layers are categorical variables, I tried using model.matrix function to encode the information in one-hot matrix. Here is the structure of the data frame I am working with and how I am invoking the daesc_bb function.

str(cur.df) 'data.frame': 3046 obs. of 8 variables: $ gene : chr "HES6" "HES6" "HES6" "HES6" ... $ barcode : chr "CTCTCTAACTGCCTAG" "TCGGCGTACTGCACAA" "GGGCAGGATTTCTGTG" "CGGTTCCGGCTTCTTG" ... $ allele1_count: num 1 1 1 1 1 1 0 3 0 1 ... $ allele2_count: num 1 0 0 0 1 1 1 0 1 0 ... $ total_count : num 2 1 1 1 2 2 1 3 1 1 ... $ condition : Factor w/ 2 levels "HC","PD": 1 1 1 1 1 1 1 1 1 1 ... $ sample_id : chr "BN0339" "BN0339" "BN0339" "BN0339" ... $ layer : Factor w/ 7 levels "Layer 1","Layer 2",..: 7 4 7 4 5 3 6 6 5 6 ...

myformula = ~ cur.df$condition + cur.df$layer + cur.df$condition:cur.df$layer + 0 one_hot = model.matrix(myformula)

str(one_hot) num [1:3046, 1:14] 1 1 1 1 1 1 1 1 1 1 ... - attr(, "dimnames")=List of 2 ..$ : chr [1:3046] "1" "2" "3" "4" ... ..$ : chr [1:14] "cur.df$conditionHC" "cur.df$conditionPD" "cur.df$layerLayer 2" "cur.df$layerLayer 3" ... - attr(, "assign")= int [1:14] 1 1 2 2 2 2 2 2 3 3 ... - attr(*, "contrasts")=List of 2 ..$ cur.df$condition: chr "contr.treatment" ..$ cur.df$layer : chr "contr.treatment"

res = daesc_bb(y=cur.df$allele1_count, n=cur.df$total_count, subj=cur.df$sample_id, x=one_hot)

When I run this as is, I get the following error:

fixed-effect model matrix is rank deficient so dropping 1 column / coefficient cur.df.conditionPD NA Error in aod::betabin(cbind(y, n - y) ~ ., random = ~1, data = data.frame(y, : Initial values for the fixed effects contain at least one missing value.

I think the reason is in the design matrix, where the first two columns have essentially the same information. After dropping the first column in one_hot, it runs without error, but I want to double check whether this is the intended use of this variable when supplying multiple categorical variables.

Finally, I also want to add cell type information as another independent variable. In my case, cell type is not a categorical variable, since the data was generated using Visium. So for each cell type, the input data will be its estimated abundance. If I were to input all (1) condition, (2) layer and (3) cell type into x, how should I structure it?

Thank you!

gqi commented 1 year ago

Thank you for your interest in DAESC. DAESC automatically append an intercept column (1's across all cells) to the design matrix. That probably caused your design matrix to be rank deficient. Currently, I would suggest deleting one column in the design matrix. You could choose the (condition, layer) pair that is as baseline. Then the coefficients can be interpreted as difference compared to the baseline. I will update the software within a week or two to allow arbitrary design matrix.

There should not be any problem incorporating continuous covariates such as cell type abundance into the model. You can append the cell type abundance to the current design matrix.

Let me know if you have further questions!

kimkh415 commented 1 year ago

Thank you very much for your reply. Will proceed as you suggested.