randel / MIND

Using Bulk Gene Expression to Estimate Cell-Type-Specific Gene Expression via Deconvolution
https://randel.github.io/MIND/

Problems with 3D input arrays #5

Closed MiguelCos closed 3 years ago

MiguelCos commented 3 years ago

Hello,

I find this package awesome. I would like to ask your opinion on its potential application to our specific problem, and also report some difficulties I am having setting up the input data.

I am testing this approach to identify protein expression signatures from cancer xenografts using mass-spec proteomics. We want to identify signature proteins from tumors or stroma and potentially identify expression patterns within tumors or stroma.

We have bulk mass-spec data from the whole tumor+stromal region, but we can distinguish both by identifying proteins that are either specific to human (tumor) or mouse (stroma). Therefore we have two expression matrices that can each be associated with a specific tissue region.

We don't have single-cell data, but I managed to generate a signature matrix by mining the Human Protein Atlas for single-cell-specific expression patterns of cell types that I expect to find in stroma or tumor tissue.

With this, I can generate cell fraction matrices using est_frac for both human (tumor) and mouse (stroma).

Then I am generating two 3D arrays: bulk input and frac input.

Bulk input:

bulk_array <- abind::abind(data_log2_med_normimpaft_mat, 
                           data_log2_med_normimpaft_mat_hs, 
                           along = 3)

Frac input:

frac_array <- abind::abind(cell_fraction_mouse,
                           cell_fraction_human, 
                           along = 3)

I get an error when executing bMIND2, which I understand is related to the way I set up my arrays:

deconv_bayes <- bMIND2(bulk_array, frac_array)
## [1] "1470 errors"
## List of 1
##  $ :List of 2
##   ..$ message: chr "incorrect number of dimensions"
##   ..$ call   : language X[j, ]
##   ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
## NULL
## Error in `rownames<-`(`*tmp*`, value = names(res)): attempt to set 'rownames' on an object with no dimensions

I have compared my arrays with the example arrays, and it seems evident that mine need some tuning, but I am having trouble setting them up.

Two questions:

  1. Could you share a little bit more on how to generate the 3D arrays (in R) that you use in your example in the context of your biological question that you very nicely address in the publication(s)?
  2. Would you have any comments on the approach I am describing in which I'll use your tool? I would like to corroborate that I am understanding its application correctly and that it would actually be helpful for my current analysis.

Many thanks for taking the time to read!

Best wishes, Miguel

randel commented 3 years ago

Hi Miguel,

For bMIND2, there is no need for a 3D array. The example data happened to come as a 3D array, which I converted to a matrix for convenience. bMIND2 only needs a matrix of bulk data (gene x sample) and a matrix of fractions (sample x cell type). Please let me know if you have further questions on the input format.
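The input format described above can be sketched in base R. This is only an illustration on simulated data: the object names, sizes, and values are hypothetical stand-ins, and the bMIND2 call is left as a comment, copied from the call shown earlier in this thread, rather than executed. The sketch also reproduces the "incorrect number of dimensions" message, which base R raises whenever a 3D array is indexed with two subscripts.

```r
# Simulated stand-ins for the real data (all names and sizes are hypothetical).
set.seed(1)
n_protein <- 5
n_sample <- 4
n_celltype <- 3

# Bulk data: features (genes/proteins) in rows, samples in columns.
bulk <- matrix(rnorm(n_protein * n_sample), nrow = n_protein,
               dimnames = list(paste0("protein", 1:n_protein),
                               paste0("sample", 1:n_sample)))

# Fractions: samples in rows, cell types in columns.
frac <- matrix(runif(n_sample * n_celltype), nrow = n_sample,
               dimnames = list(colnames(bulk),
                               paste0("celltype", 1:n_celltype)))
frac <- frac / rowSums(frac)  # each sample's fractions sum to 1

# Stacking matrices along a third dimension gives a 3D array ...
bulk_array <- array(c(bulk, bulk), dim = c(n_protein, n_sample, 2))

# ... and indexing a 3D array with two subscripts fails in base R,
# matching the message reported for `X[j, ]` in the error output.
err <- tryCatch(bulk_array[1, ], error = function(e) conditionMessage(e))
print(err)

# Sanity checks on the 2D inputs before calling the package.
stopifnot(is.matrix(bulk), is.matrix(frac),
          length(dim(bulk)) == 2,
          ncol(bulk) == nrow(frac),
          identical(colnames(bulk), rownames(frac)))

# With the MIND package loaded, the call from this thread would then be
# (not run here):
# deconv_bayes <- bMIND2(bulk, frac)
```

For the data in this issue, that would mean one bulk/fraction pair for the human (tumor) matrix and another for the mouse (stroma) matrix, each checked and passed separately.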

Are you focusing on protein expression data, even for single-cell proteomics? It seems that there is RNA gene expression on the protein atlas website as well.

MiguelCos commented 3 years ago

Hello randel,

Many thanks for your answer.

I will try setting up simple matrices for each tissue region separately, then.

> Are you focusing on protein expression data, even for single-cell proteomics? It seems that there is RNA gene expression on the protein atlas website as well.

I am not sure I understand your question. We only have two matrices of normalized protein expression data, arising from mass spectrometry of the whole tissues (I would call this bulk data).

We don't have any single-cell proteomics data of our own, but we are using the compiled single-cell expression data from the Human Protein Atlas (which I think is RNA-based) to generate our signature matrix, and then using est_frac to estimate the fraction of cells per sample from our protein expression data.

I understand that the correlation between RNA levels and protein levels is about 0.6, but I couldn't find any summarized resource of single-cell proteomics data.

Would you have any argument against this kind of approach for the use of the MIND package?

randel commented 3 years ago

Yes, you can analyze each tissue region separately. If you'd like to do the estimation with multiple tissue regions together, you can set the sample_id option to the subject IDs, e.g., 1, 1, 2, denoting two samples from subject 1 and one sample from subject 2. bMIND will then do the estimation and testing at the subject level. But this seems not to apply to your data, since you have two matrices from mouse and human?
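As a small illustration of the sample_id idea above, the subject ID vector can be built and checked in base R before it is handed to the package. The sample names below are hypothetical, and the commented-out bMIND call is an assumption about how the option would be passed, not something confirmed in this thread.

```r
# Hypothetical sample names; the bulk matrix columns would be in this order.
sample_names <- c("subj1_regionA", "subj1_regionB", "subj2_regionA")

# One subject ID per bulk column: two samples from subject 1, one from subject 2.
sample_id <- c(1, 1, 2)
names(sample_id) <- sample_names

# Check the vector lines up with the samples, then count samples per subject.
stopifnot(length(sample_id) == length(sample_names))
samples_per_subject <- table(sample_id)
print(samples_per_subject)

# With the MIND package loaded, the vector would be passed through the
# sample_id option (the exact call is an assumption, not run here):
# fit <- bMIND(bulk, frac, sample_id = sample_id)
```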

For your second question, do you mean that you can use scRNA-seq data as a reference/signature matrix to deconvolve protein expression data? I have not seen people doing it this way, but it may work if RNA-seq data is proportional to protein expression. Please let me know if the estimated cell-type fractions make sense to you.