kkang7 / CDSeq_R_Package

CDSeq R Package

input to cellTypeAssignSCRNA #12

Open gbonifazi opened 2 years ago

gbonifazi commented 2 years ago

Hello there!

I have been using the CDSeqR package for the past month and I have a question about the usage of the function cellTypeAssignSCRNA.

My understanding is that for the input sc_gep I have to use the GEP as counts and not normalised data. But what is the reason? Considering the first step of the function (cell type assignment using the input sc_gep), I would have thought that the format required for sc_gep was normalised data. In fact, the function computes the correlation between sc_gep and cdseq_gep where the latter is in the format of normalised data. Would it be wrong then to use normalised data for sc_gep?

Thanks in advance for your time and help.

kkang7 commented 2 years ago

@gbonifazi Hi there, the reason we need sc_gep to be counts is that we use the CDSeq-estimated GEPs to generate pseudo-count data, merge them with sc_gep, and then feed the merged data into the pipeline for clustering etc. If you use a normalized sc_gep, it would be a bit tricky to merge it with the pseudo-counts and cluster them. The correlation is another way to assess the relationship between sc_gep and the CDSeq-estimated cell type GEPs. For that step alone, you don't really need sc_gep to be counts, but you need to make sure it is normalized in the same way as the CDSeq output GEPs. But we do both the correlation calculation and the clustering, so we need sc_gep to be counts.

mdepitta commented 2 years ago

Dear Kai (@kkang7), I want to chip in on this conversation, as I am eager to use CDSeq in my lab to deconvolve neurodegenerative disease-specific genes from whole-cell gene databases. We are facing some difficulties deploying your software to this end, and I kindly ask for your input. I have a few questions revolving around the two main functions of the CDSeq package:

The CDSeq(...) function.

  1. We want to provide a custom `refGEP` for our data. Looking at your refGEP matrix, it appears to be in the form of genes-by-six cell types. The entries are integers. What do they represent?
  2. Concerning the cell_GEP output from the CDSeq(...) function: this is a genes-by-cell-types matrix, but the entries appear to be normalized. What is the normalization factor? Can I retrieve it from CDSeq(...)?

The function cellTypeAssignSCRNA(...).

  1. One of the input arguments of this function is sc_gep. This is a genes-by-N-cells matrix of positive integers. What do these values represent? How does sc_gep differ from the refGEP matrix?
  2. Another point of concern is that even if I attempt to provide my custom sc_gep, somewhere between L200-201 in the code of cellTypeAssignSCRNA(...) (in the GitHub May-2-2021 version, it appears at line 332), the dispersion is computed by what seems to be a wrong formula, namely mu^2/(var-mu).

Thank you in advance for your consideration of the matter and help.

Sincerely,

Maurizio

mdepitta commented 2 years ago

Dear @kkang7, Any update on my original post? I hope this thread is still open on your end. Otherwise, please let me know where to contact you privately. Sincerely, M

kkang7 commented 2 years ago

@mdepitta sorry for the late response.

> The CDseq(...) function.
>
> 1. We want to provide a custom `refGEP` for our data. Looking at your refGEP matrix, it appears to be in the form of Genes-by-six cell types. The entries are integer numbers. What do they represent?

The refGEP in the examples was RNA-seq read-count data from pure cell lines of six cell types, and the bulk samples were synthetic mixtures of those six cell types. That was a toy example. Recently, I added the cellTypeAssignSCRNA(...) function for cell type annotation after deconvolution. So basically, you can ignore the refGEP input of the CDSeq function and annotate the CDSeq-estimated cell types after deconvolution.

> 2. Concerning the cell_GEP output from the CDSeq(...) function: this is a matrix of a genes-by-cell type. But the entries appear to be normalized. What is the normalization factor? Can I retrieve it from CDSeq(...).

The estimated GEP in the output is normalized as a multinomial parameter, i.e. the entries of each cell type's profile sum to 1. Basically, you can think of the GEP as normalized by a/sum(a), where a is the read-count profile of a cell type (a column vector). You can retrieve the unnormalized GEP, in the form of read counts, from the celltypeassignsplit variable in the output.
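
For illustration only, the a/sum(a) normalization described above can be sketched as follows. This is plain Python rather than R, it is not taken from the CDSeq source, and the function name `normalize_gep` and the example counts are made up for the sketch.

```python
def normalize_gep(counts):
    """Scale one cell type's read-count profile so that its entries sum to 1."""
    total = sum(counts)
    if total == 0:
        # An all-zero profile has no defined proportions.
        raise ValueError("profile has no reads; cannot normalize")
    return [c / total for c in counts]


# Hypothetical read counts for four genes in a single cell type (one column).
read_counts = [120, 30, 0, 50]
gep = normalize_gep(read_counts)
print(gep)       # per-gene proportions
print(sum(gep))  # sums to 1 up to floating-point rounding
```

Applied column by column, the same step would turn a genes-by-cell-types count matrix into profiles on the same scale as the normalized GEP output.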

> The function cellTypeAssignSCRNA(...).
>
> 3. One of the input arguments of this function is sc_gep. This matrix is a genes-by-N cell matrix with positive integer numbers. What are such values? What is the difference of sc_GEP with respect to the refGEP matrix?
> 4. Another point of concern is that even if I attempt to provide my custom sc_gep, somewhere between L200-201 in the code of cellTypeAssignSCRNA(...) (on the GitHub May-2-2021 version, it appears at line 332), dispersion is computed by what is seems a wrong formula, that is by mu^2/(var-mu).

The sc_gep is supposed to be a single-cell reference in the form of read counts (integers). In the CDSeq function, refGEP was designed to be a gene-by-cell-type matrix for when you have pure cell line references. But I would say you can ignore that input and do the annotation after deconvolution. The formula is based on var = mu + mu^2/size (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/NegBinomial.html), where size is the dispersion. Solving that relation for size gives size = mu^2/(var - mu), which is the expression you saw in the code.
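
As a rough check of that relation, here is a minimal method-of-moments sketch in Python. It is not the code from cellTypeAssignSCRNA(...): the function name `nb_size_from_moments`, the example counts, and the choice to return infinity when there is no overdispersion are all assumptions made for this illustration.

```python
from statistics import mean, variance


def nb_size_from_moments(counts):
    """Estimate the negative binomial 'size' from var = mu + mu^2 / size."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator), like R's var()
    if var <= mu:
        # No overdispersion: the relation has no finite positive solution,
        # which corresponds to the Poisson limit (size -> infinity).
        return float("inf")
    return mu ** 2 / (var - mu)


# Hypothetical read counts of one gene across ten cells.
counts = [0, 2, 3, 7, 1, 12, 4, 0, 9, 2]
size = nb_size_from_moments(counts)

# Plugging size back in recovers the sample variance.
mu = mean(counts)
print(size, mu + mu ** 2 / size, variance(counts))
```

The guard for var <= mu matters in practice: for genes whose counts are not overdispersed, mu^2/(var - mu) is negative or undefined, so any real implementation needs some rule for that case.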