lhe17 / nebula

GNU General Public License v2.0

Nebula is failing to recognize that there are the same number of subject ids as count columns #25

Closed AngCamp closed 1 year ago

AngCamp commented 1 year ago

I created a list like the sample_data you provide, with the model matrix. Here is its structure....

>str(dkkl1_nebula_g)

List of 4
 $ count :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:8465852] 2 3 7 9 11 14 17 20 21 23 ...
  .. ..@ p       : int [1:1165] 0 7897 16819 24435 32635 40432 48513 55924 60459 64383 ...
  .. ..@ Dim     : int [1:2] 23355 1164
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:23355] "00R-AC107638.2" "0610005C13Rik" "0610007P14Rik" "0610009B22Rik" ...
  .. .. ..$ : chr [1:1164] "B1_T6_K7_S83_mouse1" "D6_T3_H15_S91_mouse1" "E3_T6_A10_S146_mouse1" "B7_T6_A8_S144_mouse1" ...
  .. ..@ x       : num [1:8465852] 57 35 1 48 42 13 2 17 103 14 ...
  .. ..@ factors : list()
 $ id    : num [1:1164] 1 1 1 1 1 1 1 1 1 1 ...
 $ pred  : num [1:1164, 1:9] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:1164] "1" "2" "3" "4" ...
  .. ..$ : chr [1:9] "(Intercept)" "ConditionContext-Only:LabeltdT+" "ConditionFear-Only:LabeltdT+" "ConditionFear-Recall:LabeltdT+" ...
 $ offset: num [1:1164] 1 1 1 1 1 1 1 1 1 1 ...

I have grouped it with group_cell(), but when I run nebula on it, it does not recognize that the number of subject IDs equals the number of columns (cells) in the data. What am I doing wrong? The only difference I see between my object and your sample_data object is that mine contains the cell names.

Running nebula on the list above produces this error:

results.dkkl1.nebula <- nebula(dkkl1_nebula_g$count, dkkl1_nebula_g$sid,
                               pred=dkkl1_nebula_g$pred, ncore=2)

Error message:

Error in nebula(dkkl1_nebula_g$count, dkkl1_nebula_g$sid, pred = dkkl1_nebula_g$pred, : The length of subject IDs should be equal to the number of columns of the count matrix.
Traceback:

1. nebula(dkkl1_nebula_g$count, dkkl1_nebula_g$sid, pred = dkkl1_nebula_g$pred, 
 .     ncore = 2)
2. stop("The length of subject IDs should be equal to the number of columns of the count matrix.")

EDIT: Not sure if this could also be the issue, but the model I am trying to fit is as follows: dkkl1.nebula.df = model.matrix(~Condition:Label, data=dkkl1_nebula$pred)

Raghav1881 commented 1 year ago

The pred element of your list, dkkl1_nebula_g$pred, should not contain the model matrix. It should hold only the per-cell predictors that you use to build dkkl1.nebula.df, i.e. metadata from the original object. If your original object was a Seurat object, for example, you would set dkkl1_nebula_g$pred <- seurat_object$predictor and then build your model matrix from dkkl1_nebula_g$pred.
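A minimal sketch of that separation, using made-up metadata: the raw per-cell predictors stay in a data frame, and the design matrix is built from it with model.matrix(). The data frame `meta`, its values, and the formula are placeholders for illustration; only the column names Condition and Label come from this thread.

```r
# Hypothetical per-cell metadata: one row per cell (values are invented).
meta <- data.frame(
  Condition = factor(c("Context-Only", "Fear-Only", "Context-Only", "Fear-Only")),
  Label     = factor(c("tdT+", "tdT-", "tdT-", "tdT+"))
)

# Build the design matrix from the raw predictors (placeholder formula).
design <- model.matrix(~ Condition + Label, data = meta)

# One design row per cell, so it lines up with the columns of the count matrix.
stopifnot(nrow(design) == nrow(meta))
```

The design matrix built this way, rather than the raw data frame, is what would then be passed as the pred argument in the nebula() call.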

lhe17 commented 1 year ago

Hi AngCamp,

I'm not sure why my previous reply four days ago does not show up on this thread.

I think the error is in dkkl1_nebula_g$sid when used as an input for nebula. It should be dkkl1_nebula_g$id.
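A short sketch of why the wrong element name produces this particular error, relying on standard R list behavior: `$` with a name that does not exist returns NULL, which has length 0. The list `toy` is a hypothetical stand-in for dkkl1_nebula_g.

```r
# Toy list with the same element names as the grouped object (invented data).
toy <- list(
  count = matrix(1:6, nrow = 2, ncol = 3),  # 2 genes x 3 cells
  id    = c(1, 1, 2)                        # one subject ID per cell
)

is.null(toy$sid)                    # TRUE: there is no element named "sid"
length(toy$sid)                     # 0, which cannot equal ncol(toy$count)
length(toy$id) == ncol(toy$count)   # TRUE: "id" is the element that exists
```

Running names() on the list before calling nebula() is a quick way to confirm which element holds the subject IDs.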

Best regards,

Liang


AngCamp commented 1 year ago

Thanks I will try these things out.

AngCamp commented 1 year ago

Thanks, these two solutions fixed it. I think it's worth noting that it's a little unnecessarily confusing that you use data$sid in your tutorial. Also, I know most people will probably use a Seurat object, but it may be useful to explain how people working with standard CSVs can make an object that works with your package. Most data on GEO is also stored as .csv, so people working with publicly available data often won't be using sparse matrices, at least not for simple preprocessing like gene filtering.

I did the following:

# create counts for cell type(s) of interest, do gene filtering first
# in my case this gave me a dataframe called dkkl1.counts.df
# this can now be made into the counts matrix

library(Matrix)  # provides Matrix() and the sparse dgCMatrix class
dkkl1_nebula <- list()  # elements are added by name below
dkkl1_nebula$count <- Matrix(as.matrix(dkkl1.counts.df), sparse = TRUE)
dim(dkkl1_nebula$count)
dkkl1_nebula$count[1:5,1:5]
[1] 23355  1164
5 x 5 sparse Matrix of class "dgCMatrix"
               B1_T6_K7_S83_mouse1 D6_T3_H15_S91_mouse1 E3_T6_A10_S146_mouse1
00R-AC107638.2                   .                    .                     .
0610005C13Rik                    .                    .                     .
0610007P14Rik                   57                   13                     6
0610009B22Rik                   35                   27                    32
0610009E02Rik                    .                    .                     .
               B7_T6_A8_S144_mouse1 B4_T8_I19_S47_mouse1
00R-AC107638.2                    .                    .
0610005C13Rik                     .                    .
0610007P14Rik                   116                   26
0610009B22Rik                    76                    .
0610009E02Rik                     .                    6

Just a suggestion, could save a user some googling. Many of your users are also going to be biologists (like me) with limited programming experience and may not be familiar with sparse matrices. Might increase the user base if you can save them time with little things like this. Idiot proofing the tutorial for people like me can go a long way.
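For readers starting from a plain CSV, one possible sketch of building the sparse count matrix and a matching subject vector. The file name, the toy counts, and the rule for deriving the subject from the cell name are all assumptions for illustration; the cell names only imitate the pattern shown in this thread.

```r
library(Matrix)

# In practice the counts might come from a file, e.g. (hypothetical path):
# counts_df <- read.csv("counts.csv", row.names = 1, check.names = FALSE)
counts_df <- data.frame(
  A1_mouse1 = c(0, 57, 35),
  B2_mouse1 = c(0, 13, 27),
  C3_mouse2 = c(4, 0, 32),
  D4_mouse2 = c(0, 26, 0),
  row.names = c("geneA", "geneB", "geneC")
)

# Genes in rows, cells in columns, stored sparsely.
counts <- Matrix(as.matrix(counts_df), sparse = TRUE)

# Assumed naming rule: the subject is the text after the last underscore.
subject <- sub(".*_", "", colnames(counts))

# The condition the error message refers to: one subject ID per column (cell).
stopifnot(length(subject) == ncol(counts))
```

Deriving the subject vector from the column names of the matrix itself keeps the two in the same order, which matters once cells are filtered or reordered.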

AngCamp commented 1 year ago

It may help to add a small paragraph to the tutorial explaining the object nebula is expecting. I know it's easy to deduce by running str(sample_data) and reading the documentation of the functions, but it's easy to miss little things if they are not explicitly spelled out. A short paragraph could save a user a lot of time trawling through your documentation, arguably unnecessarily, since it would be quite easy to explain. Also, just to reiterate, many users are going to be biologists with limited programming experience. It will not occur to them to do the things I listed above. Seurat has a wide user base not just because it is the "best" package (arguably it is not), but because it has the best tutorials. Users can easily pick the package up and learn to use it.
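As one illustration of the kind of check such a paragraph could describe, here is a small hypothetical helper (check_nebula_input is not part of nebula) that compares the dimensions the error message in this thread refers to. It assumes a list with count, id, and pred elements, like the str() output at the top of the thread.

```r
# Hypothetical helper, not part of the nebula package.
check_nebula_input <- function(x) {
  n_cells <- ncol(x$count)
  c(
    id_matches   = length(x$id) == n_cells,  # one subject ID per cell
    pred_matches = NROW(x$pred) == n_cells   # one predictor row per cell
  )
}

# Invented example object: 3 genes x 4 cells.
toy <- list(
  count = matrix(0, nrow = 3, ncol = 4),
  id    = c(1, 1, 2, 2),
  pred  = data.frame(Condition = c("a", "a", "b", "b"))
)
check_nebula_input(toy)
```

A missing element shows up as FALSE here, because a NULL element has length 0.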

Thanks for the help =) By the way, it's appreciated.

lhe17 commented 1 year ago

Hi AngCamp,

Thank you for your suggestions. They will be considered in updated versions.

Best regards, Liang
