raymondlouie / MiniMarS

Error with citeFuse - using subsamples (seed=8) of dataset1_LEU #26

Open HsiaoChiLiao opened 1 year ago

HsiaoChiLiao commented 1 year ago

Hi Ray,

dataset: dataset1_97antibodies_BoneMarrow_human_LEU_all_31586cells_CLRnorm.RDS

pkg version: (downloaded at 10:30pm on 3 Apr)

packageVersion("ClusterMarkers")
[1] ‘0.1.2’

My code:

#dat1.leu.all
sce <- readRDS(file = paste0(inpath, "dataset1_97antibodies_BoneMarrow_human_LEU_all_31586cells_CLRnorm.RDS"))

input_matrix = t(sce@assays@data$counts) #31586    97
clusters = sce$cell_type

# SCE input example. 
#no further normalisation has been done
#31586    97
sce_in = processInputFormat(sc_object=sce,
                            sce_cluster="cell_type",
                            verbose=TRUE)

# select a subset of clusters to identify markers for
sc_in = sce_in # As an example, select the SCE input
cluster_selection_out= processClusterSelection(sc_in,
                                               clusters_sel=unique(clusters),
                                               verbose=TRUE)
dim(cluster_selection_out$matrix) #31586    97
length(cluster_selection_out$clusters) #31586

# subsampling
final_out = processSubsampling(cluster_selection_out,
                               subsample_num=1000,
                               train_test_ratio = 0.5,
                               cluster_proportion= "proportional",
                               verbose=TRUE,
                               seed = 8)

print(dim(final_out$training_matrix)) #563  97
print(dim(final_out$test_matrix)) #563  97

list_markers = findClusterMarkers(final_out$training_matrix,
                                  final_out$training_clusters,
                                  num_markers=15,
                                  method="citeFuse",
                                  verbose=TRUE)

Error message from findClusterMarkers with method="citeFuse":

Using the following method(s): citeFuse
Methods used in this analysis: citeFuse

Caclulating markers using citeFuse.

Error in randomForest.default(t(as.matrix(exprsMat[, idx])), as.factor(droplevels(group)[idx]),  : 
  Can't have empty classes in y.
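
For context, randomForest stops with this message when the response factor contains a level that has no observations. Below is a minimal base-R sketch of how that can arise from the pattern visible in the call, `as.factor(droplevels(group)[idx])`: the levels are dropped before the subset is taken, so a level whose cells all fall outside `idx` survives as an empty class. The toy `group` and `idx` values are invented for illustration and are not taken from the dataset.

```r
# Toy labels: cluster "C" has a single cell (illustrative values only).
group <- factor(c("A", "A", "B", "B", "C"))
idx <- 1:4  # a subset that happens to exclude the only "C" cell

# Same pattern as in the error call: droplevels() runs before the subset,
# so the now-empty level "C" is kept in the result.
y <- as.factor(droplevels(group)[idx])
tab <- table(y)
empty_before <- names(tab)[tab == 0]

# Dropping unused levels after subsetting removes the empty class.
y_fixed <- droplevels(group[idx])
tab_fixed <- table(y_fixed)
empty_after <- names(tab_fixed)[tab_fixed == 0]

print(tab)
print(tab_fixed)
```

Running `table()` on the labels that are actually passed to the classifier is a quick way to spot a class with a zero count.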

A similar subsample size works for HC samples from dataset1_97antibodies_BoneMarrow_human_HC_all_49057cells_CLRnorm.RDS

...
print(dim(final_out$training_matrix))
print(dim(final_out$test_matrix))
[1] 541  97
[1] 541  97
Using all methods.
Methods used in this analysis: citeFuse, sc2marker, geneBasis, xgBoost

Caclulating markers using citeFuse.

Caclulating markers using sc2marker.
...

Thank you!

HsiaoChiLiao commented 1 year ago

I am also doing multiple runs and found that the same error message appears for both HC and LEU samples with different seeds:

HC samples seed = 4

# [1] 4
print(dim(final_out$training_matrix))
print(dim(final_out$test_matrix))
[1] 541  97
[1] 541  97
Using all methods.
Methods used in this analysis: citeFuse, sc2marker, geneBasis, xgBoost

Error in randomForest.default(t(as.matrix(exprsMat[, idx])), as.factor(droplevels(group)[idx]),  : 
  Can't have empty classes in y.
Calls: findClusterMarkers ... lapply -> FUN -> <Anonymous> -> randomForest.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

LEU samples seed = 6

# [1] 6
print(dim(final_out$training_matrix))
print(dim(final_out$test_matrix))
[1] 563  97
[1] 563  97
Using all methods.
Methods used in this analysis: citeFuse, sc2marker, geneBasis, xgBoost

Caclulating markers using citeFuse.

Error in randomForest.default(t(as.matrix(exprsMat[, idx])), as.factor(droplevels(group)[idx]),  : 
  Can't have empty classes in y.
Calls: findClusterMarkers ... lapply -> FUN -> <Anonymous> -> randomForest.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

anglixue commented 1 year ago

Hi Hsiao-chi

Can you confirm whether input_matrix should be t(sce@assays@data$counts) or sce@assays@data$counts? According to the instructions, rows should be features and columns should be cells.

HsiaoChiLiao commented 1 year ago

Hi Angli,

Thanks for the reminder. But the object input_matrix wasn't used in the marker finding analysis. That was me checking the dimension of the dataset.

What really went into the analysis was the 'sce' object.

#dat1.leu.all
sce <- readRDS(file = paste0(inpath, "dataset1_97antibodies_BoneMarrow_human_LEU_all_31586cells_CLRnorm.RDS"))

sce_in = processInputFormat(sc_object=sce,
                            sce_cluster="cell_type",
                            verbose=TRUE)

raymondlouie commented 1 year ago

I can't seem to reproduce the error : ( Do you mind sending me your final_out variable so I can try to reproduce? Thanks.

anglixue commented 1 year ago

Hi Hsiao-chi, when the sce object is converted into the required format, should the rows be the features?

From this line, dim(cluster_selection_out$matrix) #31586 97, I can see that the features are in the columns. Will this affect the results?

HsiaoChiLiao commented 1 year ago

Hi @anglixue ,

Just tested that using the sce data from the package:

library(ClusterMarkers)
data(sce)
sce
# class: SingleCellExperiment 
# dim: 192 1000 

### First, we convert the input to the desired format required for downstream analysis, showing all three input data examples:
# The 'input_matrix' should be formatted as feature x cell matrix
input_matrix <- sce@assays@data$counts
# The 'clusters' should be a vector of cell cluster annotations corresponding to each cell (i.e., column of the input_matrix)
clusters = sce$cell_type
sc_in = processInputFormat(sc_object = input_matrix,
                           clusters_all = clusters,
                           verbose = TRUE)

### Second, we select a subset of clusters (clusters_sel) to identify markers for. Default is using all clusters.
clusters_sel = c("CD4-positive, alpha-beta memory T cell",
                 "naive thymus-derived CD8-positive, alpha-beta T cell")

cluster_selection_out= processClusterSelection(sc_in,
                                               clusters_sel = clusters_sel,
                                               verbose = TRUE)

dim(cluster_selection_out$matrix)
# [1] 306 192

cluster_selection_out$matrix[1:5,1:5]
#                      CD80 CD86 CD274 CD273 CD275
# AATCCAGAGATAGCAT-1_1    9    1     3     6     2
# AGGCCGTAGCTTATCG-1_1    8    0    10     2     2
# AGTAGTCCAAGCGATG-1_1   11    0     5     6    13
# AGTAGTCTCTAACTGG-1_1   12    0     2     7    20
# ATCCGAACAGCTGTGC-1_1   16    1    13    13     5

We can see that the data matrix in cluster_selection_out ends up as cells x features. I don't think this is what causes the citeFuse error; otherwise no run would produce results, whereas the error only occurs with certain seeds.
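
A quick way to confirm which orientation a matrix is in is to compare its dimensions with the length of the cluster vector. The sketch below uses a made-up 3 x 5 toy matrix and hypothetical names; it does not call any ClusterMarkers function.

```r
# Toy data: 3 features (antibodies) measured on 5 cells; values are made up.
feature_by_cell <- matrix(1:15, nrow = 3,
                          dimnames = list(paste0("AB", 1:3),
                                          paste0("cell", 1:5)))
clusters <- c("T", "T", "B", "B", "NK")  # one label per cell

# In a feature x cell matrix, the cell labels line up with the columns.
cells_in_columns <- ncol(feature_by_cell) == length(clusters)

# Transposing gives the cell x feature layout, where labels match the rows.
cell_by_feature <- t(feature_by_cell)
cells_in_rows <- nrow(cell_by_feature) == length(clusters)

print(dim(feature_by_cell))
print(dim(cell_by_feature))
```

The check is only informative when the number of cells differs from the number of features, which is the case for both datasets in this thread.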

anglixue commented 1 year ago

Thanks. There might be a separate issue. It seems the function transposes the input matrix internally? @raymondlouie

raymondlouie commented 1 year ago

Hi @HsiaoChiLiao, @anglixue and all, I think I've figured out what the error is. If you plot a histogram of the total counts (library size), there is one outlier in the training dataset, corresponding to a cell with zero or very low counts. If I remove this outlier, the code runs without error. I've now updated the citeFuseWrapper function to remove all cells with a library size below the 0.01 quantile:

# Remove cells with very low library size, which causes issues in CiteFuse
totalCount = rowSums(sce@assays@data$counts)
index_remove = which(totalCount < quantile(totalCount,0.01))
if (length(index_remove)>0){
    message(paste0(length(index_remove), " cell(s) with low library size have been removed.\n"))
    sce = sce[,-index_remove]
}
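
The same kind of filter can be sketched on a toy matrix. This sketch assumes a features x cells count matrix (the usual SingleCellExperiment layout), in which a cell's library size is a column sum; whether `rowSums` or `colSums` is the right call inside `citeFuseWrapper` depends on how the wrapper builds `sce`, which the thread does not show. The matrix, the planted empty cell and all variable names are invented for illustration.

```r
set.seed(1)
n_features <- 20
n_cells <- 200

# Toy features x cells count matrix with one planted empty cell.
counts <- matrix(rpois(n_features * n_cells, lambda = 5),
                 nrow = n_features, ncol = n_cells)
counts[, 1] <- 0

# Library size per cell = column sum under the features x cells assumption.
lib_size <- colSums(counts)
index_remove <- which(lib_size < quantile(lib_size, 0.01))

# Guard the subset: counts[, -integer(0)] would drop every column.
if (length(index_remove) > 0) {
  message(length(index_remove),
          " cell(s) with low library size have been removed.")
  counts_filtered <- counts[, -index_remove, drop = FALSE]
} else {
  counts_filtered <- counts
}

print(dim(counts_filtered))
```

Because the comparison is strict and the threshold is an interpolated quantile, the number of cells removed depends on ties among the smallest library sizes.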

HsiaoChiLiao commented 1 year ago

Thanks, Ray. Now I'm running the subsamples with your updated function.

HsiaoChiLiao commented 1 year ago

Hi @raymondlouie

I encountered the same error with other seeds (107 for HC, 108 for LEU). Your new filter is applied, but some runs still fail with the threshold of library size < 0.01 quantile.

Caclulating markers using citeFuse.

1 cell(s) with low library size have been removed.

Error in randomForest.default(t(as.matrix(exprsMat[, idx])), as.factor(droplevels(group)[idx]),  : 
  Can't have empty classes in y.
Calls: findClusterMarkers ... lapply -> FUN -> <Anonymous> -> randomForest.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

raymondlouie commented 1 year ago

Hi @HsiaoChiLiao , thanks for letting me know. Can you please send me the final_out objects?

raymondlouie commented 1 year ago

Thanks @HsiaoChiLiao, I've fixed this by removing more cells. If the data has already been properly QCed, I suspect this method may remove some useful cells. It is hard to choose a correct threshold, though. We don't have to do it now, but it might be useful later to check whether the cells removed are simply the cells with zero counts in the pre-normalized data set. If so, the filtering can be changed to remove only zero-count cells.
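
One way to run that check is to compare the set of cells flagged by the quantile filter with the set of zero-count cells. This is a sketch on made-up data, assuming a features x cells matrix of pre-normalized counts; `raw_counts` and the other names are hypothetical.

```r
set.seed(2)

# Toy pre-normalized counts: 10 features x 100 cells, one planted empty cell.
raw_counts <- matrix(rpois(10 * 100, lambda = 3), nrow = 10, ncol = 100)
raw_counts[, 5] <- 0

lib_size <- colSums(raw_counts)
zero_cells <- which(lib_size == 0)
quantile_cells <- which(lib_size < quantile(lib_size, 0.01))

# Cells the quantile filter removes although they are not empty,
# and empty cells that the quantile filter misses.
removed_not_zero <- setdiff(quantile_cells, zero_cells)
zero_not_removed <- setdiff(zero_cells, quantile_cells)

print(list(zero = zero_cells, quantile = quantile_cells))
```

If both differences are empty on the real data, a zero-count filter would remove the same cells as the quantile filter.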

HsiaoChiLiao commented 1 year ago

Hi @raymondlouie,

I've obtained more final_out objects (70 so far) that led to errors when running citeFuse. Please click here to access the files.

Version: ClusterMarkers_0.1.3

Thanks!

raymondlouie commented 1 year ago

Thanks @HsiaoChiLiao. The error occurred because the low-library-size filter in the previous version removed enough cells that some clusters were left with zero cells. I've now fixed this by removing those clusters. It should hopefully work now.
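
The idea can be sketched in base R: after filtering cells, find the clusters that no longer have any cells and build the response factor only from the labels that remain. This is an illustration with invented labels and a hypothetical `keep_cell` mask, not the code used in the package.

```r
# Toy labels before filtering; "NK" has a single cell (illustrative values).
clusters <- c("T", "T", "T", "B", "B", "NK")
keep_cell <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)  # hypothetical filter result

clusters_kept <- clusters[keep_cell]

# Clusters present before the filter but absent afterwards.
lost_clusters <- setdiff(unique(clusters), unique(clusters_kept))
if (length(lost_clusters) > 0) {
  message("Cluster(s) with no cells after filtering: ",
          paste(lost_clusters, collapse = ", "))
}

# Build the response factor only from labels that still have cells.
y <- factor(clusters_kept)
print(table(y))
```

Reporting the lost clusters makes it clear to the user that markers will not be computed for them.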