ixxmu / mp_duty

抓取网络文章到github issues保存
https://archives.duty-machine.now.sh/
116 stars 30 forks source link

单细胞TPM数据遇上Seurat v5 #4795

Closed ixxmu closed 6 months ago

ixxmu commented 6 months ago

https://mp.weixin.qq.com/s/uFLgc62O-ER28vkNHJJWig

ixxmu commented 6 months ago

单细胞TPM数据遇上Seurat v5 by 生信星球


 今天是生信星球陪你的第937天

   大神一句话,菜鸟跑半年。我不是大神,但我可以缩短你走弯路的半年~
   就像歌儿唱的那样,如果你不知道该往哪儿走,就留在这学点生信好不好~

   这里有豆豆和花花的学习历程,从新手到进阶,生信路上有你有我!

0.单细胞的TPM数据

众所周知,正常的单细胞数据应该给我们提供count,但偏偏有一些数据是特殊的,给的是tpm数据。准确来说应该是log2(tpm/10+1)

为啥除以10,这么解释的:
TPM values were divided by 10 since we estimate the complexity of our single cell libraries to be on the order of 100,000 transcripts and would like to avoid counting each transcript ~10 times, as would be the case with TPM, which may inflate the difference between the expression level of a gene in cells in which the gene is detected and those in which it is not detected.

查阅seurat的github,对tpm数据的说明是

If you have TPM data, you can simply manually log transform the gene expression matrix in the object@data slot before scaling the data. You have to replace your object@data slot with the desired gene expression matrix as follows: sce.all@data = log(x = norm + 1))

在Seurat v5 版本之前,直接跳过NormalizeData 这个函数即可:

Yes, run CreateSeuratObject() and skip the data normalization step since you’re starting with normalized data

但是呢,当我们使用Seurat v5 ,粗暴的跳过NormalizeData 是会报错的。例如:

1.读取数据

dat = data.table::fread("GSE72056_melanoma_single_cell_revised_v2.txt.gz",data.table = F)
dat[1:4,1:4]

#
#                                                           Cell
## 1                                                        tumor
## 2                           malignant(1=no,2=yes,0=unresolved)
## 3 non-malignant cell type (1=T,2=B,3=Macro.4=Endo.,5=CAF;6=NK)
## 4                                                     C9orf152
##   Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb Cy71_CD45_D08_S524_comb
## 1                      72                        58                      71
## 2                       1                         1                       2
## 3                       2                         1                       0
## 4                       0                         0                       0

dat = dat[-(1:3),]
library(tidyverse)
dat = distinct(dat,Cell,.keep_all = T)
library(tibble)
dat = column_to_rownames(dat,"Cell")
dat[1:4,1:4]

#
#          Cy72_CD45_H02_S758_comb CY58_1_CD45_B02_S974_comb
## C9orf152                  0.0000                    0.0000
## RPS11                     9.2172                    8.3745
## ELMO2                     0.0000                    0.0000
## CREB3L1                   0.0000                    0.0000
##          Cy71_CD45_D08_S524_comb Cy81_FNA_CD45_B01_S301_comb
## C9orf152                  0.0000                      0.0000
## RPS11                     9.3130                      7.8876
## ELMO2                     2.1263                      0.0000
## CREB3L1                   0.0000                      0.0000

2.正常创建对象

library(Seurat)
sce.all <- CreateSeuratObject(counts = dat, 
                           project = "a"
                           min.cells = 3, 
                           min.features = 200)

3.质控

sce.all[["percent.mt"]] <- PercentageFeatureSet(sce.all, pattern = "^MT-")
head(sce.all@meta.data, 3)

##                           orig.ident nCount_RNA nFeature_RNA percent.mt
## Cy72_CD45_H02_S758_comb            a   7143.363         3365          0
## CY58_1_CD45_B02_S974_comb          a   8915.833         3637          0
## Cy71_CD45_D08_S524_comb            a  12578.256         4660          0

VlnPlot(sce.all, 
        features = c("nFeature_RNA",
                     "nCount_RNA"
                     "percent.mt"), 
        ncol = 3,pt.size = 0.5)
image.png

线粒体基因比例为0,另两个图也没有很利群的值,这个数据可以不过滤。

4.企图跳过NormalizeData

当我们企图跳过NormalizeData就会报错

sce.all = sce.all %>% 
  #NormalizeData() %>%  
  FindVariableFeatures() %>%  
  ScaleData()

## Error in `ScaleData()`:
## ! No layer matching pattern 'data' found. Please run NormalizeData and retry

明明4版本这样做是木有问题的嘞。

5.查呀!

各种细枝末节都可以在seurat github页面上找到答案:

In the previous versions, “data” was automatically populated with the raw counts prior to running NormalizeData, which is no longer the case in v5. Therefore, if you run ScaleData and the “data” layer does not exist, this will not work.

If you want to run ScaleData on the raw counts (which I wouldn’t recommend doing in most cases), you could do the following: sce.all_small <- ScaleData(sce.all_small, layer = “counts”) https://github.com/satijalab/seurat/issues/8003

所以解决办法就是加上个参数layer = “counts”。虽然叫counts但我们创建对象的时候放了个什么东西进去自己心里没点数嘛?那其实是tpm😄。

sce.all = sce.all %>% 
  #NormalizeData() %>%  
  FindVariableFeatures() %>%  
  ScaleData(layer = "counts") %>%
  RunPCA(pc.genes = pbmc@var.genes)  %>%
  FindNeighbors(dims = 1:15) %>% 
  FindClusters(resolution = 0.5) %>% 
  RunUMAP(dims = 1:15) 

#
# Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#
## Number of nodes: 4645
## Number of edges: 150415
#
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 16
## Elapsed time: 0 seconds

后面就可以正常进行啦,顺便把这个数据做完

6.还能做个singleR

# 注释
library(celldex)
library(SingleR)
#ref <- celldex::HumanPrimaryCellAtlasData()
#因为下载太慢,使用了提前存好的本地数据
ref <- get(load("ref_Human_all.RData"))
library(BiocParallel)
pred.scRNA <- SingleR(test = sce.all@assays$RNA$counts, 
                      ref = ref,
                      labels = ref$label.main, 
                      clusters = sce.all@active.ident)
pred.scRNA$pruned.labels

##  [1"T_cells"           "T_cells"           "B_cell"           
##  [4"Neurons"           "Neurons"           "T_cells"          
##  [7"Monocyte"          "T_cells"           "Tissue_stem_cells"
## [10"Neurons"           "Neurons"           "Endothelial_cells"
## [13"T_cells"           "Fibroblasts"       "T_cells"          
## [16"Neurons"

plotScoreHeatmap(pred.scRNA, clusters=pred.scRNA@rownames, fontsize.row = 9,show_colnames = T)
new.cluster.ids <- pred.scRNA$pruned.labels
names(new.cluster.ids) <- levels(sce.all)
levels(sce.all)

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [16] "15"

sce.all <- RenameIdents(sce.all,new.cluster.ids)
levels(sce.all)

## [1] "T_cells"           "B_cell"            "Neurons"          
## [4] "Monocyte"          "Tissue_stem_cells" "Endothelial_cells"
## [7] "Fibroblasts"

p = UMAPPlot(object = sce.all, pt.size = 0.5, label = TRUE)
p
如果因为代码看不懂,而跟不上正文的节奏,可以来找我,相当于来一个新手保护期。我的课程都是循环开课。下一期的时间,点进去咨询微信咯
生信分析直播课程
生信新手保护学习小组