ixxmu / mp_duty

抓取网络文章到github issues保存
https://archives.duty-machine.now.sh/
122 stars 30 forks source link

如何突破单细胞数据获取的门槛:从GEO到Cell Ranger #5243

Closed ixxmu closed 4 months ago

ixxmu commented 4 months ago

https://mp.weixin.qq.com/s/DL6Hv4XPmpsOnTi1KQ9WJA

ixxmu commented 4 months ago

如何突破单细胞数据获取的门槛:从GEO到Cell Ranger by 生信菜鸟团

书接上回,一步步尝试代码复现,然后,我们就来到了Figure 2.I,乍看只是平平无奇的堆叠图嘛,殊不知这是多个外部数据集整理后的对比~

在文章的External dataset mapping部分,作者给出了这几个数据集的来源:

  1. Samples from De Jong et al. were downloaded from EBI (E-MTAB-9139), and myeloma samples were excluded (CBM1_1, CBM1_2, CBM2_1, CBM2_2, S9, S10, S11, S12, and S13). Count matrices were generated using Cell Ranger 4.0.0 with default parameters. Samples were integrated and CXCL12+ clusters were exported.

  2. Ennis et al. data from GSE227903 and used the SeuratDisk package to convert from the provided aggregated h5ad data to a Seurat object and reprocessed the data.

  3. Jardine et al.原文给的是EMBL-EBI的ID,但是,数据格式如下:


    https://github.com/haniffalab/FCA_bone_marrow

    于是在这里找到了:. The following data are also available to download as Scanpy h5ad objects with transformed counts via our interactive webportal: https://fbm.cellatlas.io/: i) DS FBM scRNA-seq, ii) non-DS FBM scRNA-seq, iii) CD34+ FBM, FL and CB CITE-seq, iv) FBM total CITE-seq. 获取到lH5AD 格式的文件,处理起来更有头绪~

  4. Data from Li et al. were downloaded from NCBI GEO (GSE190965) and directly read into Seurat and reprocessed as in our dataset described above.

  5. For Wang et al. data, we downloaded feature barcode matrices from NCBI GSE147287 and read into Seurat after concatenating the spliced and unspliced matrix files and reprocessed as above.

  6. For Triana et al. data, we retrieved the Seurat object directly from Figshare (https:// figshare.com/articles/dataset/Expression_of_97_surface_markers_and_462_mRNAs_in_70017_cells_from_healthy_young_healthy_ old_and_leukemic_human_bone_marrow/13397651).

哇噢!六个数据集,又可以get六个经验值,那就赶紧学习起来~

先从第一个数据集开始,上来就是fastq文件,需要cellranger加工一下,那就开始吧——

获取数据

E-MTAB-9139 < ArrayExpress <https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9139


这么大的数据,肯定是按需下载,只下载非疾病组的样本即可。我们应该如何对应上样本信息呢?

Detailed sample information and links to data [view table https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9139/sdrf ]根据这里来。➡幸运的是作者提供了sh代码,直接下载对应的fastq文件即可~

下载cellranger和参考基因组

[Download Cell Ranger - Official 10x Genomics Support https://www.10xgenomics.com/support/software/cell-ranger/downloads#download-links)

[Reference Release Notes - Official 10x Genomics Support https://www.10xgenomics.com/support/software/cell-ranger/latest/release-notes/cr-reference-release-notes#2020-a)

然后根据安装教程一步步来完成后续工作:

[Installing Cell Ranger - Official 10x Genomics Support https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-in#tutorial)

最重要的是别忘了添加Cellranger到环境路径中:

 export PATH=/home/data/t140334/Single_cellranger/cellranger-8.0.1:$PATH
 which cellranger

cellranger count 重要参数

参数作用
--id必需】唯一的运行 ID 字符串(如 sample345)。该名称是任意的,将用于命名包含所有管道生成的文件和输出的目录。只允许使用字母、数字、下划线和连字符(最多 64 个字符)。
--output-dir【非必要】用于存储运行结果的自定义输出目录的路径。如果不使用该参数,输出结果将被导入默认路径:/path/to/ID/outs/。
--transcriptome必需】包含 10x 兼容转录组参考文献的文件夹路径。
--fastqsfastq_path 文件夹的路径
--sample必需】提供给 FASTQ 生成软件的样本表中指定的样本名称。
--create-bam必需】启用或禁用 BAM 文件生成。设置--create-bam=false可减少总计算时间和输出目录的大小(不生成 BAM 文件)。
--feature-ref必需】特征条形码分析所必需。特征参考 CSV 文件的路径,该文件声明了实验中使用的特征条形码试剂。

运行

结合原文给的代码

/mnt/isilon/tan_lab/xuj5/software/cellranger4-0/cellranger-4.0.0/cellranger count --fastqs=/mnt/isilon/tan_lab/bandyopads/SB15_normal_huMSC_scRNA_deJong/data/CBM1_1/ --id=CBM1_1 --transcriptome=/mnt/isilon/tan_lab/chenc6/Tools/SingleCellAnnotation/GRCh38

还有官方示例:

[Running Cell Ranger count - Official 10x Genomics Support https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/running-pipelines/cr-gex-count#cr-count-gex-fb)

cd /home/jdoe/runs
cellranger count --id=sample345 \
           --transcriptome=/opt/refdata-gex-GRCh38-2020-A \
           --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path \
           --sample=mysample \
           --create-bam=true \
           --localcores=8 \
           --localmem=64

再根据自己的文件夹路径修改一下:

/home/data/t140334/Single_cellranger/cellranger-8.0.1/cellranger 

如果是多样本呢

多样本分析 [cellranger安装以及使用- https://blog.csdn.net/m0_59974475/article/details/138108362)

x=("24h_CK" "Ac_NPV_0h" "CK_0h")
for sample_id in "${x[@]}"; do
  fastq_path=".~/jwy/snrna-seq/$sample_id"  
  cellranger count --id $sample_id --fastqs ~/jwy/snrna-seq/$sample_id --transcriptome AcMNPV_genome --create-bam false --nosecondary
done

灵活变通一下:

x=("S10"  "S11"  "S12"  "S13"  "S9")
 for sample_id in "${x[@]}"; do   fastq_path=/home/data/t140334/DeJong_Preprocessing/$sample_id;   /home/data/t140334/Single_cellranger/cellranger-8.0.1/cellranger count --id=$sample_id --fastqs=$fastq_path --transcriptome=/home/data/t140334/References_CellRanger/refdata-gex-GRCh38-2020-A --create-bam false --nosecondary; done

绝对路径更保险!!!

还有个问题,这里其实应该规定一个output-dir的,这样文件输出会比较规整~

输出

看看自己的:

和官方的对比一下:

现在我有多个filtered_feature_bc_matrix.h5文件放在不同样本对应的文件夹下,我们的目的是把这些文件取出来放到一个新的文件夹下便于下游分析,该怎么办呢?

是否能通过sh脚本的方式来达到这样的目的呢?

首先创建一个jio本

nano extract_h5_files.sh

然后在编辑界面输入:

#!/bin/bash

# 目标文件夹
destination="/all_filtered_files" ##这里最好用绝对路径

# 创建目标文件夹
mkdir -p "$destination"

# 查找以 S 开头的文件夹
for s_dir in S*/; do
  # 查找 outs 文件夹中的 filtered_feature_bc_matrix.h5 文件
  h5_file="${s_dir}outs/filtered_feature_bc_matrix.h5"
  if [ -f "$h5_file" ]; then
    # 提取 S 文件夹的名称作为前缀
    prefix=$(basename "$s_dir")
    # 复制文件并添加前缀
    cp "$h5_file" "$destination/${prefix}_filtered_feature_bc_matrix.h5"
  fi
done

echo "所有文件已复制到 $destination"

感谢万能的chatgpt~

chmod +x extract_h5_files.sh
./extract_h5_files.sh

看看文件夹的内容是否与预期一致——

cellranger到seurat对象

这回我们已经拿到了五个样本的h5文件,常规流程走起来——

if(T){
  dir='ExternalDatasets/DeJong/' 
  samples=list.files( dir )
  samples 
  sceList = lapply(samples,function(pro){ 
    # pro=samples[5] 
    print(pro)  
   
    sce =CreateSeuratObject(counts =  Read10X_h5(file.path(dir,pro)) ,
                            project =  pro  ,
                            min.cells = 5,
                            min.features = 300  )
    return(sce)
  }) 
  do.call(rbind,lapply(sceList, dim)) 
  sce.all=merge(x=sceList[[1]],
                y=sceList[ -1 ],
                add.cell.ids = str_split(samples,'_',simplify = T)[,1]  ) 
}

names(sce.all@assays$RNA@layers)
sce.all[["RNA"]]$counts 
# Alternate accessor function with the same result
LayerData(sce.all, assay = "RNA", layer = "counts")
sce.all <- JoinLayers(sce.all)
dim(sce.all[["RNA"]]$counts )
# 24250 27633

as.data.frame(sce.all@assays$RNA$counts[1:101:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident) 

sce.all$orig.ident=str_split(sce.all$orig.ident,'[_]',simplify = T)[,1]
table(sce.all$orig.ident)
sce.all

完成!