jdblischak / singleCellSeq

Batch effects and the effective design of single-cell gene expression studies
http://jdblischak.github.io/singleCellSeq/analysis
Other
97 stars 68 forks source link

A query about dealing with bam files #58

Closed WT215 closed 5 years ago

WT215 commented 5 years ago

Hi,

Following the "Read mapping" section in your paper, now I have successfully obtained 2592 bam files for the single cells.

I am going to do the next step :

To obtain gene-level counts, we assigned reads to protein-coding genes (Ensembl GRCh37 release 82) and the ERCC spike-in genes using featureCounts.

When I look into the file singleCellSeq/code/Snakefile, it seems that I need to merge, sort and index those bam files.

Before those steps, should I merge bam files at first so that each bam file represents one cell? (Currently I have 2592 bam files. Since there are 864 cells, I guessed that every three of bam files correspond to one cell? Does three mean the library was sequenced in three lanes?)

Thank you very much!

Best wishes, Wenhao

jdblischak commented 5 years ago

Before those steps, should I merge bam files at first so that each bam file represents one cell?

Yes, you are correct. This is a critical step. You need to combine them into a single file before deduplicating the UMIs. If you did each file separately and then combined them, the UMI counts would be inflated because any given UMI might be present in more than one of the BAM files.

Does three mean the library was sequenced in three lanes?

Yes. Each master mix was sequenced in 3 lanes, and thus there are 3 FASTQ (or BAM) files per single cell sample.

When I look into the file singleCellSeq/code/Snakefile, it seems that I need to merge, sort and index those bam files.

You don't want to pay much attention to that Snakefile. I used it to analyze a small pilot data set. Because the HPC cluster we used to perform this analysis couldn't handle a huge Snakemake-based pipeline, I had to switch to using array jobs. The full description of the steps is in Process sequence data. If you've mapped the samples, the next step is process the BAM files and then combine the single cell samples.

The scripts were written to use Sun Grid Engine array jobs. You can either adapt these for use with your cluster's scheduling software, or you can extract the relevant code into your own Snakemake pipeline.

My more recent projects use Snakemake to process the entire data set (we switched to using a better HPC cluster), so these may be helpful if you want to implement a Snakemake pipeline. For example, for the project singlecell-qtl, the Snakefile first combines the FASTQ files prior to mapping.

WT215 commented 5 years ago

@jdblischak Thank you so much! I am crystal clear now!

Best wishes, Wenhao