r3fang / SnapATAC

Analysis Pipeline for Single Cell ATAC-seq
GNU General Public License v3.0
298 stars 124 forks source link

How to perform Barcode selection if don't have a 10X CSV file? #139

Open dioncst opened 4 years ago

dioncst commented 4 years ago

Hi~ I got my data(.bam file) at http://atlas.gs.washington.edu/mouse-atac/data/. And I generated snap file using the bam file according to your instruction. But I don't have a CSV file that conatin the information as your example (atac_v1_adult_brain_fresh_5k_singlecell.csv, http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/atac_v1_adult_brain_fresh_5k_singlecell.csv) I'm wondering how can I perform barcode selection under this circumstance?

Best, Shitao

r3fang commented 4 years ago

Hello,

You can select cells based on "coverage" and "promoter ratio" - percentage of reads mapped to promoters per cell

> library(GenomicRanges);
> system("wget http://renlab.sdsc.edu/r3fang/share/Fang_2019/MOs_snATAC/genes/promoter.bed");
> promoter.df = read.table("promoter.bed");
> promoter.gr = GRanges(promoter.df[,1], IRanges(promoter.df[,2], promoter.df[,3]));
> ov = findOverlaps(x.sp@feature, promoter.gr);
> idy = queryHits(ov);
> cov = log10(SnapATAC::rowSums(x.sp, mat="bmat")+1)
> promoter_ratio = SnapATAC::rowSums(x.sp[,idy, mat="bmat"], mat="bmat") / SnapATAC::rowSums(x.sp, mat="bmat");
> plot(cov, promoter_ratio, cex=0.5, col="grey", xlab="log(count)", ylab="FIP Ratio", ylim=c(0,1 ));
# chose the cutoff based on the plot
> idx = which(promoter_ratio > 0.2 & promoter_ratio < 0.8 & cov > 3);
> x.sp = x.sp[idx,];
r3fang commented 4 years ago

"promoter.bed" is a list of promoters in mouse genome (mm10), you might need to find another list for mm9

dioncst commented 4 years ago

"promoter.bed" is a list of promoters in mouse genome (mm10), you might need to find another list for mm9

Sorry, I'm not abale to download this "promoter.bed" file through the link. How is it generated?

dioncst commented 4 years ago

"promoter.bed" is a list of promoters in mouse genome (mm10), you might need to find another list for mm9

Could you please send me this promoter.bed file if possible,my email is dioncst@sjtu.edu.cn. I thik I can use UCSC liftover to convert it to mm9. Or I can use CrossMap to convert my .bam to mm10.

r3fang commented 4 years ago

Hi,

I have create a folder that contains reference files required for the analysis: http://renlab.sdsc.edu/r3fang/share/github/reference/

Inside the mm9 folder, you will find the gencode.vM1.annotation.gtf file which is GENCODE annotation for mm9 genome. Please download this file.

To select the useful barcode, I am using the 10X Mouse Brain dataset as an example here. The snap file used in this analysis can be found here

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rtracklayer")

library(rtracklayer)
library(GenomicRanges)
library(SnapATAC)

gtf.gr <- rtracklayer::import("gencode.vM19.annotation.gtf")
gene.gr <- gtf.gr[gtf.gr$type == "gene"]

# extract promoter region for each gene
promoter.gr <- reduce(promoters(gene.gr, upstream=2000, downstream=0))

# create the snap object
x.sp = createSnap("Mouse_Brain_10X/atac_v1_adult_brain_fresh_5k.snap", sample="MB")

# load the cell-by-bin matrix
x.sp = addBmatToSnap(x.sp)

# load the cell-by-bin matrix
ov = findOverlaps(x.sp@feature, promoter.gr);

# find promoter overlapping bins 
idy = queryHits(ov);
log_cov = log10(SnapATAC::rowSums(x.sp, mat="bmat")+1)
promoter_ratio = Matrix::rowSums(x.sp@bmat[,idy]) / Matrix::rowSums(x.sp@bmat);
plot(log_cov, promoter_ratio, cex=0.5, col="grey", xlab="log(count)", ylab="FIP Ratio", ylim=c(0,1 ));

image

# chose the cutoff based on the plot 
idx = which(promoter_ratio > 0.2 & promoter_ratio < 0.8 & log_cov > 3);
x.sp = x.sp[idx,]
x.sp
number of barcodes: 4108
number of bins: 546206
number of genes: 0
number of peaks: 0
number of motifs: 0
dioncst commented 4 years ago
gene.gr <- gtf.gr[gtf$type == "gene"]

Thank a lot~ But seems this line do not work with the GTF file I downloaded at ENCODE. Error in gtf.gr[gtf$type == "gene"] : object 'gtf' not found

So, I tried this .

gene.gr <-  gtf.gr[gtf.gr@elementMetadata@listData[["type"]] == "gene"]
r3fang commented 4 years ago

sorry, it is a typo. should be gene.gr <- gtf.gr[gtf.gr$type == "gene"]

trinevd commented 4 years ago

Hi, thanks for a great tool. I am in the learning process, so it might be naive question.

For barcode selection I did what you did in the answer I quoted here. I'm now wondering if the log(count) plotted on the x-axis in the end only contains unique fragments or also duplicate fragments? I have read in another issue that duplicates are removed (by default?) in the step where you use the command "snaptools snap-pre" , so I guess duplicates should not be in the data set anymore at this barcode selection step?

Best, Trine

Hi, I have create a folder that contains reference files required for the analysis: http://renlab.sdsc.edu/r3fang/share/github/reference/ Inside the mm9 folder, you will find the gencode.vM1.annotation.gtf file which is GENCODE annotation for mm9 genome. Please download this file. To select the useful barcode, I am using the 10X Mouse Brain dataset as an example here. The snap file used in this analysis can be found here if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("rtracklayer")

library(rtracklayer) library(GenomicRanges) library(SnapATAC)

gtf.gr <- rtracklayer::import("gencode.vM19.annotation.gtf") gene.gr <- gtf.gr[gtf.gr$type == "gene"]

extract promoter region for each gene

promoter.gr <- reduce(promoters(gene.gr, upstream=2000, downstream=0))

create the snap object

x.sp = createSnap("Mouse_Brain_10X/atac_v1_adult_brain_fresh_5k.snap", sample="MB")

load the cell-by-bin matrix

x.sp = addBmatToSnap(x.sp)

load the cell-by-bin matrix

ov = findOverlaps(x.sp@feature, promoter.gr);

find promoter overlapping bins

idy = queryHits(ov); log_cov = log10(SnapATAC::rowSums(x.sp, mat="bmat")+1) promoter_ratio = Matrix::rowSums(x.sp@bmat[,idy]) / Matrix::rowSums(x.sp@bmat); plot(log_cov, promoter_ratio, cex=0.5, col="grey", xlab="log(count)", ylab="FIP Ratio", ylim=c(0,1 ));

chose the cutoff based on the plot

idx = which(promoter_ratio > 0.2 & promoter_ratio < 0.8 & log_cov > 3); x.sp = x.sp[idx,] x.sp number of barcodes: 4108 number of bins: 546206 number of genes: 0 number of peaks: 0 number of motifs: 0

JBreunig commented 4 years ago

idx = which(promoter_ratio > 0.2 & promoter_ratio < 0.8 & log_cov > 3); x.sp = x.sp[idx,] x.sp

This seems to mess up the corresponding downstream metadata. e.g. I get errors on the scaleCountMatrix that the metaData numbers don't correspond.

If this is the standard qc

barcodes.sel = barcodes[which(UMI >= 3 & UMI <= 5 & promoter_ratio >= 0.15 & promoter_ratio <= 0.6),]; rownames(barcodes.sel) = barcodes.sel$barcode; x.sp = x.sp[which(x.sp@barcode %in% barcodes.sel$barcode),]; x.sp@metaData = barcodes.sel[x.sp@barcode,]; x.sp

Should there be a line akin to this?

x.sp@metaData = barcodes.sel[x.sp@barcode,];

tzhu-bio commented 4 years ago

I did the barcode selection the way you provided, but here's what I got. Do you have any suggestions? Thanks image