r3fang / SnapATAC

Analysis Pipeline for Single Cell ATAC-seq
GNU General Public License v3.0
307 stars 126 forks

Different barcodes between cellranger-atac and SnapATAC #58

Open xinhuang420 opened 5 years ago

xinhuang420 commented 5 years ago

Hi, Rongxin,

SnapATAC is fast and has good performance. I really like it.

When I used SnapATAC on my own 10X data, I found that the barcodes in the BAM file produced by 'snaptools align-paired-end' were completely different from the barcodes in the BAM file from cellranger-atac. The cellranger-atac documentation says it fixes occasional sequencing errors in barcodes.

So why doesn't SnapATAC correct the barcodes?

(https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/algorithms/overview Barcode Processing in cellranger-atac: This step is performed in order to fix the occasional sequencing error in barcodes so that fragments get associated with the original barcodes, thus improving data quality. The 16bp barcode sequence is obtained from the "I2" index read. Each barcode sequence is checked against a 'whitelist' of correct barcode sequences, and the frequency of each whitelist barcode is counted. We attempt to correct barcodes that aren't on the whitelist, by finding all whitelisted barcodes that are within 2 differences (Hamming distance <= 2) of the observed sequence, and scoring them based on the abundance of that barcode in the read data and quality value of the incorrect bases. An observed barcode not present in the whitelist is corrected to a whitelist barcode if it has > 90% probability of being the real barcode based on this model.)
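The correction step quoted above can be sketched roughly as follows. This is a simplified stand-in for illustration, not the cellranger-atac implementation: the function name `correct_barcode`, the pseudocount of 1, and the way abundance and base quality are combined into a score are all assumptions. Only the Hamming distance limit of 2 and the 90% threshold come from the quoted text.

```python
def correct_barcode(observed, quals, whitelist_counts,
                    max_dist=2, min_posterior=0.9):
    """Return a whitelist barcode for `observed`, or None if not confident.

    observed         -- barcode sequence read from the index read
    quals            -- Phred quality score for each base of `observed`
    whitelist_counts -- dict: whitelist barcode -> exact-match count in the data
    """
    if observed in whitelist_counts:
        return observed  # already a valid barcode, nothing to correct

    scores = {}
    for candidate, count in whitelist_counts.items():
        if len(candidate) != len(observed):
            continue
        mismatches = [i for i, (x, y) in enumerate(zip(observed, candidate))
                      if x != y]
        if len(mismatches) > max_dist:
            continue  # more than max_dist differences: not a candidate
        # Assumed model: probability that every mismatching base is a
        # sequencing error, taken from its Phred score.
        likelihood = 1.0
        for i in mismatches:
            likelihood *= 10 ** (-quals[i] / 10)
        # Assumed prior: observed abundance plus a pseudocount of 1.
        scores[candidate] = (count + 1) * likelihood

    total = sum(scores.values())
    if total == 0:
        return None  # no whitelist barcode within max_dist
    best = max(scores, key=scores.get)
    return best if scores[best] / total > min_posterior else None
```

With a scheme like this, a read whose barcode has one low-quality mismatch against an abundant whitelist barcode is reassigned to it, while a barcode equally close to two whitelist entries is left uncorrected.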

BTW, could you please tell me how to get the supplementary materials for 'Fast and Accurate Clustering of Single Cell Epigenomes Reveals Cis-Regulatory Elements in Rare Cell Types'? I want to learn more about SnapATAC.

Thank you very much!

Best, Xin

r3fang commented 5 years ago

Hi Xin,

1) The 10X barcodes all have “-1” as a suffix, which I am fixing now. In the meantime, you can manually change the barcodes in the SnapATAC object via x.sp@barcode.

2) I would love to share the supplementary materials with you. However, since our method has evolved significantly from the original bioRxiv version, it makes more sense to share a version once it is finished. The GitHub repository will also be updated in a week or two, so stay tuned.
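As an illustration of the suffix handling in point 1, here is a minimal Python sketch. The helper names `strip_gem_suffix` and `add_gem_suffix` are hypothetical and not part of SnapATAC or cellranger-atac. In R, the same idea would be applied to the barcode vector stored in x.sp@barcode.

```python
import re

def strip_gem_suffix(barcode):
    """Remove a trailing '-<digits>' suffix such as '-1', if present."""
    return re.sub(r"-\d+$", "", barcode)

def add_gem_suffix(barcode, suffix="-1"):
    """Append the suffix unless the barcode already carries one."""
    return barcode if re.search(r"-\d+$", barcode) else barcode + suffix

# Example: bring both barcode lists to the same convention before matching.
tenx = ["AAACGAAAGACTCGGA-1", "AAACGAAAGCGCAATG-1"]
print([strip_gem_suffix(bc) for bc in tenx])
```

Either direction works, as long as both barcode lists follow the same convention before they are compared.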

Best,

Rongxin Fang, Ren Lab Ludwig Cancer Research Bioinformatics Ph.D. Student University of California, San Diego


hamishking commented 5 years ago

Hi Rongxin,

This is more than the "-1" suffix problem. As Xin points out, 10X applies a barcode correction step, which means the barcode list from cellranger is very different from the one SnapATAC generates from the same dataset. I found this while trying to integrate cell type predictions from the new Seurat ATAC pipeline (which uses barcodes from the cellranger-atac generated filtered_peak_bc_matrix.h5 and singlecell.csv files) with my SnapATAC data: there were no barcode matches between the two at all!
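One way to check whether the mismatch goes beyond the suffix is to compare the two barcode lists both as-is and with any trailing "-N" removed. The sketch below is only an illustration. The function name `overlap_report` is hypothetical, and how the two lists are loaded (for example from singlecell.csv and from the SnapATAC object) is left out.

```python
def overlap_report(cellranger_barcodes, snap_barcodes):
    """Count barcode matches with and without the trailing '-N' suffix."""
    raw_a, raw_b = set(cellranger_barcodes), set(snap_barcodes)
    # Drop everything from the first '-' onward, e.g. 'ACGT-1' -> 'ACGT'.
    norm_a = {bc.split("-")[0] for bc in raw_a}
    norm_b = {bc.split("-")[0] for bc in raw_b}
    return {
        "cellranger_total": len(raw_a),
        "snap_total": len(raw_b),
        "exact_matches": len(raw_a & raw_b),
        "matches_ignoring_suffix": len(norm_a & norm_b),
    }

report = overlap_report(["AAAA-1", "CCCC-1", "GGGG-1"],
                        ["AAAA", "CCCC", "TTTT"])
print(report)
```

If `matches_ignoring_suffix` is still close to zero, the difference is not just the suffix.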

Any thoughts?

Cheers,

Hamish

hamishking commented 5 years ago

I'm just now trying to make the Snap file from the 10X possorted.bam file rather than from FASTQ files. I'll let you know how I go.

r3fang commented 5 years ago

Hi,

1) SnapATAC does not change the barcodes as 10x does.

2) In my experience, most of the barcodes still match the cellranger barcodes.

3) You can also create a snap file from the fragments.tsv.gz file.
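To see which barcodes the cellranger-atac fragments file carries, they can be tallied directly. This sketch assumes the usual tab-separated layout with the barcode in the fourth column (chromosome, start, end, barcode, count) and skips any header lines starting with "#". The function name is hypothetical, and this does not show how SnapATAC itself reads the file.

```python
import gzip
from collections import Counter

def count_fragments_per_barcode(path):
    """Count fragment records per barcode in a gzipped fragments file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            fields = line.rstrip("\n").split("\t")
            counts[fields[3]] += 1  # assumed: barcode is the 4th column
    return counts
```

The keys of the returned counter can then be compared against the barcodes in the SnapATAC object.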

hamishking commented 5 years ago

All my problems were solved when I started from the 10X possorted.bam file and followed the instructions in the FAQs: https://github.com/r3fang/SnapATAC/wiki/FAQs#cellranger_output Thanks again for a great package!