Detailed explanation of how to prep the 10X generated fastqs from multiple samples/wells for scATAC-pro.

jonasungerback commented 4 years ago

Hello,

thank you for a nice tool. It will be very valuable for us scATAC-seq beginners. I am just about to start an analysis of 10x scATAC data and I have a question about how to prepare the input.

In your example you merge the different lanes from one sample (if I understand this correctly):

$ cat atac_pbmc_10k_v1_S1_L001_R1_001.fastq.gz atac_pbmc_10k_v1_S1_L002_R1_001.fastq.gz > pe1_fastq
$ cat atac_pbmc_10k_v1_S1_L001_R3_001.fastq.gz atac_pbmc_10k_v1_S1_L002_R3_001.fastq.gz > pe2_fastq
$ cat atac_pbmc_10k_v1_S1_L001_R2_001.fastq.gz atac_pbmc_10k_v1_S1_L002_R2_001.fastq.gz > index_fastq

However, in my case, I have four samples (with four different barcodes) coming from 4 wells when generating the GEMs. Could you please provide an example/tutorial of how I initiate the pipeline using multiple samples?

Thanks in advance!

Jonas

wbaopaul commented 4 years ago

Do you already have the fastq files for each sample? If so, you can process the data sample by sample. Alternatively, if you want to pool the data from different samples, you can simply cat all R1 fastq files together (and likewise all R2 and all R3 files). Use the same sample order when you cat the files for R1, R2, and R3.

Wenbao

jonasungerback commented 4 years ago

Thank you! I expected that but wasn't sure. This is the first experiment on a NovaSeq, so I am not sure how many lanes I will have. Just for my full understanding, it will be something like this if there are two lanes and two samples:

cat sample1.L001.R1.fastq.gz sample1.L002.R1.fastq.gz sample2.L001.R1.fastq.gz sample2.L002.R1.fastq.gz > final.R1.fastq.gz

Then the same for R2 and R3, whereupon these three files are sent to

scATAC-pro -s process -i final.R1.fastq.gz,final.R3.fastq.gz,final.R2.fastq.gz -c configure_user.txt

Related question: Is it possible to merge multiple BAM files from multiple cellranger count runs and use the result as input to scATAC-pro -s call_peak? If so, would it be possible to add a tutorial for this, since I think it will be a common question? Much like r3fang has done for SnapATAC: https://github.com/r3fang/SnapATAC/wiki/FAQs#10X_snap

This is a little unrelated, of course, so if you feel it is appropriate we can open a new issue for that.

Best, Jonas

wbaopaul commented 4 years ago

I guess the way you pool data is correct.
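
A minimal sketch of the pooling confirmed above, assuming file names of the form sampleN.L00M.Rk.fastq.gz as in the example in this thread (the sample and lane names are placeholders). It first writes toy gzip files into a scratch directory so that it runs as-is; with real data, drop that part. Concatenated gzip files form a valid multi-member gzip stream, so nothing needs to be decompressed.

```shell
set -eu

# Work in a scratch directory so the toy files do not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

samples="sample1 sample2"   # placeholder sample names
lanes="L001 L002"           # placeholder lane names

# Toy input: one 4-line FASTQ record per sample/lane/read file.
for s in $samples; do
  for l in $lanes; do
    for r in R1 R2 R3; do
      printf '@%s.%s.%s\nACGT\n+\nIIII\n' "$s" "$l" "$r" | gzip -c > "$s.$l.$r.fastq.gz"
    done
  done
done

# Pool: the same sample/lane order is used for R1, R2 and R3,
# so record i in each pooled file belongs to the same read.
for r in R1 R2 R3; do
  : > "final.$r.fastq.gz"
  for s in $samples; do
    for l in $lanes; do
      cat "$s.$l.$r.fastq.gz" >> "final.$r.fastq.gz"
    done
  done
done

# Sanity check: list the read names of each pooled file (R1/R2/R3 suffix stripped).
for r in R1 R2 R3; do
  gzip -dc "final.$r.fastq.gz" | awk 'NR % 4 == 1' | sed "s/\.$r\$//" > "order.$r.txt"
done
```

If the three order.*.txt files differ (for example via cmp), the pooled files are out of sync and should not be passed to the pipeline.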

Thanks for your suggestion about merging existing BAM files and doing downstream analysis; it would be a very helpful function. I will work on it and provide a tutorial soon.

Thanks, Wenbao

wbaopaul commented 4 years ago

See https://github.com/wbaopaul/scATAC-pro/wiki/FAQs for handling 10x cellranger-atac style output. I added a new module, 'convert10xbam', that converts a cellranger-atac style BAM file to scATAC-pro style so that all follow-up modules can be used. To merge BAM files, you can simply try "samtools merge" first. For some modules, such as "call_peak" with macs2 as the peak caller, you can directly use the cellranger-atac style BAM file. Hope it helps.

Best, Wenbao

jonasungerback commented 4 years ago

Thank you! This is amazing and helps a lot. Just for clarification: if you have multiple BAM files and simply run samtools merge, is it important to sort the merged file in any way, or can it be used directly with scATAC-pro? I hope I will be able to try this out tomorrow or Tuesday.

Best, Jonas

jonasungerback commented 4 years ago

Hmm, this may be my lack of understanding, but if I concatenate multiple samples (either at the fastq or the BAM stage), won't the sample information be lost? I assume that the demultiplexing step adds the UMI to the read name, but would it also be possible to add sample-specific information, for instance I1 in the case of cellranger output, or a custom sample name? It would be neat to carry this information through to the clustering step and beyond so that the different samples can be highlighted in the clusters. Maybe this information is already there and I just don't understand how it is added to tag each sample.

Jonas

wbaopaul commented 4 years ago

You are right. If you merge different BAM files, you need to re-sort the merged BAM file (you can use samtools sort). The added module "convert10xbam" assumes a position-sorted BAM file.
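
The merge and re-sort steps described above could be scripted roughly as follows. This is only a sketch: merge_and_sort is a hypothetical helper, not part of scATAC-pro, and the BAM file names are placeholders. It uses only the standard samtools merge, sort, and index subcommands.

```shell
# Sketch only: merge_and_sort is a hypothetical helper, not part of scATAC-pro.
# Usage: merge_and_sort output_prefix sample1.bam sample2.bam [...]
merge_and_sort() {
  prefix=$1
  shift
  # Merge all input BAM files into one (-f overwrites an existing output file).
  samtools merge -f "$prefix.bam" "$@"
  # Re-sort by genomic position, since convert10xbam assumes a position-sorted BAM.
  samtools sort -o "$prefix.sorted.bam" "$prefix.bam"
  # Index the sorted BAM for tools that need random access.
  samtools index "$prefix.sorted.bam"
}

# Example call with placeholder file names; it runs only if samtools and the inputs exist.
if command -v samtools >/dev/null 2>&1 && [ -f sample1.bam ] && [ -f sample2.bam ]; then
  merge_and_sort merged sample1.bam sample2.bam
fi
```

Note that merging BAM files from separate runs does not by itself record which sample a read came from, which is the concern raised earlier in this thread.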

The demultiplexing module supports multiple index fastq files; the resulting read name embeds the different indexes, separated by "_". For example, in your case, you can add sample information to the read name like this:

scATAC-pro -s demplex_fastq -i PE1_fastq,PE2_fastq,UMIBarcode_fastq,sampleBarcode_fastq

The output read name will be something like: UMIBarcode_sampleBarcode:the_original_read_name.
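
To illustrate the read-name layout described above, here is a small sketch that splits a name of the form UMIBarcode_sampleBarcode:the_original_read_name back into its parts using POSIX parameter expansion. The read name and barcode sequences are made up for illustration. Because original Illumina read names usually contain ":" themselves, the split is made at the first ":" only.

```shell
# Made-up read name in the layout described above.
read_name="AAACGAAAGTTAGCGG_GTAACATG:A00123:45:HXXXXXXXX:1:1101:1000:2000"

prefix=${read_name%%:*}        # text before the first ":" (both indexes)
original=${read_name#*:}       # text after the first ":" (original read name)
cell_barcode=${prefix%%_*}     # index from the first index file given to -i
sample_barcode=${prefix#*_}    # index from the second index file given to -i

# Print the three parts as one tab-separated line.
printf '%s\t%s\t%s\n' "$cell_barcode" "$sample_barcode" "$original"
```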

jonasungerback commented 4 years ago

Ah, that is smart. I will try it out.

jonasungerback commented 4 years ago

This does not seem to be the behavior of demplex_fastq. If I do:

scATAC-pro -s demplex_fastq -i PE1_fastq,PE2_fastq,UMIBarcode_fastq,sampleBarcode_fastq

only the UMIBarcode goes into the read name, but if I do:

scATAC-pro -s demplex_fastq -i PE1_fastq,PE2_fastq,sampleBarcode_fastq,UMIBarcode_fastq

only the sample barcode goes in, so it looks to me like it takes the first three arguments and ignores the fourth. However, the sample identifier is in the index-sequence information in the fastq file. Can this be used later to separate the samples in the output matrix?

wbaopaul commented 4 years ago

I just fixed a bug in demultiplexing multiple index files and tested the fix on a smaller data set. The module currently adds the index files one by one, which is pretty slow (the next update will process them simultaneously).

Yes, the sample identifier is in the index-sequence fastq file. You can certainly extract it later, but that is not convenient.

AnjaliC4 commented 4 years ago

Hi,

Following up on this, I was also wondering about the order of the 10x Chromium 8 bp sample index and the 16 bp cell barcode in the read name header of the demultiplexed files produced by this command:

scATAC-pro -s demplex_fastq -i pe1.fq,pe2.fq,barcode.fq,sample_index.fq

The output generated from one sample was:

sampleindex_Barcode:the_original_read_name

and not:

Barcode_sampleindex:the_original_read_name

So, after peak and cell calling, the fragments.txt and other QC files contain, for example:

sampleindex_barcode chr1 start end

I am not sure whether the peaks were sorted according to cell barcodes, with the pipeline automatically accounting for the difference between sampleindex and barcode (separated by '_'), or whether the peaks were sorted according to sampleindex_barcode, which would give a different result, I guess?

Thank you so much!

wbaopaul commented 4 years ago

You may switch the order of barcode.fq and sample_index.fq to get Barcode_sampleindex:original_read_name. The pipeline does not automatically account for the difference between sampleindex and barcode. The purpose of keeping both is to let users extract or separate sample information from the fragments.txt file. The peaks were identified using the aggregated reads from all samples, and peaks are not sorted by barcode.
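
As a rough illustration of separating sample information from fragments.txt, the sketch below splits a toy fragments file into one file per sample index. It assumes a tab-separated file whose first column holds sampleindex_barcode, following the example line quoted in this thread; the column number, file names, and sequences are assumptions, so adjust col to your actual file.

```shell
set -eu

# Work in a scratch directory with a toy, tab-separated fragments file.
workdir=$(mktemp -d)
cd "$workdir"
printf 'GTAACATG_AAACGAAAGTTAGCGG\tchr1\t100\t250\n'  > fragments.txt
printf 'CCTTGGAA_AAACGAAAGTTAGCGG\tchr1\t300\t420\n' >> fragments.txt
printf 'GTAACATG_TTTGGCCAGACTAGAT\tchr2\t50\t180\n'  >> fragments.txt

col=1   # assumed column holding sampleindex_barcode; adjust to your file

# Write each line to fragments.<sampleindex>.txt, keyed on the text before "_".
awk -F'\t' -v c="$col" '{
  split($c, parts, "_")
  out = "fragments." parts[1] ".txt"
  print > out
}' fragments.txt

# Show how many fragments each sample received.
wc -l fragments.*.txt
```

The toy data deliberately reuses one cell barcode under two sample indexes, which is why the sample prefix has to be kept when samples are pooled. If the index files were given in the other order (Barcode_sampleindex), key on parts[2] instead.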

AnjaliC4 commented 4 years ago

Thanks!

wbaopaul / scATAC-pro

Detailed explanation of how to prep the 10X generated fastqs from multiple samples/wells for scATAC-pro. #1