haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
188 stars 20 forks source link

low map rate for 10X scATAC-seq #15

Open badoi opened 3 years ago

badoi commented 3 years ago

I'm aligning 10x genomics scATAC-seq to the rheMac10 genome. The sequencing is 50bp for R1 and R2 paired-end sequencing on the Novaseq. The genome was built with k17 and w7 for min frag 30. But the alignment rate is 7% which is too low. For 10x scATAC-seq datasets, how was the genome built to better map fragments as low as 20 or 30 bp after index trimming? Are there any other advice to work with this short of sequencing?

In a different dataset where the sequencing strategy was 100bp for R1 and R2, the same macaque index aligned 91% of the reads.

Mapped all reads in 55348.73s.
Number of reads: 1132465712.
Number of mapped reads: 77938628.
Number of uniquely mapped reads: 72568652.
Number of reads have multi-mappings: 5369976.
Number of candidates: 781477061.
Number of mappings: 77938628.
Number of uni-mappings: 72568652.
Number of multi-mappings: 5369976.
Number of barcodes in whitelist: 2890.
Number of corrected barcodes: 41349117.
Sorted, deduped and outputed mappings in 17.12s.

chromap_1741011_1_out.txt

haowenz commented 3 years ago

Can you share the command line you used to run Chromap? It seems that your barcode whitelist file does not match the barcodes in your data. So most of the reads are dropped since their barcodes are not on the whitelist.

badoi commented 3 years ago

The command I used was the following. I'll check but I remember that 10x only had 1 ATAC barcode list, which is the list I provided. The issue first is that only 77938628/ 1132465712 reads mapped, which is ~6% of reads.

BARCODE_LIST=/home/bnphan/src/cellranger-atac-1.2.0/cellranger-atac-cs/1.2.0/lib/python/barcodes/737K-cratac-v1.txt
GENOME_FASTA=/home/bnphan/resources/genomes/rheMac10/rheMac10.fa
CHROMAP_IDX=/home/bnphan/resources/genomes/chromap/rheMac10/index
ALIGN_BED=${SAMPLE}.aln.bed

~/src/chromap-0.1_x64-linux/chromap --preset atac \
-x $CHROMAP_IDX -r ${GENOME_FASTA} -t 8 \
--remove-pcr-duplicates-at-cell-level \
--bc-error-threshold 2 \
-1 ${FQ1} -2 ${FQ2} -b ${BC1} -o ${ALIGN_BED} \
--barcode-whitelist ${BARCODE_LIST}
haowenz commented 3 years ago

Chromap will check if the barcode is in the whitelist, if not then it tries to correct the barcode. And if the barcode is still not in the whitelist, it will skip mapping the reads. Can you try a run with the parameters you used and also with "--output-mappings-not-in-whitelist" to see if the mapping rate is still low?

Moreover, can I know the 10X assay you used? Single Cell ATAC or Single Cell Multiome? They have different barcode whitelists. Also the time looks weird. Is it possible for you to share 1M read pairs so we may have a look?

badoi commented 3 years ago

Thanks I'll run with that parameter. I tried aligning over the weekend with K=15 and W=7 and got a lower map rate, so I'll try to go for higher K' and W' indices then with your parameter to see if the barcode whitelist is the issue.

Our collaborators used the scATAC protocol. They haven't tried the Multiome yet.

chromap_1741095_1_out.txt

mourisl commented 3 years ago

Based on the two lines: Number of barcodes in whitelist: 2890. Number of corrected barcodes: 41349117.

It seems only a tiny fraction of reads' barcodes are in the whitelist. Could you please check whether the barcode used in the scATAC-seq data and the barcode whitelist are matched? Thank you.

badoi commented 3 years ago

Hi all, I reached out to 10x and they advised that some sequencers provide the reverse complement of the barcodes, so I made a rev-comp whitelist. I get more hits w/ that list, and rerunning now w/ an index built with K20 W7. That should fix both problems of low alignment rate and not enough matches w/ the whitelist. I'll confirm this is indeed correct once the run finishes.

For others who find it useful, the unix command to make a rev comp of the whitelist:

cat 737K-cratac-v1.txt | tr ACGT TGCA | rev | sort > 737K-cratac-v1_revcomp.txt
haowenz commented 3 years ago

Please also try a run with index built with "-k 17 -w 7", which is default. It may give you the better results. Increasing k seems not necessary to me.

badoi commented 3 years ago

The initial runs were with an index built w/ K17 W7, which produced very low map rates.

haowenz commented 3 years ago

Did you try "--output-mappings-not-in-whitelist" with "k17 w7" and get low mapping rate? Or you can just remove "-b" and run your data as bulk ATAC-seq data and see if the reads map.

It seems to me that you got you low mapping rate in the initial run since your barcode whitelist did not have the reverse complement of the barcodes. Those reads are dropped without even trying the mapping procedure. In this case you will always get low mapping rate, and the value of k and w have no effect on the results. When you use the right barcode whitelist file, the default parameters are supposed to work. Otherwise, there might be other problems.

You can try larger k if you want. But make sure you use an odd value for k (e.g, 19, 21).

haowenz commented 3 years ago

@mourisl I think we should add a new mapping stats saying the number of reads skipped since their barcodes are not on the whitelist. This would help the users and us to identify the problem in the data. What do you think?

mourisl commented 3 years ago

Yes. We should be more clear that Chromap will skip the reads whose barcodes are not in the whitelist, at least in the README. I think the line "Number of barcodes in whitelist: 2890. Number of corrected barcodes: 41349117." serves the purpose, but the user might not know that the other reads are filtered by default.

badoi commented 3 years ago

Now that you mentioned all of that, I think I get the error now. I aligned w/ the default index k17 w7, with the reverse complement barcode list and looks like it's working well. Thanks for your feedback, and I agree that a bit clearer mapping statistics could help warn about this quirk in the future.

Number of reads: 1096959080.
Number of mapped reads: 992840174.
Number of uniquely mapped reads: 917016382.
Number of reads have multi-mappings: 75823792.
Number of candidates: 9525419205.
Number of mappings: 992840174.
Number of uni-mappings: 917016382.
Number of multi-mappings: 75823792.
Number of barcodes in whitelist: 482067912.
Number of corrected barcodes: 44422687.
Sorted, deduped and outputed mappings in 346.11s.
# uni-mappings: 213743382, # multi-mappings: 28074758, total: 241818140.
Number of output mappings (passed filters): 207028528
Total time: 32917.30s.
dannyconrad commented 2 years ago

Thank you @badoi for posting the reverse-complement tip. I was having a similar map rate issue and using the revcomp whitelist fixed it for me 👍

Just as a suggestion to the authors/developers, this would be a good thing to make a note of somewhere in the README (unless you have and I missed it), since this will be a problem for anyone aligning scATACseq reads sequenced via the reverse complement workflow (including NovaSeq w/ v1.5 reagents, HiSeq 4000, NextSeq, etc.).

Additionally, to ease some of the confusion around how mapping rates are achieved, it might make sense to order the final alignment statistics in the order that they are prioritized for filtering. Without the above thread, I also would not have known that only reads with a confirmed cell barcode are used for mapping (though of course, that makes a lot of sense for efficiency), since that stat is one of the last reported.

haowenz commented 2 years ago

Thank you so much for the suggestions! They make a lot of sense. Li and I will take care of this.

dannyconrad commented 2 years ago

I should also add that I am blown away with how fast Chromap is, by the way. I couldn't believe my eyes when you reported 16X speed enhancement over cellranger-atac, yet here I am aligning 5 different single-cell libraries in a single workday. Excellent work :)

mourisl commented 2 years ago

Thank you for the suggestions and the information on the reverse-complement barcodes. We just added a check of whether the barcodes from the data are too different from the whitelist. If they are too different, Chromap will stop early and warn the user about the whitelist issue. So you don't need to wait until the end to find that the whitelist needs to be modified. Hope this will be helpful.

dannyconrad commented 2 years ago

Excellent! That's a much more elegant solution, since it should also be useful in cases where users mistakenly use the wrong barcode whitelist. Thanks so much for addressing that so quickly.

bli25 commented 2 years ago

@mourisl @haowenz , does it make sense to add another option (e.g. '--barcode-reversion') to reverse complement barcode FASTQ? As @dannyconrad mentioned, Illumina's new machines (e.g. NextSeq, NovaSeq) implement the reverse complement workflow, in which case index2 (barcode read for 10x) will be reverse complemented. I expect that having reverse complement of the barcode will be a very common use case. Reversing the white list is not an elegant solution and may introduce problems for 10x multiome (how to translate reverse complemented atac barcodes to RNA barcodes?).

Please let me know what you think.

Best, Bo

mourisl commented 2 years ago

I am planning to add the strand information to the option "--read-format" so Chromap will reverse the barcode or sequence when reading in the data. It should be done in the next a few days. Thanks for the suggestions!

bli25 commented 2 years ago

@mourisl, awesome, thanks a lot for keep developing this great software!

mourisl commented 2 years ago

We have extended the fields in the read-format option, and you can set the option like "--read-format bc:0:-1:-" to specify the reverse-complement barcode now.

bli25 commented 2 years ago

@mourisl ,

I detected one bug related to this new feature. When I set '--read-format bc:0:-1:-', I still got the following error message:

Less than 5% barcodes can be found or corrected based on the barcode whitelist. Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false positive warning, please run Chromap with the option --skip-barcode-check.

mourisl commented 2 years ago

Thanks for the testing! I will check the implementation.

mourisl commented 2 years ago

@bli25 Hi Bo, yes, indeed there is a bug at the checking stage. I've fixed that in the "custom_readformat" branch, could you please give it a try? Thank you.