haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Repetitive or low-quality barcode sequences in scATAC data #161

Open jeremymsimon opened 5 months ago

jeremymsimon commented 5 months ago

Hi @haowenz, I tend not to force use of the 10X barcode include-list, since barcodes outside it could still carry valuable information or represent real cells. However, I noticed that in some scATAC-seq data processed with chromap, some "cells" pass all QC and even escape doublet discrimination (!), yet their barcode sequence is something like GGGGGGGGGGGGTGGG or a similarly G-rich, low-complexity sequence. This isn't a chromap bug per se, and it can obviously be fixed by requiring the identified cells to be in the include-list, but I wonder whether barcodes like this could be flagged when they have exceedingly low entropy. There were about 200 such "cells" out of ~70,000 in the dataset I'm currently working with, so they are rare, but frequent enough that they formed their own cluster in my data. Curious to hear your thoughts! Thanks!

mourisl commented 5 months ago

I think this issue is better handled during downstream analysis. As you said, these barcodes are easy to identify, but the filtering cutoff (such as an entropy threshold) may need tuning. It would be more efficient to find appropriate cutoffs in the dataframe than to re-run Chromap.
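
For anyone who wants to try this downstream, here is a minimal sketch of an entropy-based filter in plain Python. It is not part of chromap. The function names `barcode_entropy` and `flag_low_complexity` and the default cutoff of 1.0 bit are assumptions for illustration, and the cutoff in particular would need tuning per dataset.

```python
import math
from collections import Counter


def barcode_entropy(barcode: str) -> float:
    """Shannon entropy (in bits) of the base composition of a barcode."""
    n = len(barcode)
    if n == 0:
        return 0.0
    counts = Counter(barcode.upper())
    # Sum -p * log2(p) over the bases actually present in the barcode.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def flag_low_complexity(barcodes, min_entropy=1.0):
    """Return barcodes whose entropy falls below min_entropy.

    min_entropy=1.0 is a hypothetical starting point, not a recommended value.
    """
    return [bc for bc in barcodes if barcode_entropy(bc) < min_entropy]


if __name__ == "__main__":
    example = ["GGGGGGGGGGGGTGGG", "ACGTACGTACGTACGT"]
    for bc in example:
        print(bc, round(barcode_entropy(bc), 3))
    print(flag_low_complexity(example))
```

Single-base entropy ranges from 0 bits (a homopolymer) to 2 bits (all four bases equally frequent), so the G-rich barcode from this issue scores far below a balanced one. It only looks at base composition, so a repeat such as ACACACAC still scores 1 bit; catching those would need a k-mer based measure instead.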