Repetitive or low-quality barcode sequences in scATAC data

Hi @haowenz, I tend not to force use of the 10X barcode include-list, since barcodes outside it could still carry valuable information or correspond to real cells. However, I noticed that in some scATAC-seq data processed with chromap, a few "cells" pass all QC and escape doublet discrimination (!), yet their barcode sequence is something like GGGGGGGGGGGGTGGG or a similarly G-rich, low-complexity sequence. This isn't a chromap bug per se, and it can obviously be fixed by requiring the identified cells to be on the include-list, but I do wonder whether barcodes like this could be flagged when they have exceedingly low entropy. There were about 200 such "cells" out of ~70,000 in the dataset I'm currently working with, so they are rare, but frequent enough that they formed their own cluster in my data. Curious to hear your thoughts! Thanks!
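
To make the entropy idea concrete, here is a minimal sketch of the kind of post-hoc check I have in mind. It is not anything chromap does today: the function names (`barcode_entropy`, `flag_low_complexity`) and the `min_entropy=1.0` cutoff are hypothetical and picked only for illustration, and it uses plain Shannon entropy of the base composition, so the threshold would need tuning on real data.

```python
import math
from collections import Counter


def barcode_entropy(barcode: str) -> float:
    """Shannon entropy (bits per base) of a barcode's base composition.

    0.0 means a single repeated base; 2.0 means A/C/G/T in equal proportion.
    """
    if not barcode:
        return 0.0
    counts = Counter(barcode.upper())
    n = len(barcode)
    # Sum p * log2(1/p) over the bases that actually occur in the barcode.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def flag_low_complexity(barcodes, min_entropy=1.0):
    """Return the barcodes whose entropy falls below min_entropy.

    min_entropy is an assumed, illustrative cutoff, not a validated default.
    """
    return [bc for bc in barcodes if barcode_entropy(bc) < min_entropy]


if __name__ == "__main__":
    # The first barcode is the example from this issue; the second is made up.
    example = ["GGGGGGGGGGGGTGGG", "ACGTTGCAACGTAGCT"]
    for bc in example:
        print(bc, round(barcode_entropy(bc), 3))
    print("flagged:", flag_low_complexity(example))
```

One caveat: because this only looks at single-base composition, a dinucleotide repeat such as ACACACACACACACAC scores exactly 1.0 bit and would slip past a strict `< 1.0` cutoff, so a k-mer based measure might be a better fit if those also show up.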

haowenz / chromap

Repetitive or low-quality barcode sequences in scATAC data #161