haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

[BUG] Support for combinatorial barcode indexing (like SHARE) not present #156

Open emattei opened 6 months ago

emattei commented 6 months ago

Hi, I am interested in using chromap to process SHARE-seq data. The barcodes come from three rounds of splitting and pooling. These three barcodes (8 bp each) should be corrected individually, allowing 1 mismatch each, but chromap requires passing a list of 7M barcodes, that is, the Cartesian product R1 × R2 × R3. This is not correct, because it allows 1 mismatch in 24 bp instead of 1 mismatch in each round's 8 bp barcode.

I see that the README states "This option also supports combinatorial barcoding, such as SHARE-seq." Is combinatorial barcoding really supported, and it is just not documented how to pass the whitelist in this case?

Thank you
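To make the difference between the two correction schemes concrete, here is a small counting sketch. It only does arithmetic on the numbers given above (three rounds, 8 bp each, 192 barcodes per round); it assumes a four-letter alphabet (ignoring N) and does not call chromap. The function name is hypothetical.

```python
# Count how many observed sequences each scheme would accept for ONE true barcode.
SEGMENT_LEN = 8    # bp per barcoding round (from the issue)
ROUNDS = 3         # number of split-and-pool rounds (from the issue)
ALPHABET = 4       # assumption: A, C, G, T only, N is ignored

def within_one_mismatch(length: int) -> int:
    """Number of sequences at Hamming distance <= 1 from a fixed sequence."""
    return 1 + length * (ALPHABET - 1)

# One mismatch allowed anywhere in the concatenated 24 bp barcode.
concatenated = within_one_mismatch(SEGMENT_LEN * ROUNDS)

# One mismatch allowed independently in each 8 bp segment.
per_segment = within_one_mismatch(SEGMENT_LEN) ** ROUNDS

# Size of the Cartesian-product whitelist, assuming the same 192 barcodes per round.
whitelist_size = 192 ** ROUNDS

print(concatenated)    # 73
print(per_segment)     # 15625
print(whitelist_size)  # 7077888
```

Under these assumptions, per-segment correction accepts far more observed sequences per true barcode than a single mismatch over the concatenated 24 bp.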

mourisl commented 6 months ago

Do you have a whitelist for the 8 bp barcodes? Such a whitelist would have at most 4^8 = 65536 entries, so conflicts between barcodes might arise easily. I think the best approach currently is to run Chromap without a whitelist and correct the barcodes afterwards, either by collecting the real barcodes based on abundance or by filtering out barcodes with too few reads.
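The abundance-based filtering suggested here could look roughly like the following. This is only a sketch: it assumes the uncorrected barcodes have already been extracted into a Python list of strings, one per read, and the function name and the `min_reads` threshold are hypothetical rather than part of Chromap.

```python
from collections import Counter

def filter_by_abundance(barcodes, min_reads):
    """Keep barcodes observed at least `min_reads` times.

    `barcodes` is an iterable with one barcode string per read.
    Returns a dict mapping each kept barcode to its read count.
    """
    counts = Counter(barcodes)
    return {bc: n for bc, n in counts.items() if n >= min_reads}

# Toy input: two well-supported barcodes and one likely sequencing error.
reads = ["AAAAAAAA"] * 5 + ["CCCTTTGG"] * 4 + ["AAAAAAAT"]
kept = filter_by_abundance(reads, min_reads=2)
print(sorted(kept))
```

A suitable threshold depends on sequencing depth and is not something this thread specifies.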

emattei commented 6 months ago

Yes, I have a barcode whitelist, and it is 192 barcodes long. There are no conflicts, and all the barcodes are a Hamming distance of 3 away from each other. The problem is that if I pass a read format like "bc:65:72,bc:103:110,bc:141:148,r1:0:-1,r2:0:49", which specifies the three barcodes, chromap expects the whitelist to contain 24 bp barcodes and corrects using a distance of 1 or 2, which is not the correct way. Each 8 bp barcode should be corrected independently against the 192 possibilities.

It seems like you are confirming that chromap doesn't support correction for combinatorial barcoding. Is that a correct statement?

mourisl commented 6 months ago

Right. The current version of Chromap concatenates the barcode segments first and then performs error correction. We can add support for segment-wise error correction in a future version.

I think allowing 1 correction across the 24 bp can still resolve most of the barcode sequencing errors.
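Until segment-wise correction is available, it could be approximated as a post-processing step on uncorrected barcodes. The sketch below is not Chromap code: it assumes the concatenated 24 bp barcode is available as a string, that all three rounds share the same whitelist, and that a segment is only corrected when exactly one whitelist entry lies within the allowed distance. All function names are hypothetical.

```python
from typing import Iterable, Optional

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def correct_segment(seg: str, whitelist: Iterable[str],
                    max_mismatch: int = 1) -> Optional[str]:
    """Return the unique whitelist entry within `max_mismatch` of `seg`, else None."""
    hits = [w for w in whitelist
            if len(w) == len(seg) and hamming(seg, w) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None

def correct_barcode(barcode: str, whitelist: Iterable[str],
                    seg_len: int = 8) -> Optional[str]:
    """Split a concatenated barcode into segments and correct each independently."""
    whitelist = list(whitelist)  # allow repeated iteration
    segments = [barcode[i:i + seg_len] for i in range(0, len(barcode), seg_len)]
    corrected = [correct_segment(s, whitelist) for s in segments]
    if any(c is None for c in corrected):
        return None  # at least one segment could not be resolved
    return "".join(corrected)

# Toy whitelist (the real one in this thread has 192 entries).
toy_whitelist = ["AAAAAAAA", "CCCTTTGG", "GGGCCCAA"]

# One mismatch in each segment: three errors in 24 bp, still correctable per segment.
fixed = correct_barcode("AAAAAAAT" + "CCCTTTGC" + "GGGCCCAT", toy_whitelist)
print(fixed)

# Two mismatches within a single segment: rejected.
rejected = correct_barcode("AAAAAATT" + "CCCTTTGG" + "GGGCCCAA", toy_whitelist)
print(rejected)
```

The linear scan over the whitelist is fine for 192 entries; a larger whitelist would likely call for a precomputed lookup of one-mismatch neighbours.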