haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Failure to load cellular barcodes containing Ns #166

Open ste-depo opened 1 week ago

ste-depo commented 1 week ago

Hi everybody,

Running chromap in the scATAC modality with the command:

chromap -q 0 -t 8 --min-read-length 20 --preset atac --SAM -x ${index_file} \
   -r ${fasta_file} -1 $r1 -2 $r2 -b $rb --barcode-whitelist whitelist.txt \
   -o ${prefix}.sam 2> ${prefix}.log

the software aborts with an assertion failure related to the barcodes contained in the whitelist.txt file:

chromap: src/chromap.cc:359: void chromap::Chromap::LoadBarcodeWhitelist(): Assertion `khash_return_code != -1 && khash_return_code != 0' failed.

As far as I understand, the failure occurs because some of the barcodes contain letters other than A, C, G and T, such as N. I say this because removing those barcodes from the whitelist solves the issue.

Is this intended?

In fact, those barcodes are associated with a non-negligible number of reads!

Best,

Stefano

mourisl commented 1 week ago

Do you mean that the whitelist allows wildcards represented by N?

ste-depo commented 1 week ago

It seems so. The whitelist was generated by UMI-tools (version: 1.1.1), using the command:

umi_tools whitelist --method=reads --extract-method=string --bc-pattern=CCCCCCCCCCCCCCCC -I ${SAMPLE}_R2.fastq.gz -S ${SAMPLE}_whitelist.tsv --plot-prefix=${SAMPLE} --set-cell-number=${n_cells} --subset-reads=10000000000

This is one of the whitelist entries containing Ns:

NAAAGTAGACTTAGTG NAAAGTACACTTAGTG,NAAAGTAGAATTAGTG,NAAAGTAGACATAGTG,NAAAGTAGACCTAGTG,NAAAGTAGACTCAGTG,NAAAGTAGACTGAGTG,NAAAGTAGACTTACTG,NAAAGTAGACTTAGTC,NAAAGTAGACTTAGTT,NAAAGTAGACTTATTG,NAAAGTAGACTTCGTG,NAAAGTAGACTTTGTG,NAAAGTAGAGTTAGTG,NAAAGTAGTCTTAGTG,NAAAGTCGACTTAGTG,NAAAGTGGACTTAGTG,NAAAGTTGACTTAGTG,NAAATTAGACTTAGTG,NAAGGTAGACTTAGTG,NGAAGTAGACTTAGTG,NNAAGTAGACTTAGTG,NTAAGTAGACTTAGTG 1968 1,5,1,1,2,1,1,2,5,1,1,2,1,1,1,2,1,1,1,4,1,1

mourisl commented 1 week ago

I think umi_tools infers the barcode whitelist from the reads, which may contain Ns (not sure about this). You can filter the barcodes containing Ns out of your whitelist, and Chromap will try to fix those Ns in the reads.
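The filtering suggested above could be sketched as follows. This is only an illustration, not part of Chromap or UMI-tools: `filter_whitelist` and `write_filtered` are hypothetical helper names, and the sketch assumes that the barcode is the first whitespace-separated field on each line (as in the entry shown earlier in the thread) and that the output should contain one barcode per line. It also drops duplicate barcodes, on the assumption that they are not wanted in the whitelist either.

```python
import re
from typing import Iterable, Iterator

# Assumption: only barcodes made entirely of A, C, G, T should be kept.
VALID_BARCODE = re.compile(r"[ACGT]+")


def filter_whitelist(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct barcode that contains only A/C/G/T.

    The barcode is taken to be the first whitespace-separated field of
    each line; any remaining columns are ignored.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        barcode = fields[0]
        if barcode in seen:
            continue  # skip duplicates
        if VALID_BARCODE.fullmatch(barcode):
            seen.add(barcode)
            yield barcode


def write_filtered(src_path: str, dst_path: str) -> int:
    """Write the filtered barcodes to dst_path, one per line; return the count."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for barcode in filter_whitelist(src):
            dst.write(barcode + "\n")
            kept += 1
    return kept
```

For example, `write_filtered("sample_whitelist.tsv", "whitelist.txt")` (both file names are placeholders) would produce a file that can then be passed to `--barcode-whitelist`. Comparing the returned count with the number of lines in the input shows how many entries were dropped.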