haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
192 stars 21 forks source link

Barcode not in translation table #117

Closed jeremymsimon closed 1 year ago

jeremymsimon commented 2 years ago

Hi, I have a 10x multiomics (RNA+ATAC) experiment, and am trying to use chromap for the ATAC portion and associate it with RNA analyzed by alevin-fry. My ATAC files are in the format with R1, R2, I1, and I2, and I made a barcode translation table to map the ATAC barcodes to the matched RNA sequences. Unfortunately chromap is failing with the error Barcode does not exist in the translation table and I can't quite seem to figure out the right configuration.

My barcode translation table was made as follows:

paste -d'\t' <(zcat /path/to/cellranger-arc/2.0.1/cellranger-arc-2.0.1/lib/python/cellranger/barcodes/737K-arc-v1.txt.gz) <(zcat /path/to/cellranger-arc/2.0.1/cellranger-arc-2.0.1/lib/python/atac/barcodes/737K-arc-v1.txt.gz) > 737K-arc_RNA_ATAC_barcodeTranslation.tsv 

My chromap execution command was then:

chromap \
    -t 4 \
    --preset atac \
    -x hg38.chromap.index \
    -r hg38.fa \
    -1 *WT*R1* \
    -2 *WT*R2* \
    -o WT_ATAC_chromap_fragments.tsv \
    -b *WT*I2* \
    --barcode-translate 737K-arc_RNA_ATAC_barcodeTranslation.tsv

I thought that perhaps there was a reverse-complement issue, so I also tried the same command with

--read-format bc:0:-1:-

added but that resulted in the same error.

Looking at my I2 file:

$ zcat *WT*I2* | head
@VH00562:10:AACF7JYM5:1:1101:62086:1000:rCAGACGCGAACGGACCTACCTGAT 2:N:0:CGACATAG
CAGACGCGAACGGACCTACCTGAT
+
CCCCCCCCCCCCCCCCCCCCCCCC
@VH00562:10:AACF7JYM5:1:1101:62199:1000:rCAGACGCGAGGTTACACAAGCGTT 2:N:0:ATGGTCGC
CAGACGCGAGGTTACACAAGCGTT
+
-CCCCCCC;CCC;-CCCC;CCCCC
@VH00562:10:AACF7JYM5:1:1101:62654:1000:rCAGACGCGTTACCTCCTGAGGTAC 2:N:0:ATGGTCGC
CAGACGCGTTACCTCCTGAGGTAC

And now looking at the translation table:

$ head 737K-arc_RNA_ATAC_barcodeTranslation.tsv
AAACAGCCAAACAACA    ACAGCGGGTGTGTTAC
AAACAGCCAAACATAG    ACAGCGGGTTGTTCTT
AAACAGCCAAACCCTA    ACAGCGGGTAACAGGC
AAACAGCCAAACCTAT    ACAGCGGGTGCGCGAA
AAACAGCCAAACCTTG    ACAGCGGGTCCTCCAT
AAACAGCCAAACGCGA    ACAGCGGGTCATGGTT
AAACAGCCAAACGGGC    ACAGCGGGTAGGTGAC
AAACAGCCAAACTAAG    ACAGCGGGTTACCCAG
AAACAGCCAAACTCAT    ACAGCGGGTCAGTGCC
AAACAGCCAAACTGCC    ACAGCGGGTGTTGCAT

It's clear there is a length mismatch (translation table barcodes = 16nt, I2 barcodes = 24nt), but given the I1/I2 architecture and not having an R3 file, I'm not exactly sure how to proceed, and the temp files output by chromap are not human-readable to tell what it's looking at.

Any insight or guidance here would be much appreciated!

Thanks!

mourisl commented 2 years ago

Could you please add the option "--read-format bc:0:15" to use the first 16bp from the barcode file sequence as the barcode? Please let me know whether this works.

jeremymsimon commented 2 years ago

Thanks @mourisl, I will try that. Do I need that in addition to --read-format bc:0:-1:- or instead of?

mourisl commented 2 years ago

Is the barcode reverse complemented?

jeremymsimon commented 2 years ago

Yes, I believe so. What would be the proper syntax in that case? And does that mean I need the last 15nt or does the reverse complement happen first?

mourisl commented 2 years ago

It could be the last 16nt. You can try --read-format bc:0:15:- or --read-format bc:8:-1:- to see which one works. Or you can manually inspect some of the barcodes in the I2 file to decide.

jeremymsimon commented 2 years ago

Hmm neither of those seemed to work, same error. The first 8nt of the I2 file are almost all the same (see original post snippet), and the final 16nt do need to be reverse complemented. What is the order of events here? Does it reverse complement and then cut down to the bp range specified, or does it cut and then reverse complement?

mourisl commented 2 years ago

In Chromap, it extracts first and then takes the reverse complement. Could you please manually cut+reverse complement some of the barcodes to check whether they show up in the whitelist? Thank you.

jeremymsimon commented 2 years ago

Yes, if I extract the final 16nt, then reverse complement, I see those matching barcodes in the second column of my translation table. Without reverse complementing, there is no match. See below:

$ zcat *WT*I2* | head -n 100000 | awk 'NR%4==2' | cut -c 9- | head
AACGGACCTACCTGAT
AGGTTACACAAGCGTT
TTACCTCCTGAGGTAC
TGGGCAATGGCAATTT
GTGACCTTGCACCTTC
TCTGTGATGTTCTTAC
GACCAAGCTGGTTAGG
GCAACTAACCTGGTCC
CATCCTGGATGGCTAA
CCGATTATGCTACTAG
$ grep AACGGACCTACCTGAT 737K-arc_RNA_ATAC_barcodeTranslation.tsv
$ grep AGGTTACACAAGCGTT 737K-arc_RNA_ATAC_barcodeTranslation.tsv 
$ grep TTACCTCCTGAGGTAC 737K-arc_RNA_ATAC_barcodeTranslation.tsv 

$ zcat *WT*I2* | head -n 100000 | awk 'NR%4==2' | cut -c 9- | tr ACGT TGCA | rev | head
ATCAGGTAGGTCCGTT
AACGCTTGTGTAACCT
GTACCTCAGGAGGTAA
AAATTGCCATTGCCCA
GAAGGTGCAAGGTCAC
GTAAGAACATCACAGA
CCTAACCAGCTTGGTC
GGACCAGGTTAGTTGC
TTAGCCATCCAGGATG
CTAGTAGCATAATCGG
$ grep ATCAGGTAGGTCCGTT 737K-arc_RNA_ATAC_barcodeTranslation.tsv 
CATTATCTCCTAGTTT    ATCAGGTAGGTCCGTT
$ grep AACGCTTGTGTAACCT 737K-arc_RNA_ATAC_barcodeTranslation.tsv 
AGGTTAACATCAATCG    AACGCTTGTGTAACCT
$ grep GTACCTCAGGAGGTAA 737K-arc_RNA_ATAC_barcodeTranslation.tsv 
CGGTTTCTCATTGTTC    GTACCTCAGGAGGTAA

So by that logic, I should be able to use --read-format bc:8:-1:- but it fails with the same error as above

Is there something else I'm missing here?

mourisl commented 2 years ago

Could you please run chromap without --barcode-translate, and check whether the extracted barcodes make sense?

jeremymsimon commented 2 years ago

They do, they are mostly exact matches to the translation table. There are some exceptions, however, where a resulting sequence does not have a match in the translation table and perhaps this is where chromap is getting stuck. Do the parameters --bc-error-threshold --bc-probability-threshold also apply to the translation table or is that only for a supplied inclusion list? I'll try a run now with --barcode-translate and --output-mappings-not-in-whitelist specified

jeremymsimon commented 2 years ago

Update: it seems this ignores --output-mappings-not-in-whitelist. It still fails with the same error

haowenz commented 2 years ago

Is this dataset publicly available? Or are you able to provide a small sample for us to have a look?

jeremymsimon commented 2 years ago

Here are the first 1M lines of each of the FASTQs, and the barcode translation table I'm using WT_head_R1.fastq.gz WT_head_R2.fastq.gz WT_head_I2.fastq.gz RNA_ATAC_barcodeTranslation.tsv.gz

To reproduce, you should be able to run with

chromap \
    -t 4 \
    --preset atac \
    -x hg38.chromap.index \
    -r hg38.fa \
    --read-format bc:8:-1:- \
    -1 WT*R1* \
    -2 WT*R2* \
    -o WT_head_ATAC_chromap_fragments.tsv \
    -b WT*I2* \
    --barcode-translate RNA_ATAC_barcodeTranslation.tsv

With this small-scale test, you'll see that some fragments do get written, but only 21 lines before chromap reports the error and dies with the same error as above.

swiftgenomics commented 2 years ago

Sorry for the delay. I tried your test data and Chromap finished without any errors. So which version of Chromap were you using? Can you run "chromap -v" and check? If you were not using the latest version, please try the latest version and see if that resolve your issue.

jeremymsimon commented 2 years ago

Thanks for looking into it. I was running 0.2.3-r424, but now updated to 0.2.3-r452 and I got the same error. Did you run the exact same code I have above or did you change a parameter setting? Or is there something that could be influenced by a dependency somewhere?

swiftgenomics commented 2 years ago

0.2.3-r452 is not a fix for your problem. I tested with the data and the identical command line you provided. What is your operating system? It looks like you compiled and ran Chromap right? What's your compiler and its version?

jeremymsimon commented 2 years ago

This is on our compute cluster, which is red hat linux v8.5. By default I have the following loaded, but they can be changed however we need:

$ gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)

$ cc --version
cc (GCC) 9.1.0

$ cmake --version
cmake version 3.20.2

$ make --version
GNU Make 4.2.1

Let me know if that helps or if you're looking for something else

swiftgenomics commented 2 years ago

This is weird. Did you run the same command line with the test data you provided? And the error message is "Barcode does not exist in the translation table"?

jeremymsimon commented 2 years ago

Yes, same command, and same error. Here's the complete log:

Preset parameters for ATAC-seq/scATAC-seq are used.
Start to map reads.
WARNING: there are input barcode files but a barcode whitelist file is missing!
Parameters: error threshold: 8, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 2000, MAPQ-threshold: 30, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 4
Analyze single-cell data.
Will try to remove adapters on 3'.
Will remove PCR duplicates after mapping.
Will remove PCR duplicates at cell level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Perform Tn5 shift.
Output mappings in BED/BEDPE format.
Reference file: hg38.fa
Index file: hg38.chromap.index
1th read 1 file: WT_head_R1.fastq.gz
1th read 2 file: WT_head_R2.fastq.gz
1th cell barcode file: WT_head_I2.fastq.gz
Output file: WT_head_ATAC_chromap_fragments.tsv
Loaded all sequences successfully in 12.88s, number of sequences: 455, number of bases: 3209286105.
Kmer size: 17, window size: 7.
Lookup table size: 393376326, occurrence table size: 478581398.
Loaded index successfully in 38.87s.
Loaded sequence batch successfully in 0.02s, number of sequences: 1000, number of bases: 24000.
Mapped 250000 read pairs in 2.64s.
Mapped all reads in 4.53s.
Number of reads: 500000.
Number of mapped reads: 475426.
Number of uniquely mapped reads: 441016.
Number of reads have multi-mappings: 34410.
Number of candidates: 3919363.
Number of mappings: 475426.
Number of uni-mappings: 441016.
Number of multi-mappings: 34410.
Number of barcodes in whitelist: 0.
Number of corrected barcodes: 0.
Barcode does not exist in the translation table.
swiftgenomics commented 1 year ago

Now I know what happened. You are doing barcode translation. But you didn't specify a barcode whitelist file. Then if there are any barcodes that are not in the whitelist, they cannot be translated and caused the error. Can you also input a barcode whitelist file? That should resolve your issue.

jeremymsimon commented 1 year ago

Ohh okay. I can try that. Since the translation table is derived from the two whitelists, shouldn't chromap just use one column of the table instead of requiring an additional file?

swiftgenomics commented 1 year ago

Ohh okay. I can try that. Since the translation table is derived from the two whitelists, shouldn't chromap just use one column of the table instead of requiring an additional file?

We didn't implement that feature. We plan to add an error check to terminate the run in this case.

jeremymsimon commented 1 year ago

I additionally specified

--barcode-whitelist 737K-arc-v1.txt

as you suggested, but now get the error

Less than 5% barcodes can be found or corrected based on the barcode whitelist.
Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false warning, please run Chromap with the option --skip-barcode-check.

As suggested by @badoi in #15, I tried taking the reverse-complement of the whitelisted sequences, but the same error remained.

It's not clear to me what may be happening, especially considering my parameter --read-format bc:8:-1:- should account for the reverse complement and taking the last 16nt as we discussed above. Was the patch for issue #15 incorporated into the master branch, or should I try some alternative branch to test?

badoi commented 1 year ago

Hi @jeremymsimon, I also used the 10x genomics multiomics data with chromap and I got the barcode translation to work with the following command. I had to hand-craft the cell barcode fastq with relatively fast seqtk package. it's fairly fast, a couple minutes tops for 100Gb set of fastqs. I avoided trying out the --read-format bc:8:-1:- commands because I didn't know about it. might still save the trimming step if it works. Maybe it will with the combination of whitelist and translation files.

I had to reverse complement the ATAC barcode whitelist and make a translation file for both forward and reverse complement of the multiome ATAC barcodes to the multiome RNA barcodes. I also attached the ATAC barcode whitelist, so use whichever one works for you. I tested this and chromap will spit out cell barcodes in the fragments.tsv file that will match the multiome RNA barcodes output by cell ranger. The files are attached, gzipped, so unzip and align with abandon.

Note to not use --output-mappings-not-in-whitelist with multiome barcode translation because the uncorrected barcodes will throw Barcode does not exist in the translation table. error during the look up part. Chromap won't know how to "translate" to RNA barcode an ATAC barcode without any matches in the table. This took some debugging and a couple days to figure out. To get the chromap barcode translation to work there has to be the correct orientation whitelist to exclude barcodes that wouldn't match the translation dictionary. else it crashes at the last step. Previously I didn't using the translation, just the whitelist, and did the vlookup myself with bash scripts, before chromap had this functionality.

Let me know if this works. 10x was pretty opaque about the whole RNA-ATAC barcodes in the multiome kit being different and had 1-1 matches, so I'm glad y'all figured that out.

## declare the barcode include list and ATAC2RNA translation file
BARCODE_LIST=737K-arc-v1_ATAC_revComp.txt
BARCODE_TRANSLATE=737K-arc-v1_ATAC2RNA_forChromapBarcodeTranslate.txt

## trim the left 8bp with seqtk into a new file
for BC in $(ls *${SAMPLE_ID}*_R2_001.fastq.gz); do
  BCX=$(basename $BC _R2_001.fastq.gz)_R2_trimmed.fastq.gz
  if [ ! -f $BCX ]; then seqtk trimfq -b 8 $BC | gzip > $BCX; fi
 done

## collect the barcode files, read 1 or 2, with commas
BC1=$(ls *${SAMPLE_ID}*_R2_trimmed.fastq.gz | tr '\n' ',' | sed 's/,$//g')
FQ1=$(ls *${SAMPLE_ID}*_R1_001.fastq.gz | tr '\n' ',' | sed 's/,$//g')
FQ2=$(ls *${SAMPLE_ID}*_R3_001.fastq.gz | tr '\n' ',' | sed 's/,$//g' )

## run chromap w/ barcode fixing and translation
chromap --preset atac -t 8 \
-x $CHROMAP_IDX -r ${GENOME_FASTA} \
--remove-pcr-duplicates-at-cell-level \
--bc-error-threshold 2 \
--bc-probability-threshold .9 \
-1 ${FQ1} -2 ${FQ2} -b ${BC1} -o ${ALIGN_BED} \
--barcode-whitelist ${BARCODE_LIST} \
--barcode-translate ${BARCODE_TRANSLATE}

737K-arc-v1_ATAC.txt.gz 737K-arc-v1_ATAC_revComp.txt.gz 737K-arc-v1_ATAC2RNA_forChromapBarcodeTranslate.txt.gz

haowenz commented 1 year ago

I additionally specified

--barcode-whitelist 737K-arc-v1.txt

as you suggested, but now get the error

Less than 5% barcodes can be found or corrected based on the barcode whitelist.
Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false warning, please run Chromap with the option --skip-barcode-check.

As suggested by @badoi in #15, I tried taking the reverse-complement of the whitelisted sequences, but the same error remained.

It's not clear to me what may be happening, especially considering my parameter --read-format bc:8:-1:- should account for the reverse complement and taking the last 16nt as we discussed above. Was the patch for issue #15 incorporated into the master branch, or should I try some alternative branch to test?

That fix was merged into the master. This warning means that most of the barcodes are not in the whitelist, which is not expected. Are you sure the whitelist file 737K-arc-v1.txt you were using is for scATAC-seq? The whitelist files for scRNA-seq and scATAC-seq share the same name.

jeremymsimon commented 1 year ago

Thanks @badoi and @haowenz. I inadvertently grabbed the RNA and not the ATAC whitelist before as you caught.

So now using the appropriate ATAC whitelist (used as-is, not reverse-complemented), with the following command:

chromap \
    -t 4 \
    --preset atac \
    -x hg38.chromap.index \
    -r hg38.fa \
    --read-format bc:8:-1:- \
    -1 WT*R1* \
    -2 WT*R2* \
    -o WT_head_ATAC_chromap_fragments.tsv \
    -b WT*I2* \
    --barcode-whitelist 737K-arc-v1.txt \
    --barcode-translate 737K-arc_RNA_ATAC_barcodeTranslation.tsv

this completes without error and with the following stats:

Mapped 250000 read pairs in 2.34s.
Mapped all reads in 3.99s.
Number of reads: 500000.
Number of mapped reads: 445484.
Number of uniquely mapped reads: 413430.
Number of reads have multi-mappings: 32054.
Number of candidates: 3659362.
Number of mappings: 445484.
Number of uni-mappings: 413430.
Number of multi-mappings: 32054.
Number of barcodes in whitelist: 229375.
Number of corrected barcodes: 4662.
Sorted, deduped and outputed mappings in 3.97s.
# uni-mappings: 161280, # multi-mappings: 14084, total: 175364.
Number of output mappings (passed filters): 156249

So as long as that all looks correct and matches what you saw on your own system, then perhaps this solution is the simpler workaround that doesn't require pre-processing as @badoi outlined above.

I will close this for now. If I encounter any issues downstream related to barcode translation I will let you know.

It would be great to add notes of this to the documentation for future users as well.

Thanks!