Closed jeremymsimon closed 1 year ago
Could you please add the option "--read-format bc:0:15" to use the first 16bp from the barcode file sequence as the barcode? Please let me know whether this works.
Thanks @mourisl, I will try that. Do I need that in addition to --read-format bc:0:-1:-
or instead of?
Is the barcode reverse complemented?
Yes, I believe so. What would be the proper syntax in that case? And does that mean I need the last 15nt or does the reverse complement happen first?
It could be the last 16nt. You can try --read-format bc:0:15:-
or
--read-format bc:8:-1:-
to see which one works. Or you can manually inspect some of the barcodes in the I2 file to decide.
Hmm neither of those seemed to work, same error. The first 8nt of the I2
file are almost all the same (see original post snippet), and the final 16nt do need to be reverse complemented. What is the order of events here? Does it reverse complement and then cut down to the bp range specified, or does it cut and then reverse complement?
In Chromap, it extracts first and then takes the reverse complement. Could you please manually cut+reverse complement some of the barcodes to check whether they show up in the whitelist? Thank you.
Yes, if I extract the final 16nt, then reverse complement, I see those matching barcodes in the second column of my translation table. Without reverse complementing, there is no match. See below:
$ zcat *WT*I2* | head -n 100000 | awk 'NR%4==2' | cut -c 9- | head
AACGGACCTACCTGAT
AGGTTACACAAGCGTT
TTACCTCCTGAGGTAC
TGGGCAATGGCAATTT
GTGACCTTGCACCTTC
TCTGTGATGTTCTTAC
GACCAAGCTGGTTAGG
GCAACTAACCTGGTCC
CATCCTGGATGGCTAA
CCGATTATGCTACTAG
$ grep AACGGACCTACCTGAT 737K-arc_RNA_ATAC_barcodeTranslation.tsv
$ grep AGGTTACACAAGCGTT 737K-arc_RNA_ATAC_barcodeTranslation.tsv
$ grep TTACCTCCTGAGGTAC 737K-arc_RNA_ATAC_barcodeTranslation.tsv
$ zcat *WT*I2* | head -n 100000 | awk 'NR%4==2' | cut -c 9- | tr ACGT TGCA | rev | head
ATCAGGTAGGTCCGTT
AACGCTTGTGTAACCT
GTACCTCAGGAGGTAA
AAATTGCCATTGCCCA
GAAGGTGCAAGGTCAC
GTAAGAACATCACAGA
CCTAACCAGCTTGGTC
GGACCAGGTTAGTTGC
TTAGCCATCCAGGATG
CTAGTAGCATAATCGG
$ grep ATCAGGTAGGTCCGTT 737K-arc_RNA_ATAC_barcodeTranslation.tsv
CATTATCTCCTAGTTT ATCAGGTAGGTCCGTT
$ grep AACGCTTGTGTAACCT 737K-arc_RNA_ATAC_barcodeTranslation.tsv
AGGTTAACATCAATCG AACGCTTGTGTAACCT
$ grep GTACCTCAGGAGGTAA 737K-arc_RNA_ATAC_barcodeTranslation.tsv
CGGTTTCTCATTGTTC GTACCTCAGGAGGTAA
So by that logic, I should be able to use --read-format bc:8:-1:-
but it fails with the same error as above
Is there something else I'm missing here?
Could you please run chromap without --barcode-translate, and check whether the extracted barcodes make sense?
They do, they are mostly exact matches to the translation table. There are some exceptions, however, where a resulting sequence does not have a match in the translation table and perhaps this is where chromap is getting stuck. Do the parameters --bc-error-threshold
--bc-probability-threshold
also apply to the translation table or is that only for a supplied inclusion list? I'll try a run now with --barcode-translate
and --output-mappings-not-in-whitelist
specified
Update: it seems this ignores --output-mappings-not-in-whitelist
. It still fails with the same error
Is this dataset publicly available? Or are you able to provide a small sample for us to have a look?
Here are the first 1M lines of each of the FASTQs, and the barcode translation table I'm using WT_head_R1.fastq.gz WT_head_R2.fastq.gz WT_head_I2.fastq.gz RNA_ATAC_barcodeTranslation.tsv.gz
To reproduce, you should be able to run with
chromap \
-t 4 \
--preset atac \
-x hg38.chromap.index \
-r hg38.fa \
--read-format bc:8:-1:- \
-1 WT*R1* \
-2 WT*R2* \
-o WT_head_ATAC_chromap_fragments.tsv \
-b WT*I2* \
--barcode-translate RNA_ATAC_barcodeTranslation.tsv
With this small-scale test, you'll see that some fragments do get written, but only 21 lines before chromap reports the error and dies with the same error as above.
Sorry for the delay. I tried your test data and Chromap finished without any errors. So which version of Chromap were you using? Can you run "chromap -v" and check? If you were not using the latest version, please try the latest version and see if that resolve your issue.
Thanks for looking into it. I was running 0.2.3-r424
, but now updated to 0.2.3-r452
and I got the same error. Did you run the exact same code I have above or did you change a parameter setting? Or is there something that could be influenced by a dependency somewhere?
0.2.3-r452 is not a fix for your problem. I tested with the data and the identical command line you provided. What is your operating system? It looks like you compiled and ran Chromap right? What's your compiler and its version?
This is on our compute cluster, which is red hat linux v8.5. By default I have the following loaded, but they can be changed however we need:
$ gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
$ cc --version
cc (GCC) 9.1.0
$ cmake --version
cmake version 3.20.2
$ make --version
GNU Make 4.2.1
Let me know if that helps or if you're looking for something else
This is weird. Did you run the same command line with the test data you provided? And the error message is "Barcode does not exist in the translation table"?
Yes, same command, and same error. Here's the complete log:
Preset parameters for ATAC-seq/scATAC-seq are used.
Start to map reads.
WARNING: there are input barcode files but a barcode whitelist file is missing!
Parameters: error threshold: 8, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 2000, MAPQ-threshold: 30, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 4
Analyze single-cell data.
Will try to remove adapters on 3'.
Will remove PCR duplicates after mapping.
Will remove PCR duplicates at cell level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Perform Tn5 shift.
Output mappings in BED/BEDPE format.
Reference file: hg38.fa
Index file: hg38.chromap.index
1th read 1 file: WT_head_R1.fastq.gz
1th read 2 file: WT_head_R2.fastq.gz
1th cell barcode file: WT_head_I2.fastq.gz
Output file: WT_head_ATAC_chromap_fragments.tsv
Loaded all sequences successfully in 12.88s, number of sequences: 455, number of bases: 3209286105.
Kmer size: 17, window size: 7.
Lookup table size: 393376326, occurrence table size: 478581398.
Loaded index successfully in 38.87s.
Loaded sequence batch successfully in 0.02s, number of sequences: 1000, number of bases: 24000.
Mapped 250000 read pairs in 2.64s.
Mapped all reads in 4.53s.
Number of reads: 500000.
Number of mapped reads: 475426.
Number of uniquely mapped reads: 441016.
Number of reads have multi-mappings: 34410.
Number of candidates: 3919363.
Number of mappings: 475426.
Number of uni-mappings: 441016.
Number of multi-mappings: 34410.
Number of barcodes in whitelist: 0.
Number of corrected barcodes: 0.
Barcode does not exist in the translation table.
Now I know what happened. You are doing barcode translation. But you didn't specify a barcode whitelist file. Then if there are any barcodes that are not in the whitelist, they cannot be translated and caused the error. Can you also input a barcode whitelist file? That should resolve your issue.
Ohh okay. I can try that. Since the translation table is derived from the two whitelists, shouldn't chromap
just use one column of the table instead of requiring an additional file?
Ohh okay. I can try that. Since the translation table is derived from the two whitelists, shouldn't
chromap
just use one column of the table instead of requiring an additional file?
We didn't implement that feature. We plan to add an error check to terminate the run in this case.
I additionally specified
--barcode-whitelist 737K-arc-v1.txt
as you suggested, but now get the error
Less than 5% barcodes can be found or corrected based on the barcode whitelist.
Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false warning, please run Chromap with the option --skip-barcode-check.
As suggested by @badoi in #15, I tried taking the reverse-complement of the whitelisted sequences, but the same error remained.
It's not clear to me what may be happening, especially considering my parameter --read-format bc:8:-1:-
should account for the reverse complement and taking the last 16nt as we discussed above. Was the patch for issue #15 incorporated into the master branch, or should I try some alternative branch to test?
Hi @jeremymsimon, I also used the 10x genomics multiomics data with chromap and I got the barcode translation to work with the following command. I had to hand-craft the cell barcode fastq with relatively fast seqtk package. it's fairly fast, a couple minutes tops for 100Gb set of fastqs. I avoided trying out the --read-format bc:8:-1:-
commands because I didn't know about it. might still save the trimming step if it works. Maybe it will with the combination of whitelist and translation files.
I had to reverse complement the ATAC barcode whitelist and make a translation file for both forward and reverse complement of the multiome ATAC barcodes to the multiome RNA barcodes. I also attached the ATAC barcode whitelist, so use whichever one works for you. I tested this and chromap will spit out cell barcodes in the fragments.tsv file that will match the multiome RNA barcodes output by cell ranger. The files are attached, gzipped, so unzip and align with abandon.
Note to not use --output-mappings-not-in-whitelist
with multiome barcode translation because the uncorrected barcodes will throw Barcode does not exist in the translation table.
error during the look up part. Chromap won't know how to "translate" to RNA barcode an ATAC barcode without any matches in the table. This took some debugging and a couple days to figure out. To get the chromap barcode translation to work there has to be the correct orientation whitelist to exclude barcodes that wouldn't match the translation dictionary. else it crashes at the last step. Previously I didn't using the translation, just the whitelist, and did the vlookup myself with bash scripts, before chromap had this functionality.
Let me know if this works. 10x was pretty opaque about the whole RNA-ATAC barcodes in the multiome kit being different and had 1-1 matches, so I'm glad y'all figured that out.
## declare the barcode include list and ATAC2RNA translation file
BARCODE_LIST=737K-arc-v1_ATAC_revComp.txt
BARCODE_TRANSLATE=737K-arc-v1_ATAC2RNA_forChromapBarcodeTranslate.txt
## trim the left 8bp with seqtk into a new file
for BC in $(ls *${SAMPLE_ID}*_R2_001.fastq.gz); do
BCX=$(basename $BC _R2_001.fastq.gz)_R2_trimmed.fastq.gz
if [ ! -f $BCX ]; then seqtk trimfq -b 8 $BC | gzip > $BCX; fi
done
## collect the barcode files, read 1 or 2, with commas
BC1=$(ls *${SAMPLE_ID}*_R2_trimmed.fastq.gz | tr '\n' ',' | sed 's/,$//g')
FQ1=$(ls *${SAMPLE_ID}*_R1_001.fastq.gz | tr '\n' ',' | sed 's/,$//g')
FQ2=$(ls *${SAMPLE_ID}*_R3_001.fastq.gz | tr '\n' ',' | sed 's/,$//g' )
## run chromap w/ barcode fixing and translation
chromap --preset atac -t 8 \
-x $CHROMAP_IDX -r ${GENOME_FASTA} \
--remove-pcr-duplicates-at-cell-level \
--bc-error-threshold 2 \
--bc-probability-threshold .9 \
-1 ${FQ1} -2 ${FQ2} -b ${BC1} -o ${ALIGN_BED} \
--barcode-whitelist ${BARCODE_LIST} \
--barcode-translate ${BARCODE_TRANSLATE}
737K-arc-v1_ATAC.txt.gz 737K-arc-v1_ATAC_revComp.txt.gz 737K-arc-v1_ATAC2RNA_forChromapBarcodeTranslate.txt.gz
I additionally specified
--barcode-whitelist 737K-arc-v1.txt
as you suggested, but now get the error
Less than 5% barcodes can be found or corrected based on the barcode whitelist. Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false warning, please run Chromap with the option --skip-barcode-check.
As suggested by @badoi in #15, I tried taking the reverse-complement of the whitelisted sequences, but the same error remained.
It's not clear to me what may be happening, especially considering my parameter
--read-format bc:8:-1:-
should account for the reverse complement and taking the last 16nt as we discussed above. Was the patch for issue #15 incorporated into the master branch, or should I try some alternative branch to test?
That fix was merged into the master. This warning means that most of the barcodes are not in the whitelist, which is not expected. Are you sure the whitelist file 737K-arc-v1.txt
you were using is for scATAC-seq? The whitelist files for scRNA-seq and scATAC-seq share the same name.
Thanks @badoi and @haowenz. I inadvertently grabbed the RNA and not the ATAC whitelist before as you caught.
So now using the appropriate ATAC whitelist (used as-is, not reverse-complemented), with the following command:
chromap \
-t 4 \
--preset atac \
-x hg38.chromap.index \
-r hg38.fa \
--read-format bc:8:-1:- \
-1 WT*R1* \
-2 WT*R2* \
-o WT_head_ATAC_chromap_fragments.tsv \
-b WT*I2* \
--barcode-whitelist 737K-arc-v1.txt \
--barcode-translate 737K-arc_RNA_ATAC_barcodeTranslation.tsv
this completes without error and with the following stats:
Mapped 250000 read pairs in 2.34s.
Mapped all reads in 3.99s.
Number of reads: 500000.
Number of mapped reads: 445484.
Number of uniquely mapped reads: 413430.
Number of reads have multi-mappings: 32054.
Number of candidates: 3659362.
Number of mappings: 445484.
Number of uni-mappings: 413430.
Number of multi-mappings: 32054.
Number of barcodes in whitelist: 229375.
Number of corrected barcodes: 4662.
Sorted, deduped and outputed mappings in 3.97s.
# uni-mappings: 161280, # multi-mappings: 14084, total: 175364.
Number of output mappings (passed filters): 156249
So as long as that all looks correct and matches what you saw on your own system, then perhaps this solution is the simpler workaround that doesn't require pre-processing as @badoi outlined above.
I will close this for now. If I encounter any issues downstream related to barcode translation I will let you know.
It would be great to add notes of this to the documentation for future users as well.
Thanks!
Hi, I have a 10x multiomics (RNA+ATAC) experiment, and am trying to use
chromap
for the ATAC portion and associate it with RNA analyzed byalevin-fry
. My ATAC files are in the format withR1
,R2
,I1
, andI2
, and I made a barcode translation table to map the ATAC barcodes to the matched RNA sequences. Unfortunatelychromap
is failing with the errorBarcode does not exist in the translation table
and I can't quite seem to figure out the right configuration.My barcode translation table was made as follows:
My chromap execution command was then:
I thought that perhaps there was a reverse-complement issue, so I also tried the same command with
added but that resulted in the same error.
Looking at my
I2
file:And now looking at the translation table:
It's clear there is a length mismatch (translation table barcodes = 16nt, I2 barcodes = 24nt), but given the
I1/I2
architecture and not having anR3
file, I'm not exactly sure how to proceed, and the temp files output bychromap
are not human-readable to tell what it's looking at.Any insight or guidance here would be much appreciated!
Thanks!