haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
191 stars 21 forks source link

very low number of mapped read when using --read-format #136

Open nouroddinc opened 1 year ago

nouroddinc commented 1 year ago

Hi chromap team, Thank you so much for preparing this helpful tools and I'm very impressed by its performances. I am testing it on my scATAC-seq data and it works pretty good when I don't use --read-format.

I have a scATAC-seq with R1 and R2 to be mapped and cell barcodes in R2 ( at positions bc1:22:29 and bc2:60-67)

Looking at my R2 file:

# zcat read_R2.fastq.gz | head

@lh00134:68:225MHKLT3:7:1101:1157:1080 2:N:0:TAAGGC
CACGCGTTGACTTCTCGCATCTCAACCACAATCCACGTGCTTGAGAGGCCAGAGCATTCGCGACTGGAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCCTAGTCGGCCATCATAGGAATGAGAGGCCCT
+
---55-FFF-F-FFFFFFFF5FFFF5555555-555555-5-555F-FFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFF5FF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
CAAGCGTTGACTTCTCGCATCTCCTCCTGAATCCACGTGCTTGAGAGGCCAGAGCATTCGGACTAGTAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCACTATAGGGGAGCGTCCGAGTTTCAAACAAA
+
--F5F55-F--F5-F-FFFFFFFFFFFFFFFFFFFFF-FFF-FFFF-FFFFFFFFFF-FFFFFFFFFFFFFF-FFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
CAAGCGCTGGCTTCTCGCATCTCAGCGTTAATCCACGTGCTTGAGAGGCCAGAGCATTCGACCACTGTGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCCCTGACCCTAGCGGGGAGGGTGAGGAAGCTG

using the followoing command:

chromap  \
          --preset atac \
          -x GRCm38_genome.index  \
          -r GRCm38_genome.fa \
          -1 read_R1.fastq.gz \
          -2 read_R2.fastq.gz \
          -o aln.bed \
          -b read_R2.fastq.gz \
          --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 \
          -t 10 \
          --barcode-whitelist bc50.txt \
          2> chromap_with_read_format.log

gives me very low number of the mapped reads as the chromap_with_read_format.log shows.

but if I extract barcodes from read_R2.fq.gz into read_R3.fq and leave the genome in read_R2.fq as a new file like the below:

# cat read_R2.fastq | head
@lh00134:68:225MHKLT3:7:1101:1157:1080 2:N:0:TAAGGC
GCCTAGTCGGCCATCATAGGAATGAGAGGCCCT
+
FFFFFFFFFFFFFFFFFFFFFFF5FFFFFF5FF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
GCACTATAGGGGAGCGTCCGAGTTTCAAACAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
GCCCTGACCCTAGCGGGGAGGGTGAGGAAGCTG

and

# cat read_R3.fastq | head
CAACCACACGACTGGA
+
FFF55555FFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
CCTCCTGAGACTAGTA
+
FFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
CAGCGTTAACCACTGT

and then run like this:

chromap  \
          --preset atac \
          -x GRCm38_genome.index  \
          -r GRCm38_genome.fa \
          -1 read_R1.fastq.gz \
          -2 read_R2.fq \
          -o aln.bed \
          -b read_R3.fq \
          -t 10 \
          --barcode-whitelist bc50.txt \
          2> chromap_without_read_format.log

it works very good as log file chromap_without_read_format.log shows. my chromap version is

# ./chromap/chromap -v
0.2.5-r473

and this is my barcode whitelist file:

# cat bc50.txt | head
AACGTGATAACGTGAT
AACGTGATGCGAGTAA
AACGTGATATGCCTAA
AACGTGATGCTAACGA
AACGTGATACCACTGT
AACGTGATACATTGGC
AACGTGATCAGATCTG
AACGTGATCATCAAGT
AACGTGATCGCTGATC
AACGTGATACAAGCTA

I would appreciate if you could help me to find what i am doing wrong because using --read-format argument can make it much faster and easier to run my pipeline.

many Thanks, Noori

haowenz commented 1 year ago

@mourisl Is this way of using read-format supported?

mourisl commented 1 year ago

@nouroddinc Is your barcode starting from 60-67 followed by 22-29? Chromap will internally sort it to 22-29 followed by 60-67 to avoid typos in ranges. Maybe this is an undesirable feature now.

nouroddinc commented 1 year ago

@haowenz @mourisl thanks for helping me on this. yes, my barcode starts from 60-67 and ends with 22-29. I have tested with --read-format bc:22:29,bc:60:67,r1:0:-1,r2:117:-1 and the output log file and bed file is exactly the same as when I tested --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 . I should also mention that my barcode file bc50.txt includes both barcodes bc:22:29+60:67 and bc:60:67+22:29 in some cases. For example

# cat bc50.txt | head
AACGTGATAACGTGAT
AACGTGATGCGAGTAA
AACGTGATATGCCTAA
AACGTGATGCTAACGA
AACGTGATACCACTGT
AACGTGATACATTGGC
AACGTGATCAGATCTG
AACGTGATCATCAAGT
AACGTGATCGCTGATC
AACGTGATACAAGCTA

I have both combinations of barcodes AACGTGAT and AACGTGAT which is AACGTGATATGCCTAA and ATGCCTAAAACGTGAT but I only have AACGTGATGCGAGTAA and Not GCGAGTAAAACGTGAT

mourisl commented 1 year ago

@nouroddinc I have removed the sort in the branch "li_dev4". Could you please checkout this branch, "make clean; make" to recompile chromap, and give the new version a try?

nouroddinc commented 1 year ago

@mourisl I have tested the new branch and the output still shows the low number of mapped reads! chromap_with_read_format_new_branch.log

mourisl commented 1 year ago

I rechecked your running log. For the "_without_read_format" run, you used the files

1th read 1 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R1_001.fastq.gz
1th read 2 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R3_001.fastq
1th cell barcode file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R2_001.fastq

For the files with "_with_read_format", the files are:

1th read 1 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R1_001.fastq.gz
1th read 2 file: 6bp_D01291_NG02620_linker2_R2.fastq.gz
1th cell barcode file: 6bp_D01291_NG02620_linker2_R2.fastq.gz

Since some of the files are in cellranger_inputs, and some are not, I just want to make sure you used the right input files.

nouroddinc commented 1 year ago

@mourisl I think you might be right. I tested with a new data set with smaller size to get faster output and sounds like they are very similar but still not identical. Is that normal?

chromap_with_read_format.log

chromap_without_read_format.log

mourisl commented 1 year ago

The "Number of barcodes in whitelist" is still different "580224" vs "539675". So I guess the manually extracted barcode and the automatic extracted barcode are still different. Could you please try:

  1. Reverse the order of "bc:" in --read-format, just to make sure the new implementation is correct.
  2. Could you please run Chromap with "--SAM" to generate the SAM format output. Then you can compare the barcode from the same read in two different input and check which one is correct.

Thank you for helping us debug the issue.

nouroddinc commented 1 year ago

@mourisl is this what you asked for?

1- I have tried reverse barcode like bc:60:67:-,bc:22:29:-,r1:0:-1,r2:117:-1 and it gives Less than 5% barcodes can be found or corrected based on the barcode whitelist error which makes sense to me as the log file shows chromap_with_read_format_rev.log

2- I have extracted CB:Z tag barcodes from both sam files and they match the whitelist. I mean I can't see any wrong barcode in neither of them.

barcodes_w_rf.txt

barcodes_wo_rf.txt

mourisl commented 1 year ago
  1. I meant to change the order in read format as bc:22:29,bc:60:67.
  2. Since SAM file contains the read id, you can check whether the same read will get the same barcode in this format.

Thank you for the testing.

nouroddinc commented 1 year ago

alright, I have two outputs.

1) with the normal barcode (bc:60:67,bc:22:29) chromap_with_read_format.log

sam files output in normal barcode is like this:

# samtools view -h aln.sam | sed -n '67,74 p'

lh00134:68:225MHKLT3:7:1281:23172:18977 99  chr1    3009659 60  115M    =   3009741 115 TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:115    CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:1281:23172:18977 147 chr1    3009741 60  33M =   3009659 -115    AATATTCTTTTTTGATGTTGTTGTTCTATAAGT   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:2242:32307:22630 163 chr1    3045894 60  33M =   3046071 327 TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA   FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:2242:32307:22630 83  chr1    3046071 60  150M    =   3045894 -327    AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG  FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  MD:Z:5G20A86A36 CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:1162:45492:5711  163 chr1    3062726 46  33M =   3062726 106 GTATGGAAGGAAACGGCGAGAGAAATAATGGAG   FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:1162:45492:5711  83  chr1    3062726 46  106M    =   3062726 -106    GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:1  MD:Z:103C2  CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:2149:19696:3243  99  chr1    3094994 60  73M =   3095034 73  GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:73 CB:Z:AAGGTACAGAGCTGAA
lh00134:68:225MHKLT3:7:2149:19696:3243  147 chr1    3095034 60  33M =   3094994 -73 TAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AAGGTACAGAGCTGAA

2) the reverse barcode (bc:22:29,bc:60:67) log file:

chromap_with_read_format_rev.log

and reverse barcode sam file is like :

# samtools view -h aln_rev.sam | sed -n '67,74 p'

lh00134:68:225MHKLT3:7:1281:23172:18977 99  chr1    3009659 60  115M    =   3009741 115 TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:115    CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:1281:23172:18977 147 chr1    3009741 60  33M =   3009659 -115    AATATTCTTTTTTGATGTTGTTGTTCTATAAGT   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:2242:32307:22630 163 chr1    3045894 60  33M =   3046071 327 TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA   FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:2242:32307:22630 83  chr1    3046071 60  150M    =   3045894 -327    AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG  FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  MD:Z:5G20A86A36 CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:1162:45492:5711  163 chr1    3062726 46  33M =   3062726 106 GTATGGAAGGAAACGGCGAGAGAAATAATGGAG   FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:1162:45492:5711  83  chr1    3062726 46  106M    =   3062726 -106    GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:1  MD:Z:103C2  CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:2149:19696:3243  99  chr1    3094994 60  73M =   3095034 73  GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:73 CB:Z:GAGCTGAAAAGGTACA
lh00134:68:225MHKLT3:7:2149:19696:3243  147 chr1    3095034 60  33M =   3094994 -73 TAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:GAGCTGAAAAGGTACA

and the read2 file is this:

# zcat read_R2.fastq.gz | grep -A 3 -F -f list_of_read_ids.txt | awk 'BEGIN{RS="--"; ORS=""} {print $0}'

@lh00134:68:225MHKLT3:7:2149:19696:3243 2:N:0:TNAGGC
CACGCATTTGCTTCTCGCATCTGAGCTGAAATCCACGTGCTTGAGAGGCCAGAGCATTCGAAGGTACAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGTATGCAGATTGGTACACACAGATGTGACTGCTA
+
F5F-FF-FF-FFFF5FFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:1281:23172:18977 2:N:0:TNAGGC
CAAGCTTTGGCTTCTCGCATCTCACTTCGAATCCACGTGCTCGATAGGCCAGAGCATTCGACTATGCAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGACTTATAGAACAACAACATCAAAAAAGAATATT
+
F55FFF5FFFFFFFF55555555555-555555555555555FFFFFFFFFFFFFFF-F-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:1162:45492:5711 2:N:0:TNAGGC
CACTCGTTGCCCTCTCGCATCTAGTACAAGATCCACGTGCTTGATAGGCCAGAGCATTCGACACAGAAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGTATGGAAGGAAACGGCGAGAGAAATAATGGAG
+
F5-FF-5FFFF-F-FF-5FFF5555555555-555-55-555555555FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF55FFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:2242:32307:22630 2:N:0:TNAGGC
CAAGCGTTGGCTTCTCGCATCTAAGGTACAATCCACGTGCTTGAGAGGCCAGAGCATTCGGAATCTGAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGTAGTATAACTGTGTAGTGAAAGCTATTTGCTCA
+
-FFF-FFFF555-FFFFF5FFFF5555555FFF-FFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF

hopefully, this is what you were looking for. Thanks

mourisl commented 1 year ago

Seems the barcode order issue is indeed fixed. Are the barcodes from readformat run the same as your manually-created barcode file (without readformat) based on the SAM file?

nouroddinc commented 1 year ago

it sounds like NOT; they are in reverse of each other!

chromap --preset atac -x GRCm38_genome.index -r GRCm38_genome.fa -1 read_R1.fastq.gz -2 read_R2.fastq.gz --SAM -o aln.sam -b read_R2.fastq.gz --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 -t 10 --barcode-whitelist bc50.txt 2> chromap_with_read_format.log

the out put is

# samtools view -h aln.sam | sed -n '67,74 p'
lh00134:68:225MHKLT3:7:1281:23172:18977 99  chr1    3009659 60  115M    =   3009741 115 TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:115    CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:1281:23172:18977 147 chr1    3009741 60  33M =   3009659 -115    AATATTCTTTTTTGATGTTGTTGTTCTATAAGT   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:2242:32307:22630 163 chr1    3045894 60  33M =   3046071 327 TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA   FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:2242:32307:22630 83  chr1    3046071 60  150M    =   3045894 -327    AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG  FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  MD:Z:5G20A86A36 CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:1162:45492:5711  163 chr1    3062726 46  33M =   3062726 106 GTATGGAAGGAAACGGCGAGAGAAATAATGGAG   FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:1162:45492:5711  83  chr1    3062726 46  106M    =   3062726 -106    GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:1  MD:Z:103C2  CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:2149:19696:3243  99  chr1    3094994 60  73M =   3095034 73  GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:73 CB:Z:AAGGTACAGAGCTGAA
lh00134:68:225MHKLT3:7:2149:19696:3243  147 chr1    3095034 60  33M =   3094994 -73 TAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AAGGTACAGAGCTGAA

and without --read-format

chromap --preset atac -x GRCm38_genome.index -r GRCm38_genome.fa -1 read_R1.fastq.gz -2 read_R2.fq --SAM -o aln_wo.sam -b read_R3.fq -t 10 --barcode-whitelist bc50.txt 2> chromap_without_read_format.log

# samtools view -h aln_wo.sam | sed -n '67,74 p'
lh00134:68:225MHKLT3:7:1281:23172:18977 99  chr1    3009659 60  115M    =   3009741 115 TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:115    CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:1281:23172:18977 147 chr1    3009741 60  33M =   3009659 -115    AATATTCTTTTTTGATGTTGTTGTTCTATAAGT   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:2242:32307:22630 163 chr1    3045894 60  33M =   3046071 327 TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA   FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:2242:32307:22630 83  chr1    3046071 60  150M    =   3045894 -327    AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG  FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:3  MD:Z:5G20A86A36 CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:1162:45492:5711  163 chr1    3062726 46  33M =   3062726 106 GTATGGAAGGAAACGGCGAGAGAAATAATGGAG   FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:1162:45492:5711  83  chr1    3062726 46  106M    =   3062726 -106    GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:1  MD:Z:103C2  CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:2149:19696:3243  99  chr1    3094994 60  73M =   3095034 73  GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:73 CB:Z:GAGCTGAAAAGGTACA
lh00134:68:225MHKLT3:7:2149:19696:3243  147 chr1    3095034 60  33M =   3094994 -73 TAGCAGTCACATCTGTGTGTACCAATCTGCATA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:33 CB:Z:GAGCTGAAAAGGTACA

this means that reverse barcode that i already tested is aligned with the one with out --read-format

mourisl commented 1 year ago

Could you please be more specific about whether the "without --read-format" run had mistakes due to the input order being wrong, or the "with --read-format" run generated wrong barcodes? Thank you.

nouroddinc commented 1 year ago

@mourisl I think it works now pretty well with my manual barcode extraction pipeline. If I use --read-format bc:22:29,bc:60:67,r1:0:-1,r2:117:-1 which we called reverse barcode in the above that matches exactly with the manual barcode extraction one. You are good to close this. Thank you so much for your help!