unexpected unalign reads in capture sequencing

danyuewang commented 1 year ago

Hi,

I analyzed TRB capture sequencing data with MiXCR, but the alignment rate in the align step was very low, even though the unaligned reads map well to the TRBV region. What rules determine whether a read is counted as aligned or unaligned? Must a read contain part of the CDR3? I noticed that the "Alignment failed: no CDR3 parts" value is high.

Maybe I have missed something or misunderstood it.

Exact MiXCR commands

java17 -jar /bioinfo/software/packages/mixcr-4.5.0/mixcr.jar align --report ./D01.align.report.txt --json-report  D01.align.report.json --preset exome-seq --write-all --dna --species hsa  --not-aligned-R1 D01.unaligned.R1.fastq --not-aligned-R2 D01.unaligned.R2.fastq --not-parsed-R1 D01.unparsed.R1.fastq --not-parsed-R2 D01.unparsed.R2.fastq  D01.R1.fastq D01.R2.fastq D01.vdjca -f

MiXCR report files

Analysis date: Tue Sep 26 16:44:53 CST 2023
Input file(s): D01.R1.fastq,D01.R2.fastq
Output file(s): D01.vdjca,D01.unaligned.R1.fastq,D01.unaligned.R2.fastq,D01.unparsed.R1.fastq,D01.unparsed.R2.fastq
Version: 4.5.0; built=Fri Sep 22 20:39:05 CST 2023; rev=cdb24b4fb7; lib=repseqio.v3.0.1
Command line arguments: align --report ./D01.align.report.txt --json-report D01.align.report.json --preset exome-seq --write-all --dna --species hsa --not-aligned-R1 D01.unaligned.R1.fastq --not-aligned-R2 D01.unaligned.R2.fastq --not-parsed-R1 D01.unparsed.R1.fastq --not-parsed-R2 D01.unparsed.R2.fastq D01.R1.fastq D01.R2.fastq D01.vdjca -f
Analysis time: 17.6s
Total sequencing reads: 288047
Successfully aligned reads: 70035 (24.31%)
Coverage (percent of successfully aligned):
  CDR3: 0.09%
  FR3_TO_FR4: 0.01%
  CDR2_TO_FR4: 0.01%
  FR2_TO_FR4: 0%
  CDR1_TO_FR4: 0%
  VDJRegion: 0%
Alignment failed: no hits (not TCR/IG?): 29941 (10.39%)
Alignment failed: no CDR3 parts: 173014 (60.06%)
Alignment failed: low total score: 15057 (5.23%)
Overlapped: 261852 (90.91%)
Overlapped and aligned: 61085 (21.21%)
Overlapped and not aligned: 200767 (69.7%)

BAM records for some of the unaligned reads:

E200007575L1C001R0010010731     163     7       142045179       60      150M    =       142045246       217     TACCTTCTATCAGGACCTAGAAAGGATGTAAAACGGCTGGGTATAAATATCCCCTGGGTCTGGGGAAACTGTCAGGAGCAGTGACATCACAGGAATAACCACCAACCAAGGCCAAGGAGACCAGAGCCCAGCACCTCACCCAGAGGACCC      GGF5GGGGGFFB<FFEAGGFGGGD&GGFGFGG;FDEEGG,=GGFGGGGGGFBFFGFDFFFGFF=FFEGFGBGFFFFGFFGFGFG@GG)G.GF>FG#FGFDGDGFGFFGFF.#FFFDEGFFFEG6GD8FGGFFGFFGCGGFGG9EFBFF2G  NM:i:3  MD:Z:1G35C57A54 MC:Z:150M       AS:i:138   XS:i:72  RG:Z:D01
E200007575L1C001R0010010731     83      7       142045246       60      150M    =       142045179       -217    ACTGTCAGGAGCAGTGACATCACAGGAAAAACCACCAACCAAGGCCAAGGAGACCAGAGCCCAGCACCTCACCCAGAGGACCCCAGTCTGAGGCCCCATCTCAGACCCGAGGCTAGCATGGGCTGCAGGCTGCTCTGCTGTGCGGTTCTC      >AGFFFGFCFFCFFFCAFFFFFFGFFFFGAFEFFCF:FFFFFFDEFFF?:FFFBEFFFFFFGFGFF>FEF#E,;FEFFFFFFDF?F(F&F;FFFFFF3F6FFFF-C1FFEF@@=FFF3+FFFG@FGEFFFDFC3F;FFFFEFGGFBFFFF  NM:i:1  MD:Z:88A61      MC:Z:150M       AS:i:145   XS:i:121 RG:Z:D01        XA:Z:7,-142012919,150M,6;
E200007575L1C001R0010077877     163     7       142045420       60      127M23S =       142045420       127     GGTTCACATCAGCTGTCCTTGAATTCGAAACTTTTTCCTTGTGATTTCAGCAACAAGCCTCCTCCTGGGCTCTGCCTGAATTTTGTCCCTTTCCCCCCGCAGTCCCTATGGAAACGGGAGTTACGCAAGATCGGAAGAGCGTCGTGTAGG      FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGFGGGGGGGGGFFGGFGFFGGFGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFGGGGEGGGGGGFFGF  NM:i:1  MD:Z:106C20     MC:Z:23S127M    AS:i:122   XS:i:0   RG:Z:D01
E200007575L1C001R0010077877     83      7       142045420       60      23S127M =       142045420       -127    TCAGACGTGTGCTCTTCCGATCTGGTTCACATCAGCTGTCCTTGAATTCGAAACTTTTTCCTTGTGATTTCAGCAACAAGCCTCCTCCTGGGCTCTGCCTGAATTTTGTCCCTTTCCCCCCGCAGTCCCTATGGAAACGGGAGTTACGCA      FGGFFGGGFG>GGFGGFGGGGGGGFGGGGGGGGGFGGFGFGFGGGGGGGDGGFGGFGGGGGGGFGGGGGGGGFGGGGGGFGGGGFGFGGFGFGFGGFGGGGGGGGGGFGFFFGGGFFFFFFGGGFFFFFGGGGGGGGGGGGGFGGGGFGG  NM:i:1  MD:Z:106C20     MC:Z:127M23S    AS:i:122   XS:i:33  RG:Z:D01

Thank you for any help!

danyuewang commented 1 year ago

Also, for another sample, I found significant differences in the final clones' read counts (ReadsCount) between the DNA and RNA-seq data, and the RNA data yielded more assembled clones. Is this result normal? I had thought that both gDNA and RNA data could be used to analyze IG/TR rearrangements, but the DNA result was unexpected. Can you give some advice?

Thank you.

mizraelson commented 1 year ago

Hi,

The exome-seq preset processes two types of reads:

Those that fully cover the CDR3 region.
Those that partially cover the CDR3 region.

Any reads that don't overlap with the CDR3 region are discarded, because without CDR3 coverage it is hard to reliably assign an alignment to a specific clone. Single-cell analysis is handled differently: there, even reads that don't cover the CDR3 are retained, because such a read must have originated from one of the clones within a particular cell, a certainty we don't have in bulk repertoire analysis.
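
As a rough illustration of the rule above, here is a minimal Python sketch that classifies a read by how it overlaps a CDR3 interval. This is not MiXCR's implementation: MiXCR derives CDR3 boundaries from its alignments to V and J reference genes, whereas this sketch assumes the CDR3 interval is already known. The function name `classify_read` and all coordinates are hypothetical.

```python
def classify_read(read_start, read_end, cdr3_start, cdr3_end):
    """Classify a read by its overlap with a CDR3 interval.

    All coordinates are inclusive positions on the same (hypothetical)
    rearranged sequence. Returns one of three labels.
    """
    # No shared position at all: the read cannot be tied to a clone.
    if read_end < cdr3_start or read_start > cdr3_end:
        return "no CDR3 parts"
    # The read spans the whole interval.
    if read_start <= cdr3_start and read_end >= cdr3_end:
        return "full CDR3"
    # Anything else touches only one side of the interval.
    return "partial CDR3"


# Hypothetical CDR3 interval and reads, for illustration only.
CDR3 = (300, 345)
for name, span in {
    "leader_only": (1, 150),      # upstream of the CDR3
    "v_end": (200, 320),          # reaches into the CDR3 from the V side
    "spanning": (250, 399),       # covers the CDR3 completely
}.items():
    print(name, classify_read(*span, *CDR3))
```

In bulk analysis only the last two kinds of read would be kept under this rule; the first kind corresponds to the "no CDR3 parts" category in the report.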

The reads you have shared seem to only span portions of the V gene (like the UTR, Leader, and Intron), making it ambiguous as to which clones they should be attributed to.
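
To check where the shared reads fall on the reference, the SAM flag and CIGAR fields of each record can be decoded. The sketch below uses the flag bits and CIGAR operators defined in the SAM specification; the helper names `decode_flag` and `reference_span` are hypothetical, and the code does not look up gene annotations, so deciding which part of a V gene an interval belongs to still requires an annotation of the locus.

```python
import re

# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def reference_span(pos, cigar):
    """Return the 1-based inclusive reference interval of an alignment.

    Only operators that consume the reference (M, D, N, =, X) add to
    the length; soft clips (S) and insertions (I) do not.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    length = sum(int(n) for n, op in ops if op in "MDN=X")
    return pos, pos + length - 1


# FLAG, POS and CIGAR copied from the four records in the question.
records = [(163, 142045179, "150M"), (83, 142045246, "150M"),
           (163, 142045420, "127M23S"), (83, 142045420, "23S127M")]
for flag, pos, cigar in records:
    print(decode_flag(flag), reference_span(pos, cigar))
```

Both pairs are flagged as properly paired, and the 23 soft-clipped bases in the second pair do not extend the reference interval.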

However, a yield of 24.31% is quite commendable, especially for targeted exome-seq. Typically, RNA-seq provides better yields because multiple target RNA molecules exist per cell. So, it's not uncommon to see more favorable results there. Nevertheless, for in-depth repertoire analysis, targeted TCR/BCR sequencing is essential. If possible, using UMI barcoding can further enhance precision by allowing error correction.
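
The 24.31% figure can be cross-checked against the counts in the align report. The short Python sketch below uses only the numbers from the report above; the variable and function names are hypothetical. It confirms that the aligned reads and the three failure categories account for every input read, and it computes what share of the failed reads fall into "no CDR3 parts".

```python
# Counts copied from the align report in this thread.
total = 288047
counts = {
    "Successfully aligned reads": 70035,
    "Alignment failed: no hits (not TCR/IG?)": 29941,
    "Alignment failed: no CDR3 parts": 173014,
    "Alignment failed: low total score": 15057,
}


def percent(n, whole):
    """Percentage of `whole` represented by `n`, rounded to two decimals."""
    return round(100 * n / whole, 2)


# Every read ends up in exactly one of the four categories.
assert sum(counts.values()) == total

for label, n in counts.items():
    print(f"{label}: {n} ({percent(n, total)}%)")

# Share of the failed reads that were dropped for lacking CDR3 coverage.
failed = total - counts["Successfully aligned reads"]
no_cdr3_share = percent(counts["Alignment failed: no CDR3 parts"], failed)
print(f"no CDR3 parts among failed reads: {no_cdr3_share}%")
```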

danyuewang commented 1 year ago

Thank you for the clear interpretation!

milaboratory / mixcr

unexpected unalign reads in capture sequencing #1367
