milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
330 stars 79 forks source link

Alignment failed 10x-vdj-tcr 10xGenomics Preset; Reduce trimming quality threshold #1287

Closed abyappan closed 1 year ago

abyappan commented 1 year ago

Expected Result:

I am working with a spatial gene expression data set processed by Space Ranger 2.0.0 from Visium 10xGenomics to pull out TCR sequences. I used the '10x-vdj-tcr' 10xGenomics preset, but my alignment is failing.

Actual Result:

The alignment is failing. I was thinking to reduce the --trimming-quality-threshold from the align command. I have looked at the mixcr exportAlignmentsPretty output and the data structure does not seem to be the issue. Could you either suggest how to incorporate this command into the preset, or another way to solve my alignment failure?

Exact MiXCR commands

mixcr analyze 10x-vdj-tcr --species hsa folder/file_S1_L{{n}}_R1_001.fastq.gz folder/file_S1_L{{n}}_R2_001.fastq.gz output_folder

MiXCR report files exportAlignmentsPretty: first few output lines:

>>> Read ids: 1083138

>>> Tags:
>>> CELL: AGTTCGTGCAGACCTT
>>> UMI: GTGAACACCA

                                                                              <J                
             _  S  N  Q  L  F  F  F  F  F  F  F  F  L  L  G  S  I  C  L  F  P  A  K  G  T  R    
  Quality    27727275272277777777777777777777777777777777777777777277777777777777777777777777   
  Target0  0 GTTCAAACCAACTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCTGGGAAGTATATGTCTGTTCCCAGCAAAAGGGACTAGG 79  Score
TRGJP2*00 46                                                                  gcaaaagggactagg 60  125

              L  I  V _   
  Quality    7777777777   
  Target0 80 CTCATAGTAA 89  Score
TRGJP2*00 61 ctcatagtaa 70  125

>>> Read ids: 1415650

>>> Tags:
>>> CELL: GTGATGGATTAGAACA
>>> UMI: CATGCTTGTA

                                                                              <J                
             _  F  F  F  L  F  F  F  F  F  F  F  F  F  L  P  T  H  G  S  L  *  A  E  G  T  K    
  Quality    25522772222277777777777777777777777777777777777777777777777777777777777777777777   
  Target0  0 CATTCTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTGCCCACCCATGGAAGTTTATGAGCTGAAGGGACTAAG 79  Score
TRGJP1*00 46                                                                  gctgaagggactaag 60  125

              L  I  V _   
  Quality    7777777777   
  Target0 80 CTCATAGTAA 89  Score
TRGJP1*00 61 ctcatagtaa 70  125

>>> Read ids: 1474478

>>> Tags:
>>> CELL: GCTTGCCATAGTGGAT
>>> UMI: AAGTTGTACG

                                                                         CDR3><FR4              
                                                                             <J                 
             _  F  F  F  F  F  F  F  F  F  F  F  F  L  Q  P  L  S  L  E  M  A  A  K  G  T  R    
  Quality    77227222277777777777777777777777777777777775777777777777777777777777777777777777   
  Target0  0 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCAGCCACTTTCATTAGAGATGGCTGCAAAAGGGACTAGG 79  Score
TRGJP2*00 45                                                                 tgcaaaagggactagg 60  130

              L  I  V _   
  Quality    7777757777   
  Target0 80 CTCATAGTAA 89  Score
TRGJP2*00 61 ctcatagtaa 70  130

>>> Read ids: 2192256

>>> Tags:
>>> CELL: TGTTGGATGGACTTCT
>>> UMI: TAGCGCTTGT

                                                     <J                  CDR3><FR4              
             _  L  R  L  S  F  F  F  F  F  F  F  F  F  S  S  D  W  I  K  T  F  A  K  G  T  R    
  Quality    25725225227525777777777777777757777777777777277777777777777777777777777777777777   
  Target0  0 TTTTGCGCTTGTCATTTTTTTTTTTTTTTTTTTTTTTTTTTAGTAGTGATTGGATCAAGACGTTTGCAAAAGGGACTAGG 79  Score
TRGJP2*00 21                                         tagtagtgattggatcaagacgtttgcaaaagggactagg 60  250

              L  I  V _   
  Quality    7777777777   
  Target0 80 CTCATAGTAA 89  Score
TRGJP2*00 61 ctcatagtaa 70  250

10x-vdj-tcr:

Analysis date: Wed Jul 26 13:30:46 EDT 2023
Input file(s): CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L001_R1_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L001_R2_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L002_R1_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L002_R2_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L003_R1_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L003_R2_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L004_R1_001.fastq.gz,CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L004_R2_001.fastq.gz
Output file(s): ./CytAssist_FFPE_Human_Skin_Melanoma_Results.vdjca
Version: 4.3.2; built=Tue Apr 11 14:35:07 EDT 2023; rev=123d699964; lib=repseqio.v2.2
Command line arguments: align --report ./CytAssist_FFPE_Human_Skin_Melanoma_Results.align.report.txt --json-report ./CytAssist_FFPE_Human_Skin_Melanoma_Results.align.report.json --preset 10x-vdj-tcr --species hsa CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L{{n}}_R1_001.fastq.gz CytAssist_FFPE_Human_Skin_Melanoma_fastqs/CytAssist_FFPE_Human_Skin_Melanoma_S1_L{{n}}_R2_001.fastq.gz ./CytAssist_FFPE_Human_Skin_Melanoma_Results.vdjca
Analysis time: 15.72m
Total sequencing reads: 193483192
Successfully aligned reads: 360 (0%)
Alignment failed: no hits (not TCR/IG?): 191103257 (98.77%)
Alignment failed: low total score: 2379575 (1.23%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Overlapped and not aligned: 0 (0%)
Alignment-aided overlaps, percent of overlapped and aligned: 0 (NaN%)
No CDR3 parts alignments, percent of successfully aligned: 3 (0.83%)
Partial aligned reads, percent of successfully aligned: 357 (99.17%)
Realigned with forced non-floating bound: 0 (0%)
Realigned with forced non-floating right bound in left read: 0 (0%)
Realigned with forced non-floating left bound in right read: 0 (0%)
TRA chains: 5 (1.39%)
TRA non-functional: 0 (0%)
TRD chains: 1 (0.28%)
TRD non-functional: 0 (0%)
TRG chains: 338 (93.89%)
TRG non-functional: 0 (0%)
IGH chains: 16 (4.44%)
IGH non-functional: 0 (0%)
Tag parsing report:
  Execution time: 0ns
  Total reads: 193483192
  Matched reads: 193483192 (100%)
  Projection +R1 +R2: 193483192 (100%)
  For variant 0:
    For projection [1, 2]:
      CELL:Left position: 0
      UMI:Left position: 16
      CELL:Right position: 16
      UMI:Right position: 26
      R2:Left position: 0
      Variants: 0
      Cost: 0
      CELL length: 16
      UMI length: 10
      R2 length: 90
mizraelson commented 1 year ago

Hi, what 10x chemistry did you use? Was it a 5' VDJ enriched protocol?

abyappan commented 1 year ago

Hi, thank you for getting back to me. No, it was the 3' whole transcriptome gene expression.

mizraelson commented 1 year ago

The preset you used was designed for enriched VDJ 5' data. Usually, with 3', we don't really expect a high yield, as the reads often don't cover the CDR3 region. If you could let me know what chemistry was used — v2 or v3 — for your 3' 10x analysis, I can assist you with the correct command. The versions differ in barcode patterns and whitelists.

abyappan commented 1 year ago

I understand, thank you for explaining. The following chemistry was used: Visium V4 Slide - FFPE v2. Please let me know if there is any additional information I can provide you with.

mizraelson commented 1 year ago

It seems that you used 3' v3 judging by the structure of the library they provide. Please try the command below:

mixcr analyze 10x-sc-5gex \
--species hsa \
-MrefineTagsAndSort.whitelists.CELL=builtin:3M-febrary-2018 \
--tag-pattern "^(CELL:N{16})(UMI:N{12})\^(R2:*)" \
input_R1.fastq.gz \
input_R2.fastq.gz \
output
abyappan commented 1 year ago

Thank you for the reply. When I ran the code, I received the following error message:

        No preset with name "10x-sc-5gex".
        Here are supported presets with similar names:
        - 10x-5gex-cdr3
- 10x-5gex-full-length
- 10x-vdj-bcr
- split-seq-vdj-3gex
- 10x-vdj-bcr-full-length

        To list all built-in presets run `mixcr listPresets`.

I am running MiXCR v4.3.2. What do you recommend?

mizraelson commented 1 year ago

I recommend updating to version 4.4.2, which is the latest version and includes numerous fixes and improvements.

abyappan commented 1 year ago

Thank you, this helped and solved my problem. I appreciate all your suggestions and time.

abyappan commented 1 year ago

Hi there, I had another question related to this issue:

When I ran the provided code, this was the output:

Analysis time: 26.32m
Total sequencing reads: 193483192
Successfully aligned reads: 299532 (0.15%)
Coverage (percent of successfully aligned):
  CDR3: 0%
  FR3_TO_FR4: 0%
  CDR2_TO_FR4: 0%
  FR2_TO_FR4: 0%
  CDR1_TO_FR4: 0%
  VDJRegion: 0%
Alignment failed: no hits (not TCR/IG?): 191314513 (98.88%)
Alignment failed: low total score: 1869147 (0.97%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Overlapped and not aligned: 0 (0%)
Alignment-aided overlaps, percent of overlapped and aligned: 0 (NaN%)
No CDR3 parts alignments, percent of successfully aligned: 268689 (89.7%)
Partial aligned reads, percent of successfully aligned: 30843 (10.3%)
Realigned with forced non-floating bound: 0 (0%)
Realigned with forced non-floating right bound in left read: 0 (0%)
Realigned with forced non-floating left bound in right read: 0 (0%)
TRA chains: 10241 (3.42%)
TRA non-functional: 0 (0%)
TRB chains: 37565 (12.54%)
TRB non-functional: 0 (0%)
TRD chains: 91 (0.03%)
TRD non-functional: 0 (0%)
TRG chains: 4355 (1.45%)
TRG non-functional: 0 (0%)
TRAD chains: 93682 (31.28%)
TRAD non-functional: 0 (0%)
IGH chains: 50148 (16.74%)
IGH non-functional: 0 (0%)
IGK chains: 12057 (4.03%)
IGK non-functional: 0 (0%)
IGL chains: 91393 (30.51%)
IGL non-functional: 0 (0%)
Trimming report:
  R1 reads trimmed left: 10056 (0.01%)
  R1 reads trimmed right: 2261 (0%)
  Average R1 nucleotides trimmed left: 2.916532408665245E-4
  Average R1 nucleotides trimmed right: 3.033286736348654E-4
Tag parsing report:
  Execution time: 0ns
  Total reads: 193483192
  Matched reads: 193483192 (100%)
  Projection +R1 +R2: 193483192 (100%)
  For variant 0:
    For projection +R1 +R2:
      CELL:Left position: 0
      UMI:Left position: 16
      CELL:Right position: 16
      UMI:Right position: 28
      R2:Left position: 0
      Variants: 0
      Cost: 0
      CELL length: 16
      UMI length: 12
      R2 length: 90

As a reminder, this was 3' sequencing which does not fully cover the CDR3 region and therefore does not yield a high number of reads.

I was reading the section "Upstream analysis steps" in the "Analysis Overview" section of the miXCR website. Would using assemblePartial or extend from the CDR3 Extension help in this case, since this was 3' sequencing (does not fully cover the CDR3 region) and we do not expect a high yield of reads? Is there any way to improve the recovery of reads and clones identified?

mizraelson commented 1 year ago

Hi! You are right and both the assemblePartial and extend steps are already included in the pipeline mentioned above. The numbers you're observing seem reasonable, considering the non-enriched 3' end protocol.

abyappan commented 1 year ago

Thank you so much for this response. Good to know -- just wanted to make sure I had tried everything I could with the pipeline for this data. I really appreciate all your help!