milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
329 stars 79 forks source link

mixcr assemble error with maxConsensuses=1 #1816

Closed DMaspero closed 6 days ago

DMaspero commented 1 week ago

Checklist before submitting the issue:

Dear mixcr developers, First of all, thank you for providing such a powerfull tool. I am trying to retrieve TCR clones from sequencing files using a molecular identifier barcode of 35bp.

The alignment step works as expected, but I have some trouble running the following assemble step. In particular, looking the tsv file obtained with exportClones, I noticed that there are hundres of clones with the same barcode. I was expecting to observe only one clone for each barcode.

I tried to change that behaviour by adding: -Passembler.maxConsensuses=1 to the command, but unfortunatelly, I am getting the following error.

MiXCR report files

   Version: 4.7.0; built=Wed Aug 07 21:19:48 CEST 2024; rev=976ba14139; lib=repseqio.v5.1
        OS: Linux
      Java: 22.0.1-internal Cmd args: assemble -Passembler.maxConsensuses=1 -OassemblingFeatures=CDR3 -OseparateByJ=true -OseparateByV=true -a --report ./sample.assemble.report.txt --json-report ./sample.assemble.report.json sample.vdjca sample.clna
picocli.CommandLine$ExecutionException: Error while running command assemble java.lang.IllegalStateException: Failed to apply the following override (NPE): {assembler.maxConsensuses=1}
    at com.milaboratory.mixcr.cli.Main.registerExceptionHandlers$lambda-17(SourceFile:420)
    at picocli.CommandLine.execute(CommandLine.java:2088)
    at com.milaboratory.mixcr.cli.Main.execute(SourceFile:105)
    at com.milaboratory.mixcr.cli.Main.main(SourceFile:101)
Caused by: java.lang.IllegalStateException: Failed to apply the following override (NPE): {assembler.maxConsensuses=1}
    at com.milaboratory.o.o.invoke(SourceFile:121)
    at com.milaboratory.o.eh.invoke(SourceFile:42)
    at com.milaboratory.cli.CommandWithParametersKt$sam$com_milaboratory_cli_POverride$0.apply(SourceFile)
    at com.milaboratory.o.w.resolve(SourceFile:188)
    at com.milaboratory.o.w.resolve$default(SourceFile:171)
    at com.milaboratory.mixcr.cli.CommandAssemble$Cmd.run1(SourceFile:246)
    at com.milaboratory.mixcr.cli.MiXCRCommandWithOutputs.run0(SourceFile:69)
    at com.milaboratory.mixcr.cli.MiXCRCommand.run(SourceFile:37)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at com.milaboratory.mixcr.cli.Main.registerLogger$lambda-32(SourceFile:539)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    ... 2 more

Do you have any suggestions?

Thank you in advance for your time and help, Davide Maspero

mizraelson commented 1 week ago

Hi, could you please elaborate on the wet lab protocol and the preset you use?

DavideMaspero commented 1 week ago

Dear @mizraelson , thanks for the prompt response. I have two paired fastq files, in read_1 I have the sequence that should align to the CDR3 genomic region

@readID/1
GTGGGACAAGAGGATCAGGGTTCCCGCCTCTCAGTACGTCAGCAGAGGCATGGCGACCTTATCAGTCCCTTTGCGTTCCCGTACAATGTATTGTCTTCCTAAGACTGCTGTCCCATTTTTTTTTTTTTTT
+
ECFHEFDDFDFHCECCDFAGBDEBBGFCCGCEDECDGDBDDGCFBEIGGDCGBCEEFDDCECHDEDHHGBCCGGFCDFGHFDEGFEBGCE?CFCCDCH?BFBHEFCDDDFBC/EF@B?ABCD3DDDCDDC

In read_2 I have a barcode of 35bp that should identify each original molecule, like a UMI sequence.

@readID/2
TCCCTTTGCGTTCCCGTACAATGTATGCTGTCCCA
+
BG&=CC@<=;FE>A<:HECE+DEEGF3AF=FGDIC

These are the two steps that I was trying to run:

    mixcr align \
        -p rna-seq \
        -Xmx50g \
        --tag-pattern '^(R1:*)\^(UMI:N{35})' --tag-max-budget 0 \
        --threads 16 \
        --species hsa \
        -OsaveOriginalReads=true \
        -OallowPartialAlignments=true \
        -OvParameters.geneFeatureToAlign="VTranscriptWithout5UTRWithP" \
        --report ./${read1.simpleName}.align.report.txt \
        --json-report ./${read1.simpleName}.align.report.json \
        ${read1} ${read2} \
        ./${read1.simpleName}.vdjca
    mixcr assemble \
        -Xmx50g \
        -Passembler.maxConsensuses=1 \
        -OassemblingFeatures='CDR3' \
        -OseparateByJ=true \
        -OseparateByV=true \
        -a \
        --report ./${read1.simpleName}.assemble.report.txt \
        --json-report ./${read1.simpleName}.assemble.report.json \
        ${read1.simpleName}.vdjca \
        ${read1.simpleName}.clna

And then, extracting the results for downstream analyses with:

    mixcr exportClones \
        -Xmx50g \
        --dont-split-files \
        --prepend-columns \
        -topChains \
        -isotype primary \
        -tags Molecule \
        ${read1.simpleName}.clna ./${read1.simpleName}.contigs.tsv

Without -Passembler.maxConsensuses=1 I was able to get the results.

Thank you!

mizraelson commented 1 week ago

Try running:

 mixcr analyze generic-amplicon-with-umi \
    --species hsa \
    --rna \
    --tag-pattern '^(R1:*)\^(UMI:N{35})' \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    --assemble-clonotypes-by CDR3 \
      input_R1.fastq.gz \
      input_R2.fastq.gz \
      result

Also, the read you shared does not align to a human reference.

@readID/1
GTGGGACAAGAGGATCAGGGTTCCCGCCTCTCAGTACGTCAGCAGAGGCATGGCGACCTTATCAGTCCCTTTGCGTTCCCGTACAATGTATTGTCTTCCTAAGACTGCTGTCCCATTTTTTTTTTTTTTT
+
ECFHEFDDFDFHCECCDFAGBDEBBGFCCGCEDECDGDBDDGCFBEIGGDCGBCEEFDDCECHDEDHHGBCCGGFCDFGHFDEGFEBGCE?CFCCDCH?BFBHEFCDDDFBC/EF@B?ABCD3DDDCDDC
DavideMaspero commented 1 week ago

I run the pipeline following your suggestion. It actually solved the problem of having multiple clones associated to the same barcode (thank you!).

Unfortunately, with the parameters I used in the alignment steps, it was possible to align about 52% of reads. Instead, with the new parameters, I got only 5.49% of reads correctly aligned.

For your convenience, I am reporting the two reports.

Old parameters

Analysis date: Sun Oct 06 12:22:17 CEST 2024
Input file(s): read_1.fq.gz,read_2.fq.gz
Output file(s): ./results.alignments.vdjca
Version: 4.7.0; built=Wed Aug 07 21:19:48 CEST 2024; rev=976ba14139; lib=repseqio.v5.1
Command line arguments: align -p rna-seq --tag-pattern ^(R1:*)\^(UMI:N{35}) --tag-max-budget 0 --threads 32 --species hsa -OsaveOriginalReads=true -OallowPartialAlignments=true -OvParameters.geneFeatureToAlign=VTranscriptWithout5UTRWithP --report results.align.report.txt --json-report results.align.report.json read_1.fq.gz read_2.fq.gz ./results.alignments.vdjca
Analysis time: 54.88m
Total sequencing reads: 176460041
Successfully aligned reads: 91791377 (52.02%)
Coverage (percent of successfully aligned):
  CDR3: 8495967 (9.26%)
  FR3_TO_FR4: 17880 (0.02%)
  CDR2_TO_FR4: 15 (0%)
  FR2_TO_FR4: 28 (0%)
  CDR1_TO_FR4: 49 (0%)
  VDJRegion: 16 (0%)
Alignment failed: no hits (not TCR/IG?): 72721041 (41.21%)
Alignment failed: no CDR3 parts: 586983 (0.33%)
Alignment failed: low total score: 11360640 (6.44%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Overlapped and not aligned: 0 (0%)
Alignment-aided overlaps, percent of overlapped and aligned: 0 (NaN%)
No CDR3 parts alignments, percent of successfully aligned: 68089 (0.07%)
Partial aligned reads, percent of successfully aligned: 83227321 (90.67%)
Realigned with forced non-floating bound: 0 (0%)
Realigned with forced non-floating right bound in left read: 0 (0%)
Realigned with forced non-floating left bound in right read: 0 (0%)
TRA chains: 66551009 (72.5%)
TRA non-functional: 653314 (0.98%)
TRB chains: 25228858 (27.48%)
TRB non-functional: 1140339 (4.52%)
TRAD chains: 9885 (0.01%)
TRAD non-functional: 0 (0%)
IGH chains: 1624 (0%)
IGH non-functional: 47 (2.89%)
IGK chains: 1 (0%)
IGK non-functional: 0 (0%)
Trimming report:
  R1 reads trimmed left: 4598800 (2.61%)
  R1 reads trimmed right: 1597690 (0.91%)
  Average R1 nucleotides trimmed left: 0.08553564259910831
  Average R1 nucleotides trimmed right: 0.011795248307802444
Tag parsing report:
  Execution time: 0ns
  Total reads: 176460041
  Matched reads: 176460041 (100%)
  Projection +R1 +R2: 176460041 (100%)
  For variant 0:
    For projection +R1 +R2:
      R1:Left position: 0
      R1:Right position: 130
      UMI:Left position: 0
      UMI:Right position: 35
      Variants: 0
      Cost: 0
      R1 length: 130
      UMI length: 35
======================================

New parameters

Analysis date: Mon Oct 07 22:10:49 CEST 2024
Input file(s): read_1.fq.gz,read_2.fq.gz
Output file(s): result.alignments.vdjca
Version: 4.7.0; built=Wed Aug 07 21:19:48 CEST 2024; rev=976ba14139; lib=repseqio.v5.1
Command line arguments: align --report result.align.report.txt --json-report result.align.report.json --preset generic-amplicon-with-umi --save-output-file-names result.align.list.tsv --floating-left-alignment-boundary --floating-right-alignment-boundary C --rna --species hsa --tag-pattern ^(R1:*)\^(UMI:N{35}) --assemble-clonotypes-by CDR3 read_1.fq.gz read_2.fq.gz result.alignments.vdjca
Analysis time: 45.47m
Total sequencing reads: 176460041
Successfully aligned reads: 9695003 (5.49%)
Coverage (percent of successfully aligned):
  CDR3: 9456827 (97.54%)
  FR3_TO_FR4: 26 (0%)
  CDR2_TO_FR4: 42 (0%)
  FR2_TO_FR4: 56 (0%)
  CDR1_TO_FR4: 43 (0%)
  VDJRegion: 45 (0%)
Alignment failed: no hits (not TCR/IG?): 162965524 (92.35%)
Alignment failed: absence of V hits: 15720 (0.01%)
Alignment failed: absence of J hits: 3783794 (2.14%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Overlapped and not aligned: 0 (0%)
Alignment-aided overlaps, percent of overlapped and aligned: 0 (NaN%)
No CDR3 parts alignments, percent of successfully aligned: 1241 (0.01%)
Partial aligned reads, percent of successfully aligned: 236935 (2.44%)
Realigned with forced non-floating bound: 0 (0%)
Realigned with forced non-floating right bound in left read: 0 (0%)
Realigned with forced non-floating left bound in right read: 0 (0%)
TRA chains: 4723135 (48.72%)
TRA non-functional: 676005 (14.31%)
TRB chains: 4970603 (51.27%)
TRB non-functional: 1143093 (23%)
IGH chains: 1261 (0.01%)
IGH non-functional: 49 (3.89%)
IGK chains: 3 (0%)
IGK non-functional: 1 (33.33%)
IGL chains: 1 (0%)
IGL non-functional: 0 (0%)
Trimming report:
  R1 reads trimmed left: 4598800 (2.61%)
  R1 reads trimmed right: 1597690 (0.91%)
  Average R1 nucleotides trimmed left: 0.08553564259910831
  Average R1 nucleotides trimmed right: 0.011795248307802444
Tag parsing report:
  Execution time: 0ns
  Total reads: 176460041
  Matched reads: 176460041 (100%)
  Projection +R1 +R2: 176460041 (100%)
  For variant 0:
    For projection +R1 +R2:
      R1:Left position: 0
      R1:Right position: 130
      UMI:Left position: 0
      UMI:Right position: 35
      Variants: 0
      Cost: 0
      R1 length: 130
      UMI length: 35
======================================

I can confirm that those sequences come from human sample. It's a new approach, so we are probably observing a high level of noise.

Thank you again!

mizraelson commented 1 week ago

Hi, so the difference is because of the -OallowPartialAlignments=true. These alignments do not contain the full CDR3 sequence, only a part of it. When you preserve it on the align step these alignments will still be filtered out on the assembly due to the lack of the clonal feature. In the command I have shared they are filtered out right away.

DMaspero commented 6 days ago

Thank you for all your help, it has been very helpful!