milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

generic-amplicon preset #1301

Closed bshim181 closed 10 months ago

bshim181 commented 11 months ago

Based on the guide of generic-amplicon preset, the reads are constructed as follows.

[Screenshot 2023-08-10 at 11 50 56 AM: read layout from the generic-amplicon preset guide]

The reads that I am trying to process with MiXCR, however, look like this: R2 carries the UMI sequence and covers the C gene segment rather than part of the V segment.

R1 starts in the V segment and extends into the J segment. Is there a way to tweak the parameters so that the C segment can be recognized?

[Screenshot 2023-08-10 at 11 51 46 AM: read layout of the data being processed]

mizraelson commented 11 months ago

Hi, yes, of course. Could you please share the size of the UMI?

bshim181 commented 11 months ago

The UMI is 7 bp long, which is pretty short.

mizraelson commented 11 months ago

It's pretty short. I have read this paper before and contacted the authors about sharing the data (as far as I can tell, the raw data is not publicly available) so that I could tune the preset, but I never heard back from them. If you have raw data generated by this protocol that you can share with us (the same goes for the single-cell data from this publication), that would be of great help. Nevertheless, here is the command I suggest using, judging from the scheme:

mixcr analyze generic-amplicon-with-umi \
    --species hsa \
    --rna \
    --tag-pattern "^(R1:*)\(UMI:N{10})(R2:*)" \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
      input_R1.fastq.gz \
      input_R2.fastq.gz \
      result

Because the UMI is quite short, I suggest trying to include a few more letters from the TRBC/TRAC primer, which will at least increase the diversity two-fold.
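
To put rough numbers on why a 7 bp UMI is short, here is a minimal Python sketch. It assumes every base is equally likely at each UMI position and that molecules receive UMIs independently; the function names and the library size of 100,000 molecules are illustrative and not taken from the data. Bases borrowed from the TRBC/TRAC primer are not random, so they add far less than the factor of four per base that this idealized model gives.

```python
def umi_space(length: int) -> int:
    """Number of distinct UMIs of the given length over A/C/G/T."""
    return 4 ** length


def collision_probability(length: int, molecules: int) -> float:
    """Probability that a given molecule shares its UMI with at least one
    other molecule, assuming uniform and independent UMI assignment."""
    n = umi_space(length)
    return 1.0 - (1.0 - 1.0 / n) ** (molecules - 1)


if __name__ == "__main__":
    molecules = 100_000  # hypothetical library size, for illustration only
    for length in (7, 10, 15):
        print(length, umi_space(length),
              round(collision_probability(length, molecules), 4))
```

Under these assumptions, a 7 bp UMI offers only 16,384 distinct tags, so in any library with many more molecules than that, almost every molecule shares its UMI with another one.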

bshim181 commented 11 months ago

Hello, I do have bulk and single-cell data generated with this protocol. I will probably have to check with the data generator first, because our data is of clinical origin. I will get back to you once I have talked with the developer.

In terms of the bulk data, would a single pair of FASTQ files (R1 and R2) suffice? Also, for the single-cell data, would you need the FASTQ files for the entire batch (384 pairs in total)?

mizraelson commented 11 months ago

A single pair of files will be enough for our purposes. In the case of Single-cell analysis, it's better to see the full picture, as the filtering process includes all cells. If needed, we can provide a secure SFTP server for the data transfer.

Nevertheless, I recommend you try the commands suggested and we can see how well it worked, as these generic presets should cover most cases.

bshim181 commented 11 months ago

The command above throws the following error: "Could not invoke public final void com.milaboratory.mixcr.cli.AlignMiXCRMixins.floatingRightAlignmentBoundary(java.lang.String) with /jsimonlab/users/bshim/BMS-Bulk-Reads/BMS-61_S1_L001_R1_001.fastq.gz (java.lang.IllegalArgumentException: Unknown point: /jsimonlab/users/bshim/BMS-Bulk-Reads/BMS-61_S1_L001_R1_001.fastq.gz)"

Why might this be?

mizraelson commented 11 months ago

Please try the following:

mixcr analyze generic-amplicon-with-umi \
    --species hsa \
    --rna \
    --tag-pattern "^(R1:*)\(UMI:N{10})(R2:*)" \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
      input_R1.fastq.gz \
      input_R2.fastq.gz \
      result
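
One plausible reading of the error message, offered as an assumption rather than a diagnosis: the message shows that the value handed to --floating-right-alignment-boundary was the R1 file path, which is what happens when an option that accepts a value is followed directly by a positional argument. The sketch below reproduces that effect with Python's argparse as an analogy only; it is not MiXCR's actual command-line parser, and the `right_boundary` and `inputs` names are made up for the illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # An option whose value is optional: if a plain token follows the flag,
    # that token is consumed as the option's value.
    parser.add_argument("--floating-right-alignment-boundary",
                        nargs="?", const="default", dest="right_boundary")
    parser.add_argument("inputs", nargs="+")
    return parser


def parse(argv):
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    files = ["input_R1.fastq.gz", "input_R2.fastq.gz", "result"]
    # Flag without a value: the first file path is swallowed as the value.
    print(parse(["--floating-right-alignment-boundary"] + files))
    # Flag with an explicit value: all three positional arguments survive.
    print(parse(["--floating-right-alignment-boundary", "C"] + files))
```

If this reading applies, passing an explicit anchor point such as C right after the flag keeps the file paths from being interpreted as the boundary.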

bshim181 commented 11 months ago

I am also in the process of getting access to sample data for both single-cell and bulk libraries, which we can share with you. I will let you know as soon as possible.

mizraelson commented 10 months ago

Upon analyzing the bulk dataset, I see that, as expected, such a short UMI sequence leads to a high number of distinct clones within a single UMI group, which in some cases makes it hard to assemble a consensus. I tweaked the parameters in the example below to recover as many clones as possible.

mixcr analyze generic-amplicon-with-umi \
  --species hsa \
  --rna \
  --tag-pattern "^(R1:*)gaagcaga\^(UMI:N{11}) || ^(R1:*)taccagct\^(UMI:N{11})" \
  --floating-left-alignment-boundary \
  --floating-right-alignment-boundary C \
  -Massemble.consensusAssemblerParameters.assembler.maxIterations=10 \
  -Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.01 \
  -Massemble.consensusAssemblerParameters.assembler.minRecursiveRecordShare=0.01 \
  -Massemble.consensusAssemblerParameters.assembler.maxConsensuses=10 \
  input_R1_001.fastq.gz \
  input_R2_001.fastq.gz \
  output

Nevertheless, I strongly recommend using a longer UMI: in this case it doesn't really mark unique molecules, so de facto it is not a true UMI. Alternatively, you can analyze the data ignoring the UMI sequence. In that case there is no need for the R2 file at all, as it covers nothing but a portion of the C gene.

mixcr analyze generic-amplicon \
  --species hsa \
  --rna \
  --tag-pattern "^(R1:*)gaagcaga || ^(R1:*)taccagct" \
  --floating-left-alignment-boundary \
  --floating-right-alignment-boundary C \
  input_R1_001.fastq.gz \
  output

Sincerely, Mark

bshim181 commented 10 months ago

Could you explain a little about the tag pattern used here? Are the 8 bp sequences after ^(R1:*) index 1 and index 2 for every sample? What exactly do the 8 bp sequences represent here? Also, I assume the single-cell presets are still a work in progress?

mizraelson commented 10 months ago

In your R1 files the reads have the UMI and Illumina indices at the end. These 8 bp are a small part of the C gene at the very end of the payload sequence (most likely coming from the primer), which I use to trim the artificial barcode sequences. The single-cell preset is still a work in progress; I will get back to you with it later this week.
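
The trimming idea can be sketched with an ordinary regular expression. This is only an analogy for the tag pattern, not MiXCR's pattern engine: the two anchors are the 8 bp sequences from the command above, while the `trim_payload` helper, the non-greedy match (stop at the first anchor), and the toy read are assumptions made for illustration.

```python
import re

# The two 8 bp C-gene anchors used in the tag pattern, in upper case.
ANCHORS = ("GAAGCAGA", "TACCAGCT")

# Capture everything from the read start up to the first anchor.
# Stopping at the first occurrence is an assumption of this sketch.
PAYLOAD_RE = re.compile(r"^(.*?)(?:%s)" % "|".join(ANCHORS))


def trim_payload(read: str):
    """Return the part of the read before the anchor, or None if the
    read contains neither anchor."""
    match = PAYLOAD_RE.match(read.upper())
    return match.group(1) if match else None


if __name__ == "__main__":
    # Toy read: payload + anchor + artificial tail (UMI and index bases).
    toy_read = "TGTGCCAGCAGTTAC" + "GAAGCAGA" + "ACGTACGTACG" + "ATCACG"
    print(trim_payload(toy_read))
```

Everything after the anchor (the UMI and index bases in this toy read) is discarded, and reads without an anchor are reported as unmatched rather than passed through.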

jxshi commented 10 months ago

Hi @mizraelson,

I recently read a paper entitled "TCR sequencing and cloning methods for repertoire analysis and isolation of tumor-reactive TCRs". It introduces SEQTR, a TCR sequencing method for RNA extracted from T cells. The library structure is [UMI 9 bases][VDJ][C constant region], and the sequencing strategy is SE150. I downloaded the raw sequencing files from GEO and analyzed GSM7061297 (SRR23603384) with the following protocol:

# Step 1. Trim adaptor.
fastp -i SRR23603384_1.fastq.gz -o SRR23603384_trimmed_1.fastq.gz -w 8

# Step 2. Analyze the data with the first 10 bases assigned as the UMI.
# From the supplementary file of the paper, I learned that the 9-base UMI is HHHHHNNNN.
# I then calculated the frequency of G in the first 9 bases of the trimmed reads, and it turned
# out that the first base had a higher frequency of G. So I chose to use the first 10 bases
# as the UMI. Maybe I should have chosen 1 to 10 bases as UMI?

mixcr analyze generic-amplicon-with-umi \
    --threads 16 \
    --species hsa \
    --rna \
    --rigid-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    --tag-pattern '^(UMI:N{10})(R1:*)' \
    -Massemble.consensusAssemblerParameters.assembler.maxIterations=6 \
    -Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.02 \
    -Massemble.consensusAssemblerParameters.assembler.minRecursiveRecordShare=0.1 \
    -Massemble.consensusAssemblerParameters.assembler.maxConsensuses=6 \
    ../fastqs/SRR23603384_trimmed_1.fastq.gz \
    output

# Alternative Step 2. Ignore the UMI and run mixcr, trimming the first 10 bases.
# After reading this post and several other posts discussing UMIs, I think a 9-base UMI is too short.

mixcr analyze generic-amplicon \
    --threads 16 \
    --species hsa \
    --library imgt \
    --rna \
    --rigid-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    --tag-pattern '^N{10}(R1:*)' \
    ../fastqs/SRR23603384_trimmed_1.fastq.gz \
    noUMI
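
The base-composition check described in the Step 2 comments can be sketched as follows. In IUPAC notation H stands for A, C or T, so positions that belong to the HHHHH part of the UMI should contain almost no G; a position with a clear G signal is probably not part of that stretch. The function names and toy reads are illustrative and not from the original analysis.

```python
import gzip
from collections import Counter


def fastq_sequences(path):
    """Yield the sequence line of every record in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each 4-line record
                yield line.strip()


def position_base_frequencies(reads, n_positions):
    """For each of the first n_positions, return {base: fraction} computed
    over the reads long enough to cover that position."""
    counters = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions]):
            counters[i][base] += 1
    frequencies = []
    for counter in counters:
        total = sum(counter.values())
        frequencies.append(
            {base: count / total for base, count in counter.items()} if total else {}
        )
    return frequencies


if __name__ == "__main__":
    toy_reads = ["GACT", "GCCT", "TACT", "GA"]
    for position, freq in enumerate(position_base_frequencies(toy_reads, 3), 1):
        print(position, freq)
```

On real data, `position_base_frequencies(fastq_sequences(path), 10)` would give the per-position G fraction that the comment refers to.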

The QC output for Step 2 is:

  Successfully aligned reads:                           97.36% [OK]
  Off target (non TCR/IG) reads:                        0.27%  [OK]
  Reads with no V or J hits:                            2.36%  [OK]
  Reads with no barcode:                                0.0%   [OK]
  Alignments that do not cover CDR3:                    0.48%  [OK]
  Tag groups that do not cover CDR3:                    0.018% [OK]
  Barcode collisions in clonotype assembly:             86.56% [ALERT]
  Unassigned alignments in clonotype assembly:          53.29% [ALERT]
  Reads used in clonotypes:                             44.95% [ALERT]
  Alignments dropped due to low sequence quality:       1.75%  [OK]
  Clones dropped in post-filtering:                     0.0%   [OK]
  Alignments dropped in clones post-filtering:          0.0%   [OK]
  Reads dropped in tags error correction and filtering: 0.93%  [OK]
  UMIs artificial diversity eliminated:                 12.31% [OK]
  Reads dropped in UMI error correction and whitelist:  0.0%   [OK]
  Reads dropped in tags filtering:                      0.93%  [OK]

The QC output for Alternative Step 2 is:

  Successfully aligned reads:                     97.36% [OK]
  Off target (non TCR/IG) reads:                  0.32%  [OK]
  Reads with no V or J hits:                      2.31%  [OK]
  Reads used in clonotypes:                       95.62% [OK]
  Alignments that do not cover CDR3:              0.48%  [OK]
  Alignments dropped due to low sequence quality: 2.10%  [OK]
  Clones dropped in post-filtering:               0.0%   [OK]
  Alignments dropped in clones post-filtering:    0.0%   [OK]

Then I compared the TRB.tsv output files with the results published by the authors. I found that the most abundant clones differ by one amino acid. For example, the first five lines of the results published by the authors read:

#CDR3_sequence  Count   TRBV    TRBJ    Frame   CDR3_aaseq  CDR3_length
TGCGCCAGCAGCCAAGATTCCGATCCCCAGGGGCTGTTTGCGGGAAACACCATATATTTTGGA 174591  hTRBV04-3   hTRBJ01-3   IN  CASSQDSDPQGLFAGNTIYFG   21
TGTGCCAGCAGCCAAGGGACAGGACGGTCTTCACCCCTCCACTTTGGG    158629  hTRBV03-1   hTRBJ01-6   IN  CASSQGTGRSSPLHFG    16
TGTGCCAGCTCACCGACAGGGGAGGCCACTGAAGCTTTCTTTGGA   127792  hTRBV18 hTRBJ01-1   IN  CASSPTGEATEAFFG 15
TGCCAGCAGCTCTTAGCGCAATCCGTTCTTCGGG  87563   hTRBV21 hTRBJ02-1   OUT _   _
TGTGCCAGCAGTTTCCCGGATACGCAGTATTTTGGC    80302   hTRBV28 hTRBJ02-3   IN  CASSFPDTQYFG    12

For the results from Step 2, the first five lines read:

cloneId readCount       readFraction    uniqueMoleculeCount     uniqueMoleculeFraction  targetSequences targetQualities allVHitsWithScore       allDHitsWithScore       allJHitsWithScore   allCHitsWithScore        allVAlignments  allDAlignments  allJAlignments  allCAlignments  nSeqCDR3        minQualCDR3     aaSeqCDR3       refPoints
0       162907.0        0.031750372110455366    17152   0.027187463840134162    TGCGCCAGCAGCCAAGATTCCGATCCCCAGGGGCTGTTTGCGGGAAACACCATATATTTT    [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[ TRBV4-3*00(563.6)       TRBD1*00(30)    TRBJ1-3*00(458.2)       TRBC1*00(50.5)  347|365|384|0|18||180.0 16|22|36|27|33||30.0    24|42|70|42|60||180.0           TGCGCCAGCAGCCAAGATTCCGATCCCCAGGGGCTGTTTGCGGGAAACACCATATATTTT 58      CASSQDSDPQGLFAGNTIYF    :::::::::0:1:18:27:-4:-2:33:42:-4:60:::
3       82214.0 0.016023406561344676    8470    0.013425712379077446    TGTGCCAGCAGCCAAGGGACAGGACGGTCTTCACCCCTCCACTTT   [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[   TRBV3-1*00(521.2),TRBV3-2*00(520)    TRBD1*00(35)    TRBJ1-6*00(437.8)       TRBC1*00(141.5) 347|363|384|0|16||160.0;347|363|384|0|16||160.0 13|20|36|16|23||35.0    29|45|73|29|45||160.0           TGTGCCAGCAGCCAAGGGACAGGACGGTCTTCACCCCTCCACTTT        58      CASSQGTGRSSPLHF :::::::::0:-1:16:16:-1:-4:23:29:-9:45:::
1       81189.0 0.01582363533350783     10114   0.016031600354426127    TGTGCCAGCAGTTACGGGACAGTCTCTGGAAACACCATATATTTT   [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[   TRBV6-5*00(243.3)   TRBD1*00(35)     TRBJ1-3*00(498.5)       TRBC1*00(106.4) 347|362|384|0|15||150.0 12|19|36|15|22||35.0    20|42|70|23|45||220.0           TGTGCCAGCAGTTACGGGACAGTCTCTGGAAACACCATATATTTT   58  CASSYGTVSGNTIYF  :::::::::0:-2:15:15:0:-5:22:23:0:45:::
2       65077.0 0.01268342653067151     8582    0.013603242460123097    TGTGCCAGCAGTTACGTTGGGGGTGGCTACACCTTC    [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[    TRBV6-5*00(242.5)       TRBD1*00(25)    TRBJ1-2*00(408.6)        TRBC1*00(144.3) 347|362|384|0|15||150.0 18|23|36|18|23||25.0    27|40|68|23|36||130.0           TGTGCCAGCAGTTACGTTGGGGGTGGCTACACCTTC    58      CASSYVGGGYTF    :::::::::0:-2:15:18:-6:-1:23:23:-7:36:::
6       49634.0 0.009673604997516015    4385    0.00695061969093915     TGTGCCAGCAGTGACTGGGGGGGGCAGGGAGCTTTCTTT [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[ TRBV6-1*00(209.9)       TRBD1*00(31)    TRBJ1-1*00(378.3)        TRBC1*00(117.1) 347|361|384|0|14||140.0 12|21|36|20|29|SA15G|31.0       30|40|68|29|39||100.0           TGTGCCAGCAGTGACTGGGGGGGGCAGGGAGCTTTCTTT 58      CASSDWGGQGAFF   :::::::::0:-3:14:20:0:-3:29:29:-10:39:::

For the results from Alternative Step 2, the first five lines read:

cloneId readCount       readFraction    targetSequences targetQualities allVHitsWithScore       allDHitsWithScore       allJHitsWithScore       allCHitsWithScore       allVAlignments  allDAlignments       allJAlignments  allCAlignments  nSeqCDR3        minQualCDR3     aaSeqCDR3       refPoints
0       180598.0        0.01654797835251231     TGCGCCAGCAGCCAAGATTCCGATCCCCAGGGGCTGTTTGCGGGAAACACCATATATTTT    [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[    TRBV4-3*01(572.3)    TRBD1*01(30)    TRBJ1-3*01(459.8)       TRBC1*01(50.5),TRBC1*02(50.5),TRBC1*03(50.5)    270|288|307|0|18||180.0 16|22|36|27|33||30.0    24|42|70|42|60||180.0   ;;      TGCGCCAGCAGCCAAGATTCCGATCCCCAGGGGCTGTTTGCGGGAAACACCATATATTTT 58      CASSQDSDPQGLFAGNTIYF    :::::::::0:1:18:27:-4:-2:33:42:-4:60:::
1       163591.0        0.014989647319825477    TGTGCCAGCAGCCAAGGGACAGGACGGTCTTCACCCCTCCACTTT   [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[   TRBV3-1*01(536.4),TRBV3-2*01(535.8)     TRBD1*01(35) TRBJ1-6*02(438.3)       TRBC1*01(142.1),TRBC1*02(142.1),TRBC1*03(142.1) 270|286|307|0|16||160.0;270|286|307|0|16||160.0 13|20|36|16|23||35.0    29|45|73|29|45||160.0   ;;      TGTGCCAGCAGCCAAGGGACAGGACGGTCTTCACCCCTCCACTTT        58      CASSQGTGRSSPLHF :::::::::0:-1:16:16:-1:-4:23:29:-9:45:::
2       127701.0        0.011701089622222697    TGTGCCAGCTCACCGACAGGGGAGGCCACTGAAGCTTTCTTT      [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[      TRBV18*01(532)  TRBD1*01(40)    TRBJ1-1*01(438.9)    TRBC1*01(153.4),TRBC1*02(153.4),TRBC1*03(153.4) 273|287|310|0|14||140.0 14|22|36|14|22||40.0    24|40|68|26|42||160.0   ;;      TGTGCCAGCTCACCGACAGGGGAGGCCACTGAAGCTTTCTTT      58  CASSPTGEATEAFF   :::::::::0:-3:14:14:-2:-2:22:26:-4:42:::
3       121440.0        0.011127401693978311    TGTGCCAGCAGTTACGGGACAGTCTCTGGAAACACCATATATTTT   [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[   TRBV6-5*01(185.2),TRBV6-2*01(183.6),TRBV6-3*01(183.6)        TRBD1*01(35)    TRBJ1-3*01(499) TRBC1*01(110.7),TRBC1*02(110.7),TRBC1*03(110.7) 270|285|307|0|15||150.0;270|285|307|0|15||150.0;270|285|307|0|15||150.0 12|19|36|15|22||35.0    20|42|70|23|45||220.0   ;;      TGTGCCAGCAGTTACGGGACAGTCTCTGGAAACACCATATATTTT   58      CASSYGTVSGNTIYF :::::::::0:-2:15:15:0:-5:22:23:0:45:::
4       109079.0        0.009994778074583828    TGTGCCAGCAGTTACGTTGGGGGTGGCTACACCTTC    [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[    TRBV6-5*01(197.9)       TRBD1*01(25),TRBD2*01(25)       TRBJ1-2*01(408.4)    TRBC1*01(147.9),TRBC1*02(147.9),TRBC1*03(147.9) 270|285|307|0|15||150.0 18|23|36|18|23||25.0;25|30|48|18|23||25.0       27|40|68|23|36||130.0   ;;      TGTGCCAGCAGTTACGTTGGGGGTGGCTACACCTTC 58      CASSYVGGGYTF    :::::::::0:-2:15:18:-6:-1:23:23:-7:36:::

I truly value your expertise and insight in this matter and I believe your perspective could be of great help.

Best, Jianxiang

mizraelson commented 10 months ago

Hi,

You are right: a 9 bp UMI is quite short. As a result, about half of the reads are dropped because multiple CDR3s are assigned to the same UMI. Since the UMIs are attached to multiple V gene primers, a good workaround might be to include a few nucleotides right after the UMI, potentially increasing diversity. I'd recommend giving this a go: --tag-pattern "^(UMI:N{15})(R1:*)", or maybe even longer to capture the difference between primers. If that's not cutting it, let me know, and we can tinker with the parameters to try to save more reads. That said, a non-UMI approach is also a good choice, since MiXCR has very powerful error-correction algorithms even for data without barcodes.
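
A minimal sketch of why extending the tag into the primer region can help: two molecules that happen to carry the same 9-base UMI but were amplified with different V primers become distinguishable once the tag includes the bases after the UMI. The `split_tag` and `distinct_tags` helpers and the toy sequences are made up for illustration; the slicing only mimics what a pattern like ^(UMI:N{15})(R1:*) does and is not MiXCR code.

```python
def split_tag(read: str, tag_length: int):
    """Split a read into (tag, payload): the first tag_length bases and the
    rest. Returns None if the read is shorter than the tag."""
    if len(read) < tag_length:
        return None
    return read[:tag_length], read[tag_length:]


def distinct_tags(reads, tag_length: int) -> int:
    """Count distinct tags of the given length across the reads."""
    tags = set()
    for read in reads:
        parts = split_tag(read, tag_length)
        if parts is not None:
            tags.add(parts[0])
    return len(tags)


if __name__ == "__main__":
    # Same 9-base UMI, different (hypothetical) primer-derived bases after it.
    toy_reads = [
        "ACTACGTCA" + "GGATCC" + "TGTGCC",
        "ACTACGTCA" + "TTGACA" + "TGTGCC",
    ]
    print(distinct_tags(toy_reads, 9), distinct_tags(toy_reads, 15))
```

The trade-off is that the extra bases are also removed from the start of the sequence that gets aligned.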

As for the CDR3 discrepancy: in the paper they include an extra amino acid from FR4 (sourced from the J gene) within the CDR3. The reasoning behind this addition isn't entirely clear. While some researchers opt to exclude the initial and final amino acids from the CDR3 definition (e.g., IMGT), adding an extra one is a bit unusual. However, since this particular amino acid stems from the J gene, which both methods identify correctly, you can safely consider the clones equivalent.

For a quick comparison:

  • CASSQDSDPQGLFAGNTIYFG (from the paper)
  • CASSQDSDPQGLFAGNTIYF (MiXCR)

Check out this link, and you'll see that the terminal 'G' belongs to the FR4.
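
The equivalence can be checked mechanically. The sketch below uses the three in-frame CDR3 pairs that appear in both the published table and the MiXCR output in this thread; the `same_clone` helper and its one-residue rule are an illustration of the point above, not a MiXCR feature.

```python
# (CDR3 from the paper, CDR3 from MiXCR) for clones listed in this thread.
PAPER_VS_MIXCR = [
    ("CASSQDSDPQGLFAGNTIYFG", "CASSQDSDPQGLFAGNTIYF"),
    ("CASSQGTGRSSPLHFG", "CASSQGTGRSSPLHF"),
    ("CASSPTGEATEAFFG", "CASSPTGEATEAFF"),
]


def same_clone(paper_cdr3: str, mixcr_cdr3: str, extra: int = 1) -> bool:
    """True if the paper's CDR3 equals the MiXCR CDR3 plus `extra`
    additional C-terminal residues taken from FR4."""
    return (len(paper_cdr3) == len(mixcr_cdr3) + extra
            and paper_cdr3.startswith(mixcr_cdr3))


if __name__ == "__main__":
    for paper, mixcr in PAPER_VS_MIXCR:
        print(paper, mixcr, same_clone(paper, mixcr))
```

When matching full tables this way, it is worth comparing the V and J calls as well, since the extra residue alone does not identify a clone.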

jxshi commented 10 months ago

Hi,

Thank you for your clarification of the "G" amino acid shown in the results of the manuscript.

I ran the pipeline with both the first 15 bases and the first 25 bases set as the UMI. The results are slightly different. The results with the first 15 bases as the UMI are:

  Successfully aligned reads:                           97.51% [OK]
  Off target (non TCR/IG) reads:                        0.46%  [OK]
  Reads with no V or J hits:                            2.021% [OK]
  Reads with no barcode:                                0.0%   [OK]
  Alignments that do not cover CDR3:                    0.42%  [OK]
  Tag groups that do not cover CDR3:                    0.32%  [OK]
  Barcode collisions in clonotype assembly:             69.56% [ALERT]
  Unassigned alignments in clonotype assembly:          7.69%  [WARN]
  Reads used in clonotypes:                             85.67% [WARN]
  Alignments dropped due to low sequence quality:       6.13%  [OK]
  Clones dropped in post-filtering:                     0.0%   [OK]
  Alignments dropped in clones post-filtering:          0.0%   [OK]
  Reads dropped in tags error correction and filtering: 4.44%  [OK]
  UMIs artificial diversity eliminated:                 11.94% [OK]
  Reads dropped in UMI error correction and whitelist:  0.0%   [OK]
  Reads dropped in tags filtering:                      4.44%  [OK]

The results with the first 25 bases as the UMI are:

  Successfully aligned reads:                           97.16% [OK]
  Off target (non TCR/IG) reads:                        1.66%  [OK]
  Reads with no V or J hits:                            1.17%  [OK]
  Reads with no barcode:                                0.0%   [OK]
  Alignments that do not cover CDR3:                    0.087% [OK]
  Tag groups that do not cover CDR3:                    0.041% [OK]
  Barcode collisions in clonotype assembly:             63.57% [ALERT]
  Unassigned alignments in clonotype assembly:          5.76%  [WARN]
  Reads used in clonotypes:                             85.98% [WARN]
  Alignments dropped due to low sequence quality:       7.88%  [OK]
  Clones dropped in post-filtering:                     0.0%   [OK]
  Alignments dropped in clones post-filtering:          0.0%   [OK]
  Reads dropped in tags error correction and filtering: 5.87%  [WARN]
  UMIs artificial diversity eliminated:                 12.21% [OK]
  Reads dropped in UMI error correction and whitelist:  0.0%   [OK]
  Reads dropped in tags filtering:                      5.87%  [WARN]

When the first 10 bases are ignored using the aforementioned Alternative Step 2, the results are:

  Successfully aligned reads:                     97.36% [OK]
  Off target (non TCR/IG) reads:                  0.32%  [OK]
  Reads with no V or J hits:                      2.31%  [OK]
  Reads used in clonotypes:                       95.62% [OK]
  Alignments that do not cover CDR3:              0.48%  [OK]
  Alignments dropped due to low sequence quality: 2.10%  [OK]
  Clones dropped in post-filtering:               0.0%   [OK]
  Alignments dropped in clones post-filtering:    0.0%   [OK]

Should I try using more bases as the UMI, or should I just ignore the first 10 bases? Thank you very much! Best, Jianxiang

mizraelson commented 10 months ago

Actually, the 15 bp UMI looks much better: the number of unassigned alignments has dropped from 53% to 7.7%! Additionally, 85% of the reads are used in clonotype assembly.

I would suggest going with the 15bp UMI. While it's not perfect, it still allows you to leverage the UMI to correct the data effectively.
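
For reference, the comparison behind this recommendation can be laid out in a few lines. The percentages are copied from the QC reports posted in this thread; the dictionary keys and the `change` helper are names chosen for this sketch.

```python
# QC percentages copied from the reports in this thread.
QC = {
    "UMI 10 bp": {"unassigned_alignments": 53.29, "reads_used": 44.95,
                  "barcode_collisions": 86.56},
    "UMI 15 bp": {"unassigned_alignments": 7.69, "reads_used": 85.67,
                  "barcode_collisions": 69.56},
    "UMI 25 bp": {"unassigned_alignments": 5.76, "reads_used": 85.98,
                  "barcode_collisions": 63.57},
}


def change(metric: str, before: str, after: str) -> float:
    """Percentage-point change in a metric between two runs."""
    return QC[after][metric] - QC[before][metric]


if __name__ == "__main__":
    steps = (("UMI 10 bp", "UMI 15 bp"), ("UMI 15 bp", "UMI 25 bp"))
    for metric in ("unassigned_alignments", "reads_used", "barcode_collisions"):
        for before, after in steps:
            print(f"{metric}: {before} -> {after}: "
                  f"{change(metric, before, after):+.2f} points")
```

Most of the gain comes from the step from 10 to 15 bases; going on to 25 bases changes the share of reads used in clonotypes by well under one percentage point.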

jxshi commented 10 months ago

Thank you for your clarification! Best, Jianxiang