ohsu-comp-bio / tcrseq_normalization


MiXCR Parameters #11

Open weshorton opened 8 years ago

weshorton commented 8 years ago

Summary

MiXCR performs an alignment step and an assembly step while identifying clonotypes. During assembly, it uses a clustering method to correct for PCR and sequencing errors and build accurate clonotype counts. Is this method appropriate, and how does it relate to depth of coverage for unique sequences? Refer to the markdown for a more detailed explanation of the alignment and assembly steps.

Significance

We need an accurate proxy for depth of coverage in order to determine how T-cell concentration influences our results (#10).

To Do

  1. Review relevant MiXCR info, outlined below
  2. Provide recommendation for how to use clonotype counts as read depth proxy
    • Use as direct replacement
    • Adjust count based on scaling factor derived from clustering method

Approach

  3. Review current parameters and summarize
    • See comment below
  4. Assess current clustering method - what happens to our data as we run it through MiXCR
    • Produce table of important values to compare - see here
      • Initial fastq read count
      • Total reads aligned based on V and J segments
      • Total clonotype count after assembly
      • Add as more are determined.
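
The comparison table described above could be assembled with a short script. This is only a minimal sketch, not the pipeline's actual code: `count_fastq_reads` assumes uncompressed FASTQ files with exactly four lines per record, and the aligned-read and clonotype counts are assumed to be copied by hand from the MiXCR align and assemble reports (nothing here parses those reports). All function and column names are hypothetical.

```python
from pathlib import Path


def count_fastq_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file (assumes 4 lines per record)."""
    with Path(fastq_path).open() as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def summarize_sample(sample, total_reads, aligned_reads, clonotype_count):
    """Build one row of the comparison table.

    aligned_reads and clonotype_count are assumed to be taken from the
    MiXCR align/assemble reports by hand; this function only does arithmetic.
    """
    pct_aligned = 100.0 * aligned_reads / total_reads if total_reads else 0.0
    return {
        "sample": sample,
        "total_reads": total_reads,
        "aligned_reads": aligned_reads,
        "pct_aligned": round(pct_aligned, 1),
        "clonotypes": clonotype_count,
    }


def format_table(rows):
    """Render the rows as tab-separated text with a header line."""
    columns = ["sample", "total_reads", "aligned_reads", "pct_aligned", "clonotypes"]
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(str(row[c]) for c in columns))
    return "\n".join(lines)
```

New columns can be added to `summarize_sample` and `format_table` as more values of interest are identified.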
weshorton commented 8 years ago

MiXCR Review

Submit Parameters

align --loci TRB --species mmu --report align_report.txt input.fastq output.vdjca
assemble --report assemble_report.txt input.vdjca output.clns

Align

Summary

Notable parameters we can change

Assemble

Summary

Notable parameters we can change

Overall To Do: summarize notable QC outputs, possibly change parameters and compare change in QC outputs.

weshorton commented 8 years ago

Example of Adaptive's primer specificity analysis. This is a potentially useful experiment for determining why most unaligned reads fail for lack of a J region. See the Align section of the markdown.

[Image: adaptive_specificity]
weshorton commented 8 years ago

Update from 5/11 Working Session

Align

  1. Does MiXCR fail to recognize J regions because they are read in the reverse direction and MiXCR is reading from forward?
    • No, it can recognize in both directions
  2. Extract reads that fail to align, re-run through pipeline with relaxed parameters. Add results to markdown
  3. Make align procedure check for D segment prior to successful alignment
    • Currently not an option
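
Item 2 above (pulling out the reads that failed to align so they can be re-run) could be sketched as follows. This is a hedged example under stated assumptions: it assumes we already have a set of read names that aligned successfully (how that set is obtained from MiXCR output is not shown here), and that the FASTQ uses plain four-line records. The function names are hypothetical.

```python
def iter_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops iteration
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def extract_unaligned(fastq_in, fastq_out, aligned_ids):
    """Copy reads whose ID is NOT in aligned_ids to fastq_out; return how many.

    aligned_ids is an assumed, externally supplied set of read names
    (the header up to the first whitespace, without the leading '@').
    """
    kept = 0
    for header, seq, plus, qual in iter_fastq(fastq_in):
        parts = header[1:].split()
        read_id = parts[0] if parts else ""
        if read_id not in aligned_ids:
            fastq_out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
    return kept
```

Both arguments are open text handles, so the same function works on files or in-memory buffers; the resulting FASTQ can then be passed back through the align step with relaxed parameters.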

Assemble

  1. Confirm how CDR3 regions are determined during clonotype identification
    • Waiting for response from developers
  2. Look at successfully aligned reads that fail to assemble to figure out cause
    • Update with concrete areas to analyze after output is generated.
leyshock commented 8 years ago

5/31/2016 email response from MiXCR development team:

We have checked your data files and found that there are two main reasons why there are about 50% of dropped reads:

  1. it looks like that there is about 50% contamination by sequences of Cyprinus carpio (we have BLASTed a few reads that were not aligned and found that almost all of them align to Cyprinus carpio)
  2. the mmu library does not look perfectly enriched by CDR3 containing regions; there is some contamination by other genomic sequences.

In general, it seems that there are some problems with library preparation protocol that should be addressed on the wet lab side.

Assembled clones look very odd but seem to be aligned correctly (too long VJ insertions and too many out of frame clones): I only saw something similar in the analysis of thymus derived samples.

Additionally, we recommend to add the following option on the align step in order to increase selectivity of alignments for such contaminated case:

mixcr align -OvParameters.parameters.floatingLeftBound=false ...
weshorton commented 8 years ago

Update from 6/1 Working Session

  1. 160107 is the worst of our batches. See here
  2. Pretty alignment exports are located on box
    • BLAT alignments from different sources and determine if MiXCR is correctly aligning them.
  3. MiXCR thresholds do not seem to make sense and require further investigation
    • Default minimum alignment score for J region is 40
    • Some reads failed to align due to no J hit; we re-ran them with a minimum score of 35 and rescued them, but their score was 115
      • With a score of 115, they should have aligned the first time under the default minimum of 40
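
The inconsistency in item 3 can be checked systematically. The sketch below flags reads that were only recovered by the relaxed run even though their reported J score already meets the default minimum of 40. It assumes the reported score is directly comparable to the threshold, which is exactly what is in question, and it takes a hypothetical mapping of read ID to J score as input (building that mapping from MiXCR output is not shown).

```python
DEFAULT_J_MIN_SCORE = 40  # default minimum J alignment score noted above
RELAXED_J_MIN_SCORE = 35  # relaxed minimum used for the re-run


def unexpected_rescues(rescued_scores, default_min=DEFAULT_J_MIN_SCORE):
    """Return IDs of rescued reads whose J score already meets the default minimum.

    rescued_scores: dict mapping read ID -> J alignment score reported by
    the relaxed run (hypothetical input, assembled from the export by hand).
    Reads listed here should, in principle, have aligned in the default run.
    """
    return sorted(
        read_id
        for read_id, score in rescued_scores.items()
        if score >= default_min
    )
```

A rescued read scoring 115 would be returned by this check, while one scoring between 35 and 39 would not, since the latter is the only kind of read the relaxed threshold is expected to recover.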
weshorton commented 8 years ago

BLAT results of MiXCR exportAlignPretty

Batch 160107 with j absolute min set to 35

Note: All reads in this data set failed the default alignment run

Sample 2
  1. Read ID: 3404
    A) V and J alignments correspond to MiXCR output
    B) V is beginning of gene and J is end, chromosome 6 (positive strand)
    C) In-between aligns to many sequences, one with low score is on the same chromosome, but appears to be a different gene
    D) Interpretation: off-target amplification. Says has D hit.
  2. Read ID: 3405
    A) No BLAT matches for V alignment (MiXCR says 13-1)
    B) J alignment matches
    C) In-between has no BLAT matches.
    D) Off-target amplification? Why do we have no hit for V primer?
  3. Read ID: 3410
    A) V aligns to random gene on positive strand of chromosome 15 (Gtse1)
    B) J alignment matches
    C) Beginning of sequence up to V BLAT matches J2-2
    D) Off-target amplification as well as forward priming by a J primer. Says has D hit.
  4. Read ID: 3416
    A) V matches to positive strand chromosome 6
    B) J matches to positive strand chromosome 6
    C) In between matches dozens of results, one is on negative strand of chr 6, but different gene
    D) Another off-target amplification? Says has D hit.

Summary: All of these examples appear to be off-target amplification. The V and J alignments are only 18-25 base pairs long, suggesting that only the primers are aligning.

Sample 20
  1. Read ID: 2952
    A) No matches to V alignment
    B) J matches
    C) BLAT between V and D matches random gene on positive strand chromosome 3, dozens of matches if extend to entire region between V and J
    D) More off-target amplification. What's up with V alignment though? Says has D hit.
  2. Read ID: 2951
    A) V alignment is too short, extended further in either direction and matches to random gene. Beginning of read to end of V alignment matches same gene
    B) J alignment has no matches
    C) Between V and J matches to negative strand chromosome 15, Gtse1
    D) Another off-target amplification, but even the primers aren't matching this time. Says has D hit.
  3. Read ID: 2939
    A) V alignment has no matches
    B) Beginning up to V alignment has 5 matches, one is for J2-5
    C) J is too short, but extend to left and hits J1-6
    D) Possibly a primer-dimer of two J's. Sequence is a little long for that though.
  4. Read ID: 2936
    A) V matches
    B) J matches
    C) In between matches Alp1 and RP23-291B1
    D) More off-target amplification. Only primers align, between is random hit.

Summary: More of the same. Only the length of the primer is matching, and nothing else.

Batch 151124, standard parameters

Sample 1
  1. Read ID: 1134
    A) V matches and is much longer than the primer (120 bases)
    B) J matches and is slightly longer than primer (32 bases)
    C) Two base pairs each between V-D alignments and D-J alignments
    D) Looks like a solid alignment to a true TCR
  2. Read ID: 1136
    A) V is too short, beginning of read to end of V alignment matches J1-6
    B) J aligns to MiXCR result
    C) Between V and J matches Gtse1 on negative strand of chromosome 15
    D) Off-target amplification and forward priming by J primer. Says has a D hit.
  3. Read ID: 1132
    A) V matches, but is about length of primer
    B) J matches, but is about length of primer
    C) In between matches to many sequences, one is on negative strand of chromosome 6, but different gene
    D) Off-target amplification. Also says has a D hit.

Sample 10
  1. Read ID: 1096
    A) V matches and is longer than primer (154 bp)
    B) J matches and is longer than primer (51 bp)
    C) 1 bp between V and D, none between D and J
    D) Looks like another example of a good alignment
  2. Read ID: 1093
    A) V matches
    B) J matches
    C) Between V and D, and D and J both match to Rbfox3 on negative strand chromosome 11
    D) Off-target amplification again. Why does MiXCR say there is a D region?
  3. Read ID: 1090
    A) V matches
    B) J matches
    C) Between V and D aligns to Ubn2 on chromosome 6 (-), between D and J aligns to chromosome 2 (+). Together they align to dozens of hits
    D) Another off-target amplification. This one also supposedly has a D hit.
  4. Read ID: 1088
    A) V matches
    B) J matches
    C) Between V and J matches many things, one is on negative strand chromosome 6, but different gene
    D) Off-target amplification. Also supposedly has a D region.

Summary: More off-target amplification. Only the primers are aligning and nothing else.

Summary

Looks like quite a bit of off-target amplification. A few J primers may have forward priming ability as well. Hopefully the new PCR protocol will take care of a lot of this. These results also raise more questions about what MiXCR is doing. Why is it saying that there are D hits, when that sequence actually aligns to a completely unrelated gene?

Another thing to note is that this is a relatively small sample of our data. I'm going to look into ways to use the tab-separated outputs to try and quantify how many alignments to V and J are actually just aligning to the primer.
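
One way to approach that quantification is sketched below. It is only an illustration under several assumptions: the tab-separated export is assumed to have already been reduced to one row per read with integer V and J alignment lengths in columns named `v_length` and `j_length` (hypothetical names; the actual MiXCR export columns would need converting first), and 25 bp is used as the primer-length cutoff because the primer-only alignments seen above were 18-25 bp long.

```python
import csv

PRIMER_MAX_LEN = 25  # assumption: primer-only alignments above were 18-25 bp


def classify_alignments(tsv_handle, v_col="v_length", j_col="j_length",
                        primer_max_len=PRIMER_MAX_LEN):
    """Count reads whose V and/or J alignment is no longer than a primer.

    tsv_handle: open text handle to a tab-separated file with a header row.
    v_col / j_col: hypothetical column names holding alignment lengths in bp.
    """
    counts = {"both_primer_only": 0, "one_primer_only": 0, "neither": 0}
    labels = ("neither", "one_primer_only", "both_primer_only")
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        # Number of segments (0, 1, or 2) at or below the primer-length cutoff.
        n_short = sum(int(row[col]) <= primer_max_len for col in (v_col, j_col))
        counts[labels[n_short]] += 1
    return counts
```

Reads in the `both_primer_only` group are the candidates for off-target amplification of the kind described above; the cutoff should be adjusted to the actual primer lengths.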

weshorton commented 8 years ago

Link to brief summary, and links to papers, of a few alternative TCR analysis programs

weshorton commented 8 years ago

Alignment Length Distributions

See the alignment length report for analysis of V, J, and total alignment lengths in the equivolume 151124 batch.

Based on the report, we should not implement a size selection during library preparation. The report also suggests that MiXCR is doing its job, in the sense that it successfully assembles all of the alignments that are true CDR3 sequences.