weshorton commented 8 years ago

Summary

MiXCR performs an alignment as well as an assembly step during its process of identifying clonotypes. During assembly, a clustering method is utilized to attempt to overcome PCR and sequencing errors and build accurate counts of clonotypes. Is this method appropriate and how does it relate to depth of coverage for unique sequences? Refer to markdown for more detailed explanation of alignment and assembly steps.

Significance

We need an accurate proxy for depth of coverage in order to determine how T-cell concentration influences our results (#10)

To Do

Review relevant MiXCR info, outlined below
Provide recommendation for how to use clonotype counts as read depth proxy
- Use as direct replacement
- Adjust count based on scaling factor derived from clustering method

Approach

Review current parameters and summarize
- See comment below
Assess current clustering method - what happens to our data as we run it through MiXCR
- Produce table of important values to compare - see here
  - Initial fastq read count
  - Total reads aligned based on V and J segments
  - Total clonotype count after assembly
  - Add as more are determined.
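
A minimal sketch of how one row of that comparison table could be put together. It assumes plain or gzipped FASTQ with the usual four lines per record; the aligned-read and clonotype numbers would be copied from the MiXCR reports by hand. Function names and the example numbers are hypothetical.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming 4 lines per record."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def summarize_sample(sample, fastq_reads, aligned_reads, clonotype_count):
    """Build one table row; frac_aligned is relative to the raw read count."""
    return {
        "sample": sample,
        "fastq_reads": fastq_reads,
        "aligned_reads": aligned_reads,
        "clonotype_count": clonotype_count,
        "frac_aligned": aligned_reads / fastq_reads if fastq_reads else 0.0,
    }


# Hypothetical numbers, only to show the shape of a row
row = summarize_sample("S1", fastq_reads=200, aligned_reads=100, clonotype_count=40)
print(row)
```

More columns can be added to the dict as further values are identified.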

weshorton commented 8 years ago

MiXCR Review

Submit Parameters

align --loci TRB --species mmu --report align_report.txt input.fastq output.vdjca
assemble --report assemble_report.txt input.vdjca output.clns
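
For scripting the two steps, something like the sketch below could build the argument lists. It uses only the flags shown above; the `mixcr` executable name follows the developers' email later in this thread, and the file names and the helper name are placeholders. The point it makes explicit is that the `.vdjca` file written by align is the input to assemble.

```python
import shlex


def mixcr_commands(fastq, vdjca, clns, loci="TRB", species="mmu"):
    """Return (align, assemble) argument lists for the two-step run."""
    align = [
        "mixcr", "align",
        "--loci", loci,
        "--species", species,
        "--report", "align_report.txt",
        fastq, vdjca,
    ]
    # assemble consumes the .vdjca file that align just wrote
    assemble = [
        "mixcr", "assemble",
        "--report", "assemble_report.txt",
        vdjca, clns,
    ]
    return align, assemble


align_cmd, assemble_cmd = mixcr_commands("input.fastq", "output.vdjca", "output.clns")
print(shlex.join(align_cmd))
print(shlex.join(assemble_cmd))
```

Each list can be handed to `subprocess.run` as-is once the paths point at real files.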

Align

Summary

Align raw sequencing reads to reference V, D, J and C genes of T-cell receptor
- Uses GenBank reference library, although IMGT library can be used
QC information located in summary and analysis file for each batch (eqvuivol_DNA160107)
- shows total reads from file, successfully aligned reads, and more
- tells why read was unaligned - absence of V or J hit, or low total score
- What does this mean for us?
- Are these non-TCR sequences?
- Off-target amplification?
- do we need to tweak alignment parameters?
To Do: summarize aligned/unaligned reads, and the reasons for failure, in a table; produce boxplots of the distributions. Located here

Notable parameters we can change

--diff-loci accepts alignments with different loci of V and J genes.
- Don't think we want this - will align to non-TCRB loci I believe
can change minimum scoring for alignments as well
can change the features to align to

Assemble

Summary

Extract specific gene regions (CDR3) from alignments and build a set of clones
- If alignment has CDR3 region (clonal sequence), it's kept
- Defers reads with at least one bad quality nucleotide and drops reads in which more than a 0.7 fraction (70%) of nucleotides are bad quality
- core clonotypes built by aggregating equivalent clonal sequences and summing counts
- deferred clonotypes are mapped to core clonotypes where possible
- cluster similar clonotypes together, levels in cluster based on clonotype abundance
- align heads of clusters back to V, D, and J genes
QC information located in summary and analysis file for each batch (eqvuivol_DNA160107)
- Reads used, reads for core clonotypes, reads deferred and mapped, reads dropped, etc.
To Do: summarize read usage in table, produce boxplots of distributions
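
To make the read-usage bookkeeping concrete, here is a toy sketch of the core/deferred/dropped split and the core clonotype counts. This is not MiXCR's code: the two cutoffs are assumptions passed in as parameters and should be set to whatever the run actually used, and the mapping of deferred reads back onto core clonotypes is left out.

```python
from collections import Counter


def classify_read(qualities, bad_quality_threshold=20, max_bad_fraction=0.7):
    """Return 'core', 'deferred', or 'dropped' for one clonal sequence.

    Both cutoffs are assumed values, not read from our reports.
    """
    n_bad = sum(q < bad_quality_threshold for q in qualities)
    if n_bad == 0:
        return "core"
    if n_bad / len(qualities) > max_bad_fraction:
        return "dropped"
    return "deferred"


def build_core_clonotypes(reads):
    """reads: iterable of (cdr3_sequence, per-base Phred qualities)."""
    core = Counter()   # identical clonal sequences are summed
    usage = Counter()  # how many reads ended up in each category
    for seq, quals in reads:
        status = classify_read(quals)
        usage[status] += 1
        if status == "core":
            core[seq] += 1
    return core, usage


# Hypothetical reads, only to exercise the three categories
reads = [
    ("CASS", [30, 30, 30, 30]),
    ("CASS", [30, 30, 30, 30]),
    ("CASS", [30, 10, 30, 30]),  # one bad base: deferred
    ("CAWS", [5, 5, 5, 30]),     # mostly bad bases: dropped
]
core, usage = build_core_clonotypes(reads)
print(core, usage)
```

The `usage` counter has the same categories as the assemble report, which should make it easy to line up against the real numbers.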

Notable parameters we can change

addReadsCountOnClustering - by default, the final count of each cluster is the count of its head clonotype alone; can switch to aggregating the counts of all children clones, so the final count is the sum of all counts within a cluster
allowedMutationsInNRegions - defaults to allowing 1 mutation in the N regions of clonal sequences for clustering. Could decrease to zero to avoid clustering, I think.
clusteringFilter.specificMutationProbability controls relative counts between levels in a cluster
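
The effect of the count-aggregation choice on the final numbers can be illustrated with a toy example. The cluster layout, sequence, and counts below are made up, and the function is only a sketch of the two behaviours described above, not MiXCR's implementation.

```python
def final_cluster_counts(clusters, add_children=False):
    """clusters: dict of head_sequence -> (head_count, [child_counts]).

    add_children=False keeps only the head clonotype's count;
    add_children=True sums the head and all of its children.
    """
    result = {}
    for head, (head_count, child_counts) in clusters.items():
        extra = sum(child_counts) if add_children else 0
        result[head] = head_count + extra
    return result


# Hypothetical cluster: one abundant head with two low-count children
clusters = {"CASSLGQ": (100, [3, 1])}
print(final_cluster_counts(clusters))                     # head count only
print(final_cluster_counts(clusters, add_children=True))  # head plus children
```

If clonotype counts are going to stand in for read depth, the gap between these two numbers per cluster is the quantity a scaling factor would have to account for.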

Overall To Do: summarize notable QC outputs, possibly change parameters and compare change in QC outputs.

weshorton commented 8 years ago

Example of Adaptive's primer specificity analysis. Potentially useful experiment for determining why most unaligned reads are due to lack of J region. See Align section of markdown

weshorton commented 8 years ago

Update from 5/11 Working Session

Align

Does MiXCR fail to recognize J regions because they are read in the reverse direction while MiXCR reads only in the forward direction?
- No, it can recognize in both directions
Extract reads that fail to align, re-run through pipeline with relaxed parameters. Add results to markdown
Make align procedure check for D segment prior to successful alignment
- Currently not an option

Assemble

Confirm how CDR3 regions are determined during clonotype identification
- Waiting for response from developers
Look at successfully aligned reads that fail to assemble to figure out cause
- update with concrete areas to analyze after output is generated.

leyshock commented 8 years ago

5/31/2016 email response from MiXCR development team:

We have checked your data files and found that there are two main reasons why there are about 50% of dropped reads:

it looks like that there is about 50% contamination by sequences of Cyprinus carpio (we have BLASTed a few reads that were not aligned and found that almost all of them align to Cyprinus carpio)

the mmu library does not look perfectly enriched by CDR3 containing regions; there is some contamination by other genomic sequences.

In general, it seems that there are some problems with library preparation protocol that should be addressed on the wet lab side.

Assembled clones look very odd but seems to be aligned correctly (too long VJ insertions and too many out of frame clones): I only saw something similar in the analysis of thymus derived samples.

Additionally, we recommend to add the following option on the align step in order to increase selectivity of alignments for such contaminated case:
mixcr align -OvParameters.parameters.floatingLeftBound=false ...

weshorton commented 8 years ago

Update from 6/1 Working Session

160107 is the worst of our batches. See here
Pretty alignment exports are located on box
- BLAT alignments from different sources and determine if MiXCR is correctly aligning them.
MiXCR thresholds do not seem to make sense and require further investigation
- Default minimum alignment score for J region is 40
- Some reads failed to align due to no J hit; re-running them with a minimum score of 35 rescued them, but their score was 115
  - They should have been aligned the first time
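
One way to see how widespread this is: collect the J scores of the rescued reads and flag any that already meet the default cutoff. The sketch below assumes the scores have been pulled into a dict by read ID; the read IDs are placeholders, and only the cutoff of 40 and the score of 115 come from the notes above.

```python
def flag_unexpected_rescues(rescued_scores, default_min_score=40):
    """Return IDs of rescued reads whose J score already meets the default cutoff.

    rescued_scores: dict of read_id -> J alignment score from the relaxed run.
    """
    return sorted(
        read_id
        for read_id, score in rescued_scores.items()
        if score >= default_min_score
    )


# Placeholder read IDs; 115 is the score seen in the relaxed run
rescued = {"read_a": 115, "read_b": 37}
print(flag_unexpected_rescues(rescued))
```

Reads that come back from this check are the ones to raise with the developers, since a lower threshold alone does not explain why they were rescued.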

weshorton commented 8 years ago

BLAT results of MiXCR exportAlignPretty

Batch 160107 with j absolute min set to 35

Note: All reads in this data set failed the default alignment run

Sample 2

Read ID: 3404
A) V and J alignments correspond to MiXCR output
B) V is beginning of gene and J is end, chromosome 6 (positive strand)
C) In-between aligns to many sequences, one with low score is on the same chromosome, but appears to be a different gene
D) Interpretation: off-target amplification. Says has D hit.
Read ID: 3405
A) No BLAT matches for V alignment (MiXCR says 13-1)
B) J alignment matches
C) In-between has no BLAT matches.
D) Off-target amplification? Why do we have no hit for V primer?
Read ID: 3410
A) V aligns to random gene on positive strand of chromosome 15 (Gtse1)
B) J alignment matches
C) Beginning of sequence up to V BLAT matches J2-2
D) Off-target amplification as well as forward priming by a J primer. Says has D hit.
Read ID: 3416
A) V matches to positive strand chromosome 6
B) J matches to positive strand chromosome 6
C) In between matches dozens of results, one is on negative strand of chr 6, but different gene
D) Another off-target amplification? Says has D hit.

Summary: All of these examples appear to be off-target amplification. The V and J alignments are only 18-25 base pairs long, suggesting that only the primers are aligning.

Sample 20

Read ID: 2952
A) No matches to V alignment
B) J matches
C) BLAT between V and D matches random gene on positive strand chromosome 3, dozens of matches if extend to entire region between V and J
D) More off-target amplification. What's up with V alignment though? Says has D hit.
Read ID: 2951
A) V alignment is too short, extended further in either direction and matches to random gene. Beginning of read to end of V alignment matches same gene
B) J alignment has no matches
C) Between V and J matches to negative strand chromosome 15, Gtse1
D) Another off-target amplification, but even the primers aren't matching this time. Says has D hit.
Read ID: 2939
A) V alignment has no matches
B) Beginning up to V alignment has 5 matches, one is for J2-5
C) J is too short, but extend to left and hits J1-6
D) Possibly a primer-dimer of two J's. Sequence is a little long for that though.
Read ID: 2936
A) V matches
B) J matches
C) In between matches Alp1 and RP23-291B1
D) More off-target amplification. Only primers align, between is random hit.

Summary: More of the same. Only the length of the primer is matching, and nothing else.

Batch 151124, standard parameters

Sample 1

Read ID: 1134
A) V matches and is much longer than the primer (120 bases)
B) J matches and is slightly longer than primer (32 bases)
C) Two base pairs each between V-D alignments and D-J alignments
D) Looks like a solid alignment to a true TCR
Read ID: 1136
A) V is too short, beginning of read to end of V alignment matches J1-6
B) J aligns to MiXCR result
C) Between V and J matches Gtse1 on negative strand of chromosome 15
D) Off-target amplification and forward priming by J primer. Says has a D hit
Read ID: 1132
A) V matches, but is about length of primer
B) J matches, but is about length of primer
C) In between matches to many sequences, one is on negative strand of chromosome 6, but different gene
D) Off-target amplification. Also says has a D hit.

Sample 10

Read ID: 1096
A) V matches and is longer than primer (154 bp)
B) J matches and is longer than primer (51 bp)
C) 1 bp between V and D, none between D and J
D) Looks like another example of a good alignment
Read ID: 1093
A) V matches
B) J matches
C) Between V and D, and D and J both match to Rbfox3 on negative strand chromosome 11
D) Off-target amplification again. Why does MiXCR say there is a D region?
Read ID: 1090
A) V matches
B) J matches
C) Between V and D aligns to Ubn2 on chromosome 6 (-), between D and J [aligns] to chromosome 2 (+). And together they align to dozens of hits
D) Another off-target amplification. This one also supposedly has a D hit.
Read ID: 1088
A) V matches
B) J matches
C) Between V and J match many things, one is on negative strand chromosome 6, but different gene
D) Off-target amplification. Also supposedly has a D region.

Summary: More off-target amplification. Only the primers are aligning and nothing else.

Summary

Looks like quite a bit of off-target amplification. A few J primers may have forward priming ability as well. Hopefully the new PCR protocol will take care of a lot of this. These results also raise more questions about what MiXCR is doing. Why is it saying that there are D hits, when that sequence actually aligns to a completely unrelated gene?

Another thing to note is that this is a relatively small sample of our data. I'm going to look into ways to use the tab-separated outputs to try and quantify how many alignments to V and J are actually just aligning to the primer.
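
A rough sketch of what that quantification could look like, assuming the V and J alignment lengths have already been extracted into a tab-separated table. The column names `v_len` and `j_len` are hypothetical (they are not MiXCR export column names), and the 25 bp cutoff is taken from the 18-25 bp primer-length alignments noted above.

```python
import csv
import io


def fraction_primer_only(rows, v_col="v_len", j_col="j_len", max_primer_len=25):
    """Fraction of reads whose V and J alignments are both primer-length or shorter."""
    n_total = 0
    n_primer_only = 0
    for row in rows:
        n_total += 1
        if int(row[v_col]) <= max_primer_len and int(row[j_col]) <= max_primer_len:
            n_primer_only += 1
    return n_primer_only / n_total if n_total else 0.0


# Lengths for reads 1134 and 1096 are from the notes above;
# the third row is a made-up primer-length example.
example_tsv = "read_id\tv_len\tj_len\n1134\t120\t32\n1096\t154\t51\nexample\t20\t22\n"
rows = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
frac = fraction_primer_only(rows)
print(f"{frac:.2f} of reads look primer-only")
```

Requiring both V and J to be short is a deliberate choice here; reads like 1136, where only one side is short, would need a separate category.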

weshorton commented 8 years ago

Link to brief summary, and links to papers, of a few alternative TCR analysis programs

weshorton commented 8 years ago

Alignment Length Distributions

See alignment length report for analysis of V, J, and total alignment lengths in equivolume 151124 batch.

Based on report, we should not implement a size selection during library preparation. Report also suggests that MiXCR is doing its job in the sense that it is successfully assembling all of the alignments that are true CDR3 sequences.

ohsu-comp-bio / tcrseq_normalization

MiXCR Parameters #11
