Extra V regions - Githubissues

weshorton commented 8 years ago

Summary

There are a total of 31 V regions in the TCR beta locus. 11 of them are pseudogenes and 20 are transcribed genes (still need to confirm this). We only use primers for the 20 transcribed genes during our amplification. In our MiXCR clonotype output tables, there is a small subset of clonotypes that map to these pseudogenes.

Significance

We need to find the source of these pseudogene calls to decide whether we can confidently ignore them and, if not, what to do about them. There are two main theories as to how these pseudogenes are arising in our output files:

- They actually exist in our samples and are being amplified by one of the 20 V primers.
- They don't exist in our samples, but due to sequence similarity with our 20 V genes, MiXCR is incorrectly identifying some of our 20 real genes as some of the 11 pseudogenes instead.

To Do

Summarize output - Patrick
- Link to initial summary files
- What are all of the extra V genes that are identified by MiXCR?
- What are the relative rates at which these V's are being identified?
Determine sequence similarity of pseudogenes and real genes - Dhaarini
Compare rates of identification with sequence similarity using following approach - Wes

Approach

Create correlation between rate and similarity
- If all the pseudogenes are identified at similar rates and have similar sequence similarity, assume to be MiXCR error.
- If all pseudogenes are identified at similar rates and have a wide range of sequence similarity, assume to exist in our samples.
- If pseudogenes are identified at different rates and have a range of sequence similarity, assume to be MiXCR error.
How should we do the correlation? A linear model again? Can sequence similarity predict the rate of identification during a MiXCR run?
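
One way to start on the correlation question is sketched below. This is only an illustration, not an agreed method: it assumes we already have, for each pseudogene, one identification rate and one similarity score against its closest real V gene. The function names and the input layout are hypothetical.

```python
from math import sqrt


def pearson_r(similarity, rate):
    """Pearson correlation between per-pseudogene similarity and call rate."""
    n = len(similarity)
    if n != len(rate) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(similarity) / n
    mean_y = sum(rate) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(similarity, rate))
    sxx = sum((x - mean_x) ** 2 for x in similarity)
    syy = sum((y - mean_y) ** 2 for y in rate)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


def fit_line(similarity, rate):
    """Ordinary least-squares fit: rate = slope * similarity + intercept."""
    n = len(similarity)
    if n != len(rate) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(similarity) / n
    mean_y = sum(rate) / n
    sxx = sum((x - mean_x) ** 2 for x in similarity)
    if sxx == 0:
        raise ValueError("cannot fit a line when similarity is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(similarity, rate))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A strong positive correlation would be in line with the MiXCR-error theory (more similar pseudogenes get called more often). A weak one would not settle the question by itself.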

leyshock commented 8 years ago

Summary of V-regions in 151124_extra_segs.csv, found here.

Counts aggregated from "Best V Hit" column of .csv file.
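
For reference, a minimal sketch of this kind of aggregation is below. It is not the code used for the summary. It assumes a comma-separated file with a header row containing "Best V Hit", and it counts one per row (clonotype) rather than weighting by clone count, which may differ from how the summary was built.

```python
import csv
from collections import Counter


def count_best_v_hits(handle, column="Best V Hit"):
    """Count how many rows (clonotypes) list each V segment as the best hit."""
    counts = Counter()
    for row in csv.DictReader(handle):
        hit = (row.get(column) or "").strip()
        if hit:  # skip rows with no V assignment
            counts[hit] += 1
    return counts


def relative_rates(counts):
    """Convert raw counts to fractions of all counted rows."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in counts.items()}
```

Usage would look like `with open(path, newline="") as fh: rates = relative_rates(count_best_v_hits(fh))`, where `path` points at the exported .csv.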

leyshock commented 8 years ago

Adding some documents from Dhaarini: IMGT_nomenclature.pdf MalekFahamPatent.pdf TCRB_pseudogenes.docx

leyshock commented 8 years ago

Extracted from the Faham patent:

V_regions.xlsx

weshorton commented 8 years ago

Based on a suggestion by DM, we took an in-depth look at V22 hits. Subsequent results are from the equivolume DNA151124LC batch. I used this script to extract the read IDs of all the sequences that assembled to V22 and then pulled their fastq read information (i.e. ID, sequence, quality scores).
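
The fastq extraction step can be sketched roughly as follows. This is not the linked script, only an illustration of the idea. It assumes plain 4-line fastq records (no wrapped sequences) and that the read ID is the first whitespace-delimited token after the "@". The function names are hypothetical.

```python
def iter_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record fastq file."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a fastq header line, got: " + header.strip())
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def extract_reads(handle, wanted_ids):
    """Yield (read_id, sequence, quality) for reads whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    for header, seq, qual in iter_fastq(handle):
        read_id = header[1:].split()[0]
        if read_id in wanted:
            yield read_id, seq, qual
```

Here `wanted_ids` would be the read IDs of the sequences that assembled to V22.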

The fastq reads were then re-aligned using MiXCR align and those alignments were exported using the pretty format.

Observations

Many of the newly exported alignments no longer align to V22, but align to V24 or V26
Newly exported alignments that still align to V22 according to MiXCR align to V26 according to BLAT

These observations suggest to me that this is a sequence-similarity issue and that MiXCR is incorrectly identifying the reads; see the bottom for more information.

Gene Sequence Similarity

Sequences were taken from GenBank. I selected the V region for each gene and then compared the main body of the sequence (i.e. excluded the first fragment). I ran a sequence similarity alignment using BLAST; results are here. The V22-V24 and V22-V26 alignments don't seem to differ much from the V22-V1 alignment.

Moving Forward

Questions

Are we interested in any of the other pseudogenes? Julja mentioned comparing rate of pseudogene calls and sequence similarity. This analysis would back up current results.
We can increase the V segment alignment minimums in MiXCR.
- Not sure what this would do, seems like it wouldn't help us in the sense of correctly identifying V24 or V26 regions, but rather would just drop them instead of identifying them as V22.
- Also haven't looked at scores of the V22 alignments to see if they're potentially lower than average scores.

Actions

Export alignments with saving all alignments and compare alignment scores for top and next best alignment.
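
A rough sketch of the score comparison follows. It assumes the exported table has a column listing every V hit with its score in a form like `TRBV22*00(100),TRBV26*00(95.5)`. That format is an assumption and should be checked against the actual export; the function names are hypothetical.

```python
import re

# One hit looks like GENE(SCORE); hits are separated by commas (assumed format).
HIT_PATTERN = re.compile(r"\s*([^,()]+?)\((-?\d+(?:\.\d+)?)\)")


def parse_hits(field):
    """Return [(gene, score), ...] sorted from best to worst score."""
    hits = [(m.group(1), float(m.group(2))) for m in HIT_PATTERN.finditer(field)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def score_margin(field):
    """Score gap between the top and next-best hit, or None with fewer than 2 hits."""
    hits = parse_hits(field)
    if len(hits) < 2:
        return None
    return hits[0][1] - hits[1][1]
```

If the V22 calls turn out to have small margins over V24 or V26, that would fit the sequence-similarity explanation.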

weshorton commented 8 years ago

To Do

Make Milestone
Double check that bullet 3 from Moving Forward is completed. Complete it if not.
Bullet 1 from Moving Forward
Determine with Dhaarini what steps to take and how this affects the biology of the question.

ohsu-comp-bio / tcrseq_normalization

Extra V regions #9
