milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

V/D/J annotations with *00 allelic information #721

Closed. zktuong closed this issue 2 years ago

zktuong commented 2 years ago

Hi MiXCR team,

I'm trying to reannotate some short TRB nucleotide sequences (immunoSeq data) with MiXCR version 3/4 and was wondering whether I'm doing anything wrong.

I created a small fasta file from the nucleotide sequences:

>test1
CGCACACAGCAGGAGGACTCGGCCGTGTATCTCTGTGCCAGCAGCTTAGTGGCAGGGATACTCTCAACCTGGCAGTTCTTCGGGCCA
>test2
CCGACAGCTTTCTATCTCTGTGCCAGTAGTACCACAGGGTGGGGAGTCTTCCGAAGAGGAGAAGAACATGGGCAGTTCTTCGGGCCA
>test3
CTTGGAGATCCAGTCCACGGAGTCAGGGGACACAGCACTGTATTTCTGTGCCAGCAGTCGGGACAGCCTGAGCAGTTCTTCGGGCCA
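
A file like this can be generated with a few lines of Python. This is only a minimal sketch: the helper name `write_fasta` is illustrative and is not part of MiXCR, and the records are the three sequences shown above.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs to `path` in FASTA format."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")  # header line
        lines.append(seq)         # sequence on a single line
    Path(path).write_text("\n".join(lines) + "\n")


# The three sequences from the question
records = [
    ("test1", "CGCACACAGCAGGAGGACTCGGCCGTGTATCTCTGTGCCAGCAGCTTAGTGGCAGGGATACTCTCAACCTGGCAGTTCTTCGGGCCA"),
    ("test2", "CCGACAGCTTTCTATCTCTGTGCCAGTAGTACCACAGGGTGGGGAGTCTTCCGAAGAGGAGAAGAACATGGGCAGTTCTTCGGGCCA"),
    ("test3", "CTTGGAGATCCAGTCCACGGAGTCAGGGGACACAGCACTGTATTTCTGTGCCAGCAGTCGGGACAGCCTGAGCAGTTCTTCGGGCCA"),
]

write_fasta(records, "test.fa")
```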

I then ran it in shotgun mode:

mixcr analyze shotgun -s hsa --starting-material DNA --receptor-type trb test.fa test

When I looked at the output file, all the V/D/J calls were returned with *00.

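import pandas as pd

# Load the tab-separated MiXCR clonotype table for TRB
df = pd.read_csv('test.clonotypes.TRB.txt', sep='\t')
# Show the CDR3 amino-acid sequence and all V/D/J hits with their scores
df[['aaSeqCDR3', 'allVHitsWithScore', 'allDHitsWithScore', 'allJHitsWithScore']]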
               aaSeqCDR3                allVHitsWithScore allDHitsWithScore allJHitsWithScore
0  CASSTTGWGVFRRGEEHGQFF                   TRBV19*00(155)      TRBD1*00(30)    TRBJ2-1*00(84)
1       CASSLVAGILSTWQFF  TRBV7-2*00(245),TRBV7-8*00(229)      TRBD1*00(25)    TRBJ2-1*00(80)
2           CASSRD_PEQFF                 TRBV21-1*00(285)      TRBD1*00(40)    TRBJ2-1*00(95)

Is this behavior expected? What does it mean when it's *00?

mizraelson commented 2 years ago

This outcome is totally fine. The 00 is the allelic variant number; no allelic variant is available for TRBV, so it is reported as 00. You can export with -vGenes -jGenes -dGenes to get gene names without the *00 suffix.
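
If the table has already been exported with the `all*HitsWithScore` columns, gene-level names can also be recovered afterwards in Python. The following is a minimal sketch, assuming the hit format seen in the question (`GENE*ALLELE(score)`, comma-separated); `strip_alleles` is a hypothetical helper and not a MiXCR function.

```python
import re

# One hit looks like "TRBV7-2*00(245)": gene name, allele number, score
HIT_PATTERN = re.compile(r"^(?P<gene>[^*]+)\*(?P<allele>\d+)\((?P<score>[\d.]+)\)$")


def strip_alleles(hits):
    """Return gene names without allele suffix or score, in order of appearance."""
    # Missing values (e.g. NaN from pandas) or empty strings yield no genes
    if not isinstance(hits, str) or not hits:
        return []
    genes = []
    for hit in hits.split(","):
        match = HIT_PATTERN.match(hit.strip())
        if match is None:
            raise ValueError(f"unexpected hit format: {hit!r}")
        gene = match.group("gene")
        if gene not in genes:  # skip duplicates, keep first-seen order
            genes.append(gene)
    return genes


print(strip_alleles("TRBV7-2*00(245),TRBV7-8*00(229)"))
```

With the DataFrame from the question, this could be applied column-wise, for example `df['allVHitsWithScore'].map(strip_alleles)`.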

zktuong commented 2 years ago

thanks @mizraelson

> 00 - is the number of allelic variant. Which is absent for TRBV, thus it's 00.

Yes, I understand that. However, the table at IMGT lists allelic information for the V/D/J genes at the TRB locus, and none of the alleles is *00. I was wondering why MiXCR returns *00: does it mean that MiXCR couldn't distinguish the allelic variant from my short sequences and therefore just returns *00? While that could make sense for the V and J genes, the D gene should lie entirely within the input sequences.

mizraelson commented 2 years ago

Yes, you can use the IMGT reference instead of the MiXCR built-in reference, and you will see allelic variants. But we urge you to be very cautious about the TCR allele data in the IMGT reference.

zktuong commented 2 years ago

Thank you! That clarifies a lot. I'll close this issue now.