milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
322 stars 78 forks source link

Need help to interpret exportClones output #318

Closed xww52526 closed 6 years ago

xww52526 commented 6 years ago

Hello,

I have a T-cell sample with shallow sequencing, after mixcr v2.0.4 carried on align, assemble, clone report steps with default setting, I got results of following in clones report: Clone ID Clone count Clone fraction Clonal sequence(s) Clonal sequence quality(s) All V hits All D hits All J hits All C hits All V alignments All D alignments All J alignments All C alignments N. Seq. FR1 Min. qual. FR1 N. Seq. CDR1 Min. qual. CDR1 N. Seq. FR2 Min. qual. FR2 N. Seq. CDR2 Min. qual. CDR2 N. Seq. FR3 Min. qual. FR3 N. Seq. CDR3 Min. qual. CDR3 N. Seq. FR4 Min. qual. FR4 AA. Seq.FR1 AA. Seq.CDR1 AA. Seq.FR2 AA. Seq.CDR2 AA. Seq.FR3 AA. Seq.CDR3 AA. Seq.FR4 Ref. points 0 3 0.5 TGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG TRBV1800(16.7),TRBV21-100(16.7),TRBV10-300(15) TRBD200(27),TRBD100(26) TRBJ2-400(250) TRBC200(298),TRBC100(270) 273|279|310|0|6||30.0;277|283|314|0|6||30.0;270|276|307|0|6||30.0 11|22|48|8|19|SC14TSA19G|27.0;7|15|36|8|16|SC10T|26.0 20|42|70|22|44||110.0 ; TGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC 38 CAASRGC_AKNIQYF :::::::::0::6:8:13::19:22:22:44::: 1 3 0.5 TGTGCTTGGAGAGCCCAGAAGCTGGTATTT GGGGGGGGGGGFGGGGGGGGGGGGGGGGFG TRAV1700(1322) TRAJ5400(255) TRAC*00(161.7) 264|270|297|0|6||30.0 29|49|80|10|30||100.0 TGTGCTTGGAGAGCCCAGAAGCTGGTATTT 37 CAWRAQKLVF :::::::::0::6:::::10::30:::

So, there are two clones, the first clone has 3 V hits: TRBV1800(16.7),TRBV21-100(16.7),TRBV10-3*00(15)

Could you explain the meaning of "*00" and the numbers in "()"? The MiXCR documents I found didn't cover explanation of output.

Why a single clone can have three different V hits? My guess is that, my clonal sequence contains at least one nucleotide with low quality, they were deferred for further processing by mapping procedure; then these deferred sequences are mapped onto already assembled clonotypes. Then the clustering algorithm played a role in V gene differentiation. What's the best parameter setting for shallow sequencing, so that a more consistent V hit could be reported?

Thanks very much!

weshorton commented 6 years ago

You should consider updating to the latest version (2.1.8) which has quite a few updates. I'll try to answer a few of your questions to the best of my knowledge, but I'm sure the developers will have more detailed answers.

The "All V Hits" column will output all V regions that have alignment scores greater than the alignment score threshold. You can get another column labelled "Best V Hit" by using the argument -vHit in your export command. I would suggest creating a custom export fields file, as mentioned in the export man page: mixcr exportClones --preset-file myFields.txt clones.clns clones.txt. It's likely that many of the fields in the default export command are not necessary for your work and few fields makes it much easier to take a peek at your results!

I believe that the "*00" is a placeholder for which V allele your sequence was aligned to. I've never had anything but "*00" in my results. The Gene Features and Anchor Points section in the docs references the IMGT naming conventions. If you take a look at TRB V10 on their site (or here) you'll notice it's denoted as TRBV1001. TRB V1, for example, has two alleles 01 and *02. IMGT has a lot of great information for understanding more about TCR alignments.

The numbers within the "()" appear to correspond to alignment scores. For example:

Export Clone Entry

Clone ID    Clone count All V hits  All D hits  All J hits  All V alignments    All D alignments    All J alignments
10200   3   TRBV29*00(581.3)    TRBD1*00(50)    TRBJ1-4*00(215) 270|285|307|0|15|ST282G|61.0    12|22|36|15|25||50.0    20|43|71|25|48||115.0

Export Alignment Entry

All V hits  All D hits  All J hits  All C hits  All V alignments    All D alignments    All J alignments
TRBV29*00(586)  TRBD1*00(50)    TRBJ1-4*00(215)     165|285|307|0|120|ST282G|586.0  12|22|36|120|130||50.0  20|63|71|130|173||215.0
TRBV29*00(572)  TRBD1*00(50)    TRBJ1-4*00(215)     165|285|307|0|120|SC197TST282G|572.0    12|22|36|120|130||50.0  20|63|71|130|173||215.0
TRBV29*00(586)  TRBD1*00(50)    TRBJ1-4*00(215)     165|285|307|0|120|ST282G|586.0  12|22|36|120|130||50.0  20|63|71|130|173||215.0

In the clone file, the (581.3) for V29 corresponds to the average of the 3 alignments in the alignment file, same with the D and J segments. I'm not sure why under the "All V alignments" column, the alignment score is 61 rather than 581.3 or 586 or something. That may be an error.

The newer versions of MiXCR do not export the "()" values. A standard V Hit would just be "TRBV1*00", for example. The latest version of MiXCR also has improved alignment capabilities, so I strongly encourage you to update! I, for example, was getting CDR3 sequences that were including the V segment outside of the conserved CDR3 start site, which is fixed in the latest version.

xww52526 commented 6 years ago

@weshorton Thanks a lot for your explanation! They're helpful! The alignment scores are indeed confusing. In your example, other than the (581.3) for V29 & 61 in "All V alignments" of Export Clone Entry; the 115 in "All J alignments" of Export Clone Entry doesn't match with TRBJ1-4*00(215) as well.

I just installed v2.1.8, and tried a small dataset, the "()" values still present in my results:

cloneId cloneCount cloneFraction clonalSequence clonalSequenceQuality allVHitsWithScore allDHitsWithScore allJHitsWithScore allCHitsWithScore allVAlignments allDAlignments allJAlignments
0 9 0.529411765 TGTGCTGTGAGTGATCTCGAACCGAACAGCAGTGCTTCCAAGATAATCTTT GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG TRAV8-400(867.7),TRAV8-200(826.7)   TRAJ3*00(258) TRAC*00(167.7) 270|303|304|0|33|SA286TST288GSC289ASA292CSC294ASA299C|81.0;270|286|304|0|16|ST274C|66.0 25|51|82|25|51||130.0
1 6 0.352941176 TGTGCCAGCAGTTTCTCGACCTGTTCGGCTAACTATGGCTACACCTTC GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG TRBV12-300(1423.3),TRBV12-400(1353.3) TRBD1*00(25) TRBJ1-2*00(240.3) TRBC100(288.5),TRBC200(260.5) 273|287|310|0|14||70.0;273|287|310|0|14||70.0 4|9|36|19|24||25.0 19|40|68|27|48||105.0
2 2 0.117647059 TGTGCCAGCAGTTTCTCGACCTGTTCGGCTAACTATGGCTACACCATC GGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGEG TRBV12-300(1421),TRBV12-400(1351) TRBD1*00(25) TRBJ1-2*00(231) TRBC100(274),TRBC200(246) 273|287|310|0|14||70.0;273|287|310|0|14||70.0 4|9|36|19|24||25.0 19|40|68|27|48|ST37A|91.0

Hope the developers could help clarify more.

weshorton commented 6 years ago

Looks like you're using the -vHitsWithScore parameter in your export command, which will give you that output. I just use -vHits and -vHit.

A potential explanation for the difference in the scores could have something to do with the two-step alignment process. On the assemble man page, they mention:

The final step is to align clonal sequences to reference V,D,J and C genes. Since the assemblingFeatures are different from those used in align, it is necessary to rebuild alignments for clonal sequences. This alignments are built by more accurate aligner (since all hits are known in advance); thus, better alignments will be built for each clonal sequence.

So it's possible that the score in the allVAlignments corresponds to the score produced by alignment in the align step, while the score in the allVHitsWithScore corresponds to the score produced by alignment in the assemble step. That's just a guess though! Maybe someone else can shed some light.

xww52526 commented 6 years ago

@weshorton thanks very much! You're right, I indeed use the default setting --present full, so -vHitsWithScore was used instead of -vHits. Your explanation for the score changes make sense to me, I suspect it also relevant to the clustering algorithm. Hope the developers could clarify more. Thanks again!

dbolotin commented 6 years ago

Hi!

Sorry for delay, too long Russian holidays combined with unexpected family tasks took all the time 😃

Wes is right about all the points 👍 :

Score values you have for V hints is extremely low, which means that you have only several nucleotides of V segment present in your data. Could you please post some information of the library prep. protocol? General advice in this case is to use longer reads (at least 100 bp, better 150 bp) and/or use paired-end sequencing.

P.S. Please upgrade to the latest MiXCR version. Many bugs and other issue were fixed since 2.0.4.

xww52526 commented 6 years ago

@dbolotin Thanks for the detailed explaination!

We're doing RNA-seq, use 5'RACE based approach for library preparation, samples are sequenced on the Illumina platform using 300 bp paired-end reads; which should fully capture the TCR variable region & part of the constant region.

I was using the MiXCR default pipeline steps with default parameters for align, assemble and export clone. I guess I should use the special steps designed for processing RNA-seq https://mixcr.readthedocs.io/en/latest/rnaseq.html But I'm not quite sure what's your definition of "repertoire extraction from non-enriched and randomly-shred libraries"; we've designed TCR specific primers, so the RNA-seq libraries we prepared are indeed target enriched. Dose the -OallowPartialAlignments=true parameter always work better for RNA-seq? And how does the extendAlignments step actually work?

We have sample specific barcodes in our reads, we currently don't trim them before MiXCR alignment, there are also so oligoGs; will such seuquences impact alignment accuracy? What QC software you recommend to use before MiXCR alignment? You mentioned Scorevalues you have for V hints is extremely low, which means that you have only several nucleotides of V segment present in your data . Do you have some documentation about the score scales -- for example, how many nucleotides aligned are equivalent to score of 10, 20, 40 etc. The default absoluteMinScore is 40, what's the corresponding nucleotides length?

Thanks very much!

dbolotin commented 6 years ago

If your library was amplified with TCR-specific primer, your library is indeed enriched with target molecules and has a consistent sequence structure among reads (like all R2 reads start from C gene, and all R1 sequences contain 5'URT and some portion of coding sequence, or vice versa, or something similar to this). The term RNA-Seq was used in a more conventional sense in the docs, in this terms your library is not RNA-Seq, although sample preparation uses mRNA as a starting material. You don't need neither assemblePartial nor extendAlignments steps.

With default parameters, alignment accuracy should not be affected by the presence of any non-TCR subsequences (like adapters etc.) in the data. However, for the good quality data, trimming adapter sequences and using -p rna-seq on the alignment stage sometimes gives slightly better results (may give you extra ~3% of aligned sequences). Your data, on the other hand, seems to have some additional problems.

I recommend running something like fastqc for the data before analysis, for possible sequencing problems, and extracting several raw alignments to debug possible sample preparation issues, like this:

mixcr exportAlignmentsPretty -n 10 alignments.vdjca

Alignment scoring is calculated using tradition scheme (if kaligner2 is not used, linear gap scoring is used everywhere, if kaligner2 is used affine is used). The match/mismatch/gap scores.penalties are listed here: match = 5, mismatch = -9, gapPenalty = -12.