matsengrp / vampire

🧛 Deep generative models for TCR sequences 🧛
Apache License 2.0

OLGA gives surprisingly low probabilities to TRBJ1-6 and TRBJ2-7 #80

Closed matsen closed 5 years ago

matsen commented 5 years ago

TRBJ1-6 just gets lower than expected Pgen, while most of the time TRBJ2-7 seems normal but sometimes gets zero Pgen.

matsen commented 5 years ago

Here's a motivating picture: [image]

matsen commented 5 years ago

Email sent to Thierry and Aleks:


Dear Aleks and Thierry--

I've been continuing to work with OLGA, which has been a pleasure.

However, there are a couple of things about the default model that seemed surprising to us:

  1. Low probability for TRBJ1-6

Here we can see that in a pool of a million sequences, the default OLGA model generates zero TRBJ1-6 sequences:

(olga) flyx » olga-generate_sequences --humanTRB -n 1e6 -o olga-1e6.tsv
...
(olga) flyx » grep -c 1-6 olga-1e6.tsv
0

And as a positive control, we see lots of 2-7:

(olga) flyx olga/for-france » grep -c 2-7 olga-1e6.tsv
215317
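The grep counts above match the pattern anywhere on a line. A minimal Python sketch of a per-column tally is given here as a cross-check. The function name `count_j_genes` and the default column index are hypothetical: the thread does not state the column layout of the generated file, so `j_column` must be set to wherever the J gene name actually sits.

```python
import csv
from collections import Counter


def count_j_genes(tsv_path, j_column=3):
    """Tally the values found in one column of a tab-separated file.

    j_column is an assumption about the file layout, not something
    confirmed by the thread; adjust it to match the real output.
    """
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip rows that are too short to contain the requested column.
            if len(row) > j_column:
                counts[row[j_column]] += 1
    return counts


def total_matching(counts, fragment):
    """Sum the counts of every gene name that contains the given fragment."""
    return sum(n for name, n in counts.items() if fragment in name)
```

For example, `total_matching(count_j_genes("olga-1e6.tsv"), "J1-6")` would give the number of rows whose J gene name contains `J1-6`, under the column assumption above.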

If we look at the OLGA model, TRBJ1-6 does appear in model_params.txt:

%TRBJ1-6*01;CTCCTATAATTCACCCCTCCACTTTGGGAATGGGACCAGGCTCACTGTGACAG;5
%TRBJ1-6*02;CTCCTATAATTCACCCCTCCACTTTGGGAACGGGACCAGGCTCACTGTGACAG;6

However, if we understand the model format correctly, the 01 allele gets zero probability in model_marginals.txt:

[j_choice,5]

%0,0,0

[j_choice,6]

%0.757297,6.26266e-06,0.242697
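As a minimal sketch of the check described above, the following Python reads text shaped like the excerpt (a `[j_choice,N]` header followed by a `%`-prefixed line of comma-separated values) and reports the indices whose values are all zero. The function names are hypothetical, and the parser assumes only the layout visible in this excerpt, not the full file format.

```python
def parse_choice_blocks(text, event="j_choice"):
    """Map each index in an '[event,index]' header to the floats on the next '%' line."""
    blocks = {}
    current = None
    for raw in text.splitlines():
        # Assumption: tolerate an optional leading '#' before the header.
        line = raw.strip().lstrip("#")
        if line.startswith("[" + event + ",") and line.endswith("]"):
            current = int(line[len(event) + 2:-1])
        elif line.startswith("%") and current is not None:
            blocks[current] = [float(x) for x in line[1:].split(",")]
            current = None
    return blocks


def zero_probability_indices(blocks):
    """Return the sorted indices whose values are all exactly zero."""
    return sorted(i for i, probs in blocks.items() if all(p == 0.0 for p in probs))


excerpt = """
[j_choice,5]
%0,0,0
[j_choice,6]
%0.757297,6.26266e-06,0.242697
"""

blocks = parse_choice_blocks(excerpt)
print(zero_probability_indices(blocks))  # [5]
```

Index 5 is the one listed next to TRBJ1-6*01 in model_params.txt, which matches the reading that the 01 allele is never chosen.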

  2. Missing TRBJ2-7*02

OLGA assigns zero probability to CASSEGYEQYV TRBV2 TRBJ2-7:

(olga) flyx ~ » cat 02.tsv
CASSEGYEQYV TRBV2 TRBJ2-7
(olga) flyx ~ » olga-compute_pgen --human_T_beta -i 02.tsv
CASSEGYEQYV 0.0

This TCRB sequence should have high probability once the 02 allele, which ends in YV and is not rare, is taken into account.

As a positive control, we can see that the version using the 01 allele does have high probability:

(olga) flyx ~ » cat 01.tsv
CASSEGYEQYF TRBV2 TRBJ2-7
(olga) flyx ~ » olga-compute_pgen --human_T_beta -i 01.tsv
CASSEGYEQYF 2.4370209357248395e-06

The 02 allele does appear in model_params.txt:

%TRBJ2-7*02;CTCCTACGAGCAGTACGTCGGGCCGGGCACCAGGCTCACGGTCACAG;14

And in model_marginals.txt, index 14 gets a reasonable probability:

[j_choice,14]

%0.070029,0.0827931,0.847178

However, it's not in the anchors file (in fact, no secondary allele is present in the anchors file).
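A minimal sketch of the cross-check described above: collect the gene names from `%NAME;SEQUENCE;INDEX` lines, as seen in model_params.txt, and list those with no entry among the anchor names. The function names are hypothetical, and `anchor_names` below is an illustrative stand-in, since the thread does not show the anchors file itself.

```python
def allele_names(params_lines):
    """Extract gene names from lines shaped like '%NAME;SEQUENCE;INDEX'."""
    names = []
    for raw in params_lines:
        line = raw.strip()
        if line.startswith("%") and line.count(";") == 2:
            names.append(line[1:].split(";")[0])
    return names


def missing_from_anchors(params_lines, anchor_names):
    """Return the alleles named in the params lines that have no anchor entry."""
    anchored = set(anchor_names)
    return [name for name in allele_names(params_lines) if name not in anchored]


# Line copied from the model_params.txt excerpt quoted in this thread.
params_lines = [
    "%TRBJ2-7*02;CTCCTACGAGCAGTACGTCGGGCCGGGCACCAGGCTCACGGTCACAG;14",
]
# Hypothetical anchor names; the real anchors file is not reproduced here.
anchor_names = ["TRBJ2-7*01"]

print(missing_from_anchors(params_lines, anchor_names))  # ['TRBJ2-7*02']
```

Running the same comparison over every gene line in model_params.txt and every name in the anchors file would list all alleles affected in the same way.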

Can you help us out on this front?

Thank you for your patience,

Erick