matsengrp / vampire

🧛 Deep generative models for TCR sequences 🧛
Apache License 2.0

Uneven split #88

Open matsen opened 5 years ago

matsen commented 5 years ago

Right now I cut the CDR3 exactly in half. But here's a plot of the length of the untrimmed genes:

[image: histogram of untrimmed germline gene lengths, colored by locus]

It looks like we'd probably do better keeping the V on one side and the J on the other if we put 40% on the V side and 60% on the J.

matsen commented 5 years ago

library(ggplot2)

df = read.csv('/home/matsen/Downloads/repos/vampire/vampire/data/germline-cdr3-aas.csv', stringsAsFactors=FALSE)
# Amino acid length of each germline sequence.
df$length = nchar(df$sequence)

# Histogram of the lengths, colored by locus.
ggplot(df, aes(length, fill=locus)) + geom_histogram()

krdav commented 5 years ago

I disagree with that 🙄 because this is supposed to split the residues between the anchor residues. That is, looking at a structure, half of the loop should be left-aligned and half should be right-aligned. Splitting on the V/J gene contribution to the CDR3 does not guarantee that.

matsen commented 5 years ago

So what you are saying is that you think that there is meaningful structural homology in the middle 20% of the CDR3? 🤔

krdav commented 5 years ago

I am not sure what you mean by "meaningful structural homology".

Maybe an example can make my point clearer. Here are three CDR3 sequences. A = anchor residue, V = residue from V gene, J = residue from J gene:

AVVVVVJJJA
AVVJJJJJJA
AVVVVVJA

I suggest these are split into:

AVVVV---VJJJA
AVVJJ---JJJJA
AVVV-----VVJA

You suggest they are split into:

AVVVVV---JJJA
AVV---JJJJJJA
AVVVVV-----JA

If we had the true protein structures of these and aligned them all to the anchor residue, I would argue that the per-position distance, in 3D space, would be smaller for my alignment. To get the "per-position distance", walk along the alignment; at position X, take the residues and map them back onto their protein structures, then make all pairwise distance comparisons and take the mean.
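
The per-position distance described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `per_position_distance` and the `coords` input (one 3D point per residue, with all structures already superimposed on the anchor) are hypothetical and not part of vampire.

```python
from itertools import combinations
from math import dist


def per_position_distance(gapped, coords, gap="-"):
    """Mean pairwise 3D distance for each alignment column.

    gapped: gapped sequences, all of the same length.
    coords: coords[i][k] is the (x, y, z) point of the k-th residue of
            the ungapped sequence i (hypothetical input, assumed to be
            superimposed on the anchor already).
    Returns one value per column; None where fewer than two sequences
    have a residue in that column.
    """
    n_cols = len(gapped[0])
    if any(len(g) != n_cols for g in gapped):
        raise ValueError("all gapped sequences must have the same length")

    # Collect, for every column, the coordinates of the residues in it.
    columns = [[] for _ in range(n_cols)]
    for seq, xyz in zip(gapped, coords):
        k = 0  # index into the ungapped residues of this sequence
        for col, ch in enumerate(seq):
            if ch != gap:
                columns[col].append(xyz[k])
                k += 1

    # Average over all pairs of residues that share a column.
    result = []
    for points in columns:
        pairs = list(combinations(points, 2))
        if pairs:
            result.append(sum(dist(p, q) for p, q in pairs) / len(pairs))
        else:
            result.append(None)
    return result
```

Averaging the non-None column values would then give one number per gapping scheme, so that two schemes could be compared on the same set of structures.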

By making a split like this:

AVVVVV-----JA

you indicate that the last V residue is far from the J-anchor and, as a consequence, when comparing to longer CDR3 sequences, it might be in the same position as a residue in the middle between the two anchor residues.

matsen commented 5 years ago

Ah, sorry to be unclear. I'm suggesting a constant 40% / 60% split. Something like this:

AVVV---VVJJJA
AVVJ---JJJJJA
AVV-----VVVJA

with the logic that generally the J contribution is a little more than the V contribution.
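
A fixed-fraction split like the one above can be sketched in a few lines. This is a minimal illustration, not the code in vampire: the name `gap_cdr3` and the choice to round the left-hand count down are assumptions, picked because they reproduce the examples in this thread.

```python
def gap_cdr3(seq, width, left_fraction=0.4, gap="-"):
    """Pad seq to `width` characters with a single run of gaps.

    The first int(len(seq) * left_fraction) residues stay on the left,
    the rest go on the right (rounding down is an assumption).
    """
    if len(seq) > width:
        raise ValueError("sequence is longer than the target width")
    n_left = int(len(seq) * left_fraction)
    return seq[:n_left] + gap * (width - len(seq)) + seq[n_left:]


for s in ["AVVVVVJJJA", "AVVJJJJJJA", "AVVVVVJA"]:
    # 40/60 split next to the 50/50 split for comparison.
    print(gap_cdr3(s, 13), gap_cdr3(s, 13, left_fraction=0.5))
```

With `left_fraction=0.5` the same function gives the symmetric split that krdav describes, so the two proposals differ only in this one parameter.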

krdav commented 5 years ago

Okay, this changes things slightly, but I will still argue that a 50/50 split will give the smallest per-position distance - which is what we want. If we did not care about this at all we should just left align everything.

It is hard to see the problem with a 40/60 split because it is already so close to a 50/50 split. The problem with 40/60 is that it is not symmetric. To see the problem more clearly, let's look at an extreme case with one short and one long CDR3 and a more extreme 10/90 split:

AVV---VVVVVVVVJJJJJJJJJJA
AV-------------VVVVJJJJJA

Take position 13 in the alignment. For the first sequence this is in the middle of the long CDR3, far away from the V-anchor. For the second sequence this is the second residue after the V-anchor. If we had the protein structures, I guarantee you that those two residues would be far away from each other.

matsen commented 5 years ago

This is all a minor point, so we should only continue if it's fun. So...

"a 50/50 split will give the smallest per-position distance - which is what we want."

Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily. I'm not convinced that alignment of residues in the middle of CDR3s of different length is actually so meaningful from a structural homology perspective.

"If we did not care about this at all we should just left align everything."

No, not at all. If we left aligned everything, then we'd lose the real homology between the J genes for sequences of different length.

krdav commented 5 years ago

"Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily."

Maybe, but following that argument we should be splitting on the V/J gene border and not on a hard 40/60 threshold. Also, we don't even know where V starts and J ends; we impute it from alignment. But even if we had the true V/J start/end, I still think a structurally justified 50/50 split is better.

Also, I do think there is a meaningful structural difference:

[image: screen shot 2019-02-03 at 6 06 11 pm]

Granted, the difference gets smaller the closer we get to 50/50, so 40/60 is not far from that.

matsen commented 5 years ago

Hm, it seems like you're wanting to take this argument to extremes. That's not what I'm proposing. Also, I'm not proposing anything other than a fixed split, ever.

Given non-equal-sized building blocks, it seems impossible that 50/50 would be the optimal split. Perhaps it's 49/51, but I don't see how 50/50 can be optimal. I'd think that structural homology is strongest in germline-gene-encoded regions, so that a slight modification would actually improve structural homology.

krdav commented 5 years ago

Well, the only reason I took it to the extreme was to show how an asymmetric split breaks.

"non-equal-sized building blocks": the amino acid backbone is actually equally sized (with the slight exception of proline, which is a bit more rotationally constrained).

I didn't grab the symmetry concept out of nowhere. The AHo numbering also uses two anchor residues and splits insertion residues between them (https://www.sciencedirect.com/science/article/pii/S0022283601946625). And this is their reason: "it places the alignment gaps in a way that minimizes the average deviation from the averaged structure of the aligned domains"

I don't know if this is the kind of structural homology you are referring to?

Ultimately, this is a theoretical argument, but I will completely surrender to your argument if you can show me that this improves any of the empirical metrics. My prediction is that a 40/60 split won't really do anything.

matsen commented 5 years ago

Yes, I knew all of this would just come down to "well, we'll see!"

Machine learning... 😬

krdav commented 5 years ago

Haha, feed into black box, watch what comes out, present it like you knew all along.