Closed matsen closed 5 years ago
So, fast forward several months and we are using separate data sets for test. I think the conclusion from last time is that we don't worry about the fact that some sequences are shared between repertoires, and that if we were to drop these we would be skewing the distribution. E.g. see the image in #72 .
As a single data point, if we do this comparison:
python util.py sharedwith pipe_main/_ignore_results/2018-12-25-merged-seshadri-generalization/TB-1100_TCRB/TB-1100_TCRB.for-test.csv pipe_main/_ignore_results/2018-12-25-merged-seshadri-generalization/all-seshadri-01-TCRB/all-seshadri-01-TCRB.processed.training.csv x.csv
983 of 42152 sequences are shared, which is about 2.3%.
@matsen - That's right, from what I recall. We also agreed that we shouldn't collapse to unique amino acid sequences, since if multiple nt sequences in a repertoire map to the same AA sequence, then such an AA sequence is likely meaningful and the loss should reflect that.
Thank you for the reminder! I will make this change.
@BrandenOlson by the same logic, if a sequence is present in multiple different repertoires and we are training on a merged collection of repertoires we should keep the replicate copies, right? Seems like an obvious corollary.
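The keep-the-duplicates policy from the last two comments can be sketched as follows. This is only an illustration of the bookkeeping, with a hypothetical `translate` callable standing in for whatever nt-to-AA translation the pipeline actually uses.

```python
from collections import Counter

def merged_aa_counts(repertoires, translate):
    """Merge repertoires WITHOUT deduplicating, then count how many nt
    reads support each AA sequence.  AA sequences backed by multiple nt
    sequences (within or across repertoires) keep their extra weight,
    so the training loss reflects their higher probability under the
    true distribution.  Collapsing to unique AA sequences would discard
    exactly these counts."""
    merged = [seq for rep in repertoires for seq in rep]  # replicates kept
    return Counter(translate(s) for s in merged)
```

Here `translate` is an assumed hook; in a toy run, two nt sequences mapping to the same AA sequence simply yield a count of 2 for it rather than being collapsed to 1.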
@matsen That sounds sensible to me!
@eharkins I think that you looked into the literature about this? If you have any links post them here, please.
Here are some notes on what I found looking back in Slack, etc:
What makes sense to me (in my own words): "What we care most about is that the data accurately represent the true distribution of TCR sequences (in general) that we are trying to learn. If we feel that our sampled data are not representative of the distribution we are trying to learn (someone's repertoire), that is a problem independent of whether we should de-duplicate. In fact, de-duplicating seems like it would hurt our chance of representing the true distribution we are trying to learn, unless we sampled a rare TCR sequence and are giving it more probability to be generated than it should get. But again, in that case the limiting factor seems to be not having enough samples / sequences to get a good representative dataset.
That being said, it seems like validation of generative models is not trivial, so there isn't a recipe for doing it in general, though there are some examples of how others have done it in a reasonable way. https://arxiv.org/pdf/1712.02311.pdf Does this apply to our situation?"
Sources:
- https://www.quora.com/Should-we-remove-duplicates-from-a-data-set-while-training-a-Machine-Learning-algorithm-shallow-and-or-deep-methods
- https://stats.stackexchange.com/questions/222297/do-examples-in-the-training-and-test-sets-have-to-be-independent
- https://arxiv.org/pdf/1712.02311.pdf
- https://arxiv.org/pdf/1707.02392.pdf
The last two (arxiv) are more about the question of how to validate generative models in general, if I remember right. The first two are more about the specific question you are discussing here. I guess I didn't encounter any official literature on that.
We're currently training on a set of sequences and then comparing the fit to the training data. It's good that we do in fact fit the training data, but perhaps that's not surprising given the flexibility of NNs.
@jjfeng has suggested that a better type of validation would be to use a second data set as a validation set. It seems like we can do the same set of comparisons using biologically-motivated summary statistics. The trick: how to do a train/test split? Let's stick to TCRs, so we don't have the clonal family problem.
Repertoires vary between individuals, so we'd want to train and test on samples from the same individual, and not separated by some big immune event like a vaccination. So we could split a single repertoire in half to do a comparison. Jean says:
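The single-repertoire split suggested above could be sketched like this (a minimal illustration, not the pipeline's actual splitting code; note it deliberately does not deduplicate across the halves, per the discussion earlier in this thread):

```python
import random

def split_repertoire(seqs, test_frac=0.5, seed=0):
    """Randomly split one individual's repertoire into train/test halves.
    Shared sequences may land in both halves via replicate copies; we
    keep them, since dropping them would skew the distribution."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

Sampling from the same individual, with no major immune event (e.g. vaccination) between the samples, keeps the two halves draws from the same underlying distribution.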
Discuss.