Closed matsen closed 5 years ago
So, fast forward several months and we are using separate data sets for test. I think the conclusion from last time is that we don't worry about the fact that some sequences are shared between repertoires, and that if we were to drop these we would be skewing the distribution. E.g. see the image in #72 .
As a single data point, if we do this comparison:
python util.py sharedwith pipe_main/_ignore_results/2018-12-25-merged-seshadri-generalization/TB-1100_TCRB/TB-1100_TCRB.for-test.csv pipe_main/_ignore_results/2018-12-25-merged-seshadri-generalization/all-seshadri-01-TCRB/all-seshadri-01-TCRB.processed.training.csv x.csv
983 of 42152 sequences are shared, which is about 2.3%.
@matsen - That's right, from what I recall. We also agreed that we shouldn't collapse to unique amino acid sequences, since if multiple nt sequences in a repertoire map to the same AA sequence, then such an AA sequence is likely meaningful and the loss should reflect that.
Thank you for the reminder! I will make this change.
@BrandenOlson by the same logic, if a sequence is present in multiple different repertoires and we are training on a merged collection of repertoires we should keep the replicate copies, right? Seems like an obvious corollary.
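The keep-the-duplicates policy from the last two comments can be sketched as follows. This is only an illustration of the bookkeeping, with a hypothetical `translate` callable standing in for whatever nt-to-AA translation the pipeline actually uses.

```python
from collections import Counter

def merged_aa_counts(repertoires, translate):
    """Merge repertoires WITHOUT deduplicating, then count how many nt
    reads support each AA sequence.  AA sequences backed by multiple nt
    sequences (within or across repertoires) keep their extra weight,
    so the training loss reflects their higher probability under the
    true distribution.  Collapsing to unique AA sequences would discard
    exactly these counts."""
    merged = [seq for rep in repertoires for seq in rep]  # replicates kept
    return Counter(translate(s) for s in merged)
```

Here `translate` is an assumed hook; in a toy run, two nt sequences mapping to the same AA sequence simply yield a count of 2 for it rather than being collapsed to 1.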
@matsen That sounds sensible to me!
@eharkins I think that you looked into the literature about this? If you have any links post them here, please.
Here are some notes on what I found looking back in Slack, etc:
What makes sense to me (in my own words): "What we care most about is that the data accurately represent the true distribution of TCR sequences (in general) that we are trying to learn. If we feel that our sampled data are not representative of the distribution we are trying to learn (someone's repertoire), that is a problem independent of whether we should de-duplicate. In fact, de-duplicating seems like it would hurt our chance of representing the true distribution we are trying to learn, unless we sampled a rare TCR sequence and are giving it more probability to be generated than it should get. But again, in that case the limiting factor seems to be not having enough samples / sequences to get a good representative dataset.
That being said, it seems like validation of generative models is not trivial, so there isn't a recipe for doing it in general, though there are some examples of how others have done it in a reasonable way. https://arxiv.org/pdf/1712.02311.pdf Does this apply to our situation?"
Sources:
- https://www.quora.com/Should-we-remove-duplicates-from-a-data-set-while-training-a-Machine-Learning-algorithm-shallow-and-or-deep-methods
- https://stats.stackexchange.com/questions/222297/do-examples-in-the-training-and-test-sets-have-to-be-independent
- https://arxiv.org/pdf/1712.02311.pdf
- https://arxiv.org/pdf/1707.02392.pdf
The last two (arxiv) are more about the question of how to validate generative models in general, if I remember right. The first two are more about the specific question you are discussing here. I guess I didn't encounter any official literature on that.
We're currently training on a set of sequences and then comparing the fit to the training data. It's good that we do in fact fit the training data, but perhaps that's not surprising given the flexibility of NNs.
@jjfeng has suggested that a better type of validation would be to use a second data set as a validation set. It seems like we can do the same set of comparisons using biologically-motivated summary statistics. The trick: how to do a train/test split? Let's stick to TCRs, so we don't have the clonal family problem.
Repertoires vary between individuals, so we'd want to train and test on samples from the same individual, and not separated by some big immune event like a vaccination. So we could split a single repertoire in half to do a comparison. Jean says:
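The single-repertoire split suggested above could be sketched like this (a minimal illustration, not the pipeline's actual splitting code; note it deliberately does not deduplicate across the halves, per the discussion earlier in this thread):

```python
import random

def split_repertoire(seqs, test_frac=0.5, seed=0):
    """Randomly split one individual's repertoire into train/test halves.
    Shared sequences may land in both halves via replicate copies; we
    keep them, since dropping them would skew the distribution."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

Sampling from the same individual, with no major immune event (e.g. vaccination) between the samples, keeps the two halves draws from the same underlying distribution.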
Discuss.