psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Non-reproducible results after VL update #208

Closed krdav closed 8 years ago

krdav commented 8 years ago

Related to my performance test in issue #207.

This time I ran the test on 5,000 sequences, old vs. new: partis_old_fullpart10k.log.txt partis_new_fullpart10k.log.txt

Notice that the sw-align step is still much slower in the new partis, but because bcrham is faster, and the partitioning protocol makes many calls to bcrham, the overall time makes up for the loss.

Performance aside, what I am more worried about is the difference in partitioning outcome: partis_old_partition.csv.txt partis_new_partition.csv.txt

I was under the impression that partitioning in partis is deterministic, and that the same goes for sw alignment and bcrham, but here the results of new vs. old are clearly different. It is also interesting that the old partis achieves a better log likelihood than the new one.

psathyrella commented 8 years ago

Short answer: no, partitioning is not deterministic. It can be made deterministic -- there are substantial shenanigans in the testing framework to ensure that if you pass the same --seed argument, all steps give reproducible answers. But I'm not sure this is a desirable thing to do in general, because it could convey a false sense of certainty about how close we are to the correct partition (or, put another way, it could be interpreted as under-estimating the uncertainty). Also, between versions of partis separated by lots of time, many little things change which make small differences -- adding and removing germline genes, changing interpolation schemes, etc.

The main source of non-determinism is that at each stage of parallelization we decide which sequences go to which subprocess. I tried a bunch of fancy stuff with sending more similar sequences to the same subprocess, but it ended up being faster to apportion them randomly (the idea is that we want to equalize the number of calculations over all the subprocesses). So if you don't explicitly set the same seed on subsequent runs, the random apportionment will send different sequences to each subprocess. Then, for instance, say a1, a2, and a3 are clonal: if a1 and a2 are together in a subprocess at the first step, they'll get merged in the first step and later on merged with a3. But different apportionments will result in different merge orders.
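
To make that concrete, here's a minimal sketch (my own illustration, not the actual partis code) of how seeded random apportionment behaves: with a fixed seed the split into subprocesses is reproducible, without one it changes from run to run, and so does the order of the merges.

```python
import random

def apportion(seq_ids, n_procs, seed=None):
    """Randomly split sequence ids into n_procs roughly equal batches.

    With the same seed the split is reproducible; without a seed, clonal
    sequences (e.g. a1, a2, a3) can land in different subprocesses on
    different runs, which changes the order in which they get merged."""
    rng = random.Random(seed)
    shuffled = list(seq_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_procs] for i in range(n_procs)]

# same seed -> same apportionment -> same merge order
print(apportion(['a1', 'a2', 'a3', 'b1', 'b2', 'c1'], 2, seed=42))
# no seed -> apportionment (and hence merge order) varies run to run
print(apportion(['a1', 'a2', 'a3', 'b1', 'b2', 'c1'], 2))
```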

This doesn't, generally, make a big difference -- if they're clonal, they'll have similar annotations. But in a large enough sample you'll always have weird stuff like a bunch of unlikely mutations stacked next to each other in, say, a3, so it makes some difference.

But the point is, I don't think this is a problem -- it's telling us the same thing as the progression of log likelihood-partition pairs in the output csv: as we go along merging clusters there is at some point a peak in the likelihood, and that peak has some finite breadth. The most likely partition, for sure, is the one at the top. But how much time we need to spend looking at the nearby partitions (i.e. looking at the least certain merges) depends on the application.

Incidentally, you can increase the number of partitions surrounding the most likely one that get written to file with --n-partitions-to-write.
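
For what it's worth, here's a rough sketch of scanning that progression in the output csv; the `logprob` and `partition` column names are assumptions about the file layout, so adjust them to whatever your csv actually contains.

```python
import csv

def read_partitions(fname):
    """Read log likelihood-partition pairs from the partition csv.

    Column names 'logprob' and 'partition' are assumed here, not taken
    from the partis docs -- rename them to match your actual output."""
    with open(fname) as csvfile:
        rows = list(csv.DictReader(csvfile))
    for row in rows:
        row['logprob'] = float(row['logprob'])
    return rows

rows = read_partitions('partis_new_partition.csv')
best = max(rows, key=lambda r: r['logprob'])
print('best log likelihood: %.2f' % best['logprob'])
# the partitions near the peak (how many get written is controlled by
# --n-partitions-to-write) typically differ by only a few uncertain merges
for row in rows:
    print('%10.2f  %s' % (row['logprob'], row['partition'][:60]))
```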

psathyrella commented 8 years ago

oh, also: I wouldn't ascribe any significance to the difference in likelihood between the two versions. Innumerable changes will have affected the absolute likelihood numbers.

also also: #91 discusses the question of clustering uncertainty in somewhat more detail, in the context of sequential monte carlo.

krdav commented 8 years ago

If I understand correctly, you are saying that the uncertainty comes from a trade-off against the computation spent on partitioning, and that you could make the partitioning deterministic just by running it on one core. Then the different log likelihoods displayed in the files I attached can actually be interpreted as comparable, and clearly one is better than the other, but the argument is that the small difference in LL is not worth any extra search.

psathyrella commented 8 years ago

I would say that the uncertainty comes from the fact that the clustering (and annotation) problem is not inherently trivial. Writ small: for instance, if the difference between clonal family A and family B is a single-base germline snp in the V gene, there isn't any way to say with much certainty to which family a sequence from one of these belongs. It could be either.
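
A toy illustration of that ambiguity (made-up sequences, nothing partis-specific): if the two candidate germline V alleles differ at a single position, a read mutated at that same position is equally distant from both, so either family explains it equally well.

```python
def hamming(a, b):
    # number of mismatched positions between two equal-length sequences
    return sum(x != y for x, y in zip(a, b))

# two made-up germline V alleles that differ by a single snp (position 3)
germline_a = 'CAGGTGCAG'
germline_b = 'CAGCTGCAG'

# a read whose only difference from either germline is at that same position
read = 'CAGTTGCAG'

print(hamming(read, germline_a), hamming(read, germline_b))  # 1 1 -> either family fits equally well
```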

In this case, yes, you would probably get more similar results with one process than with many, but that wouldn't be telling us anything interesting -- we're just shuffling things within the uncertainties.

In general, yes, definitely, a file could have fifty sequential partitions that didn't differ significantly in likelihood. The sharpness of the likelihood peak is something that varies from sample to sample. A good way to get some intuition for what this means is to print out the annotations for each cluster before and after merging.

There isn't any cosmic significance to the number of partitions in the output file -- we report the best one, and a number around the best one, but the default number is just a tradeoff with file size, memory usage, and computation time.

krdav commented 8 years ago

Thanks Duncan, I will close this issue since my questions were answered.