psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Light chain partitioning inconsistencies #217

Closed (krdav closed this issue 7 years ago)

krdav commented 7 years ago

I am running light chain clustering with --print-cluster-annotations and I am getting some results that are inconsistent with the heavy chain output. First inconsistency: normally the last number in the clustering output also equals the final number of clusters, so from the output below the total should be 42:

--> 51 clusters with 2 procs
hmm
    writing input
  warning no hmm files for glfo genes IGKV3-20*02 IGKV3-20*01 IGKV1-5*02 IGKV6D-21*02 IGKV3-11*02 IGKV2-30*02 IGKV1D-17*01 IGKV6D-41*01 IGKV2-28*01 IGKV2D-24*01 IGKV1-13*02 IGKV1D-42*02 IGK
       naive hfrac bounds: 0.015 0.072   (0.032 mutation in _output/305-306_mergedAAidx_98centroids/hmm)
    running 2 procs
      --> proc 1
        read 11335 cached logprobs and 6749 naive seqs
        stop with:  big hfrac 28   small lratio 0   total 171
        calcd:   vtb 5     fwd 7     hfrac 86          merged:  hfrac 3    lratio 3   
        time: bcrham 3.5

      --> proc 0
        read 11335 cached logprobs and 6749 naive seqs
        stop with:  big hfrac 44   small lratio 0   total 253
        calcd:   vtb 3     fwd 9     hfrac 71          merged:  hfrac 0    lratio 3   
        time: bcrham 4.6

      time waiting for bcrham: 4.8
      hmm step time: 5.5
          n calcd: 24 (12.0 per proc)
--> 42 clusters with 1 procs
hmm
    writing input
  warning no hmm files for glfo genes IGKV3-20*02 IGKV3-20*01 IGKV1-5*02 IGKV6D-21*02 IGKV3-11*02 IGKV2-30*02 IGKV1D-17*01 IGKV6D-41*01 IGKV2-28*01 IGKV2D-24*01 IGKV1-13*02 IGKV1D-42*02 IGK
       naive hfrac bounds: 0.015 0.072   (0.032 mutation in _output/305-306_mergedAAidx_98centroids/hmm)
    running 1 procs
        read 11351 cached logprobs and 6762 naive seqs
        stop with:  big hfrac 78   small lratio 0   total 465
      calculating and writing annotations
        calcd:   vtb 38    fwd 46    hfrac 229         merged:  hfrac 4    lratio 7   
        time: bcrham 169.4

      time waiting for bcrham: 169.5
      hmm step time: 170.2
      loop time: 322.6
    writing cluster annotations to 305-306_merged_centroid_partitions-cluster-annotations.csv
  annotations for final partition:
    read output
      100001A41465L2P0H0:158989A41659L2P0H0:21935A150576L2P0H0:255017A50093L2P0H0:261994A40295L2P0H0:63652A112358L2P0H0:158767A149307L2P0H0:141576A123735L2P0H0:159883A111273L2P0H0:44976A52484L2P0H0:2259
    inferred:

Second: counting the lines in foo-cluster-annotations.csv, where the annotations are dumped one line per cluster, I get 32 lines.

Third: to further confuse the matter, when I look at the partition output file foo.csv I see the following:

logprob,n_clusters,n_procs,partition
-1119584.1008244697,35,1,xxxxxxxxxxxxxxxx
-1119569.3345166682,34,1,xxxxxxxxxxxxxxxx
-1127494.147480261,33,1,xxxxxxxxxxxxxxxx
-1127403.1097603654,32,1,xxxxxxxxxxxxxxxx
-1127611.804724871,31,1,xxxxxxxxxxxxxxxx

This suggests that the best number of clusters is actually not 32 but 35. What is going on here??? :)

psathyrella commented 7 years ago

So I'm definitely guessing here, without the sequences to run on or the full output (and I'm not sure what's supposed to be 42 above?), but I think I can explain what's going on.

So, within any given clustering step (i.e. within each c++ process, so 15 of these are going on simultaneously if it says "N clusters with 15 procs"), it first checks all the clusters' naive hamming distances, and goes through cycles of merging until it runs out of sequences that we want to merge just based on naive hamming distance (i.e. that are really close). Then, it goes through calculating the likelihoods for all the cluster pairs that are kinda-sorta close in naive hamming, and merging all of these that have big likelihood ratios.
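The two-phase cycle described above can be sketched as a toy agglomerative loop. This is purely illustrative, not partis's actual data structures or API: `cluster_step`, `hamming_fraction`, and the `lratio_fcn` callback are hypothetical names, each cluster is just a `(member_ids, naive_seq)` pair, and "re-annotating" a merged cluster is stood in for by keeping the larger cluster's naive sequence.

```python
from itertools import combinations

def hamming_fraction(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_step(clusters, lratio_fcn, lo=0.015, hi=0.072):
    """One toy clustering step: naive-hamming merges first, then
    likelihood-ratio merges for pairs that are kinda-sorta close.
    Clusters are (member_id_set, naive_seq) pairs."""
    def merge(a, b):
        members = a[0] | b[0]
        # stand-in for annotating the joint cluster to get its naive seq
        naive = a[1] if len(a[0]) >= len(b[0]) else b[1]
        return (members, naive)

    # phase 1: keep merging while some pair is really close in naive hamming
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if hamming_fraction(a[1], b[1]) < lo:
                clusters = [c for c in clusters if c is not a and c is not b]
                clusters.append(merge(a, b))
                merged = True
                break

    # phase 2: compute likelihood ratios only for pairs within the upper
    # hfrac bound, and merge the ones with big (here, positive log) ratios
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if hamming_fraction(a[1], b[1]) < hi and lratio_fcn(a, b) > 0:
                clusters = [c for c in clusters if c is not a and c is not b]
                clusters.append(merge(a, b))
                merged = True
                break
    return clusters
```

Note that nothing here ever computes the likelihood of the whole partition; only pairwise ratios for close-enough pairs are evaluated, which is the optimization discussed below.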

Now, getting to the point: each time we decide to merge two clusters, we immediately calculate the naive sequence for the joint cluster (i.e. we annotate the new merged cluster), so we can do the naive hamming preclustering on it. But we don't calculate the log probability by default, since that would involve calculating lots of likelihoods that we don't end up needing. This means that when we get to the end of clustering (no more likelihood ratios bigger than 1) in that particular subprocess, we don't know the full likelihood of the entire partition. Which is the way we want it -- it's an optimization.

But in the last clustering step, with only one process, we make sure to calculate all the likelihoods before spitting out the final list of partitions, so in the output file you see the list of partitions and their likelihoods. The upshot is that in some cases the c++ process will make a naive hamming merge which the python process, upon reading the likelihoods that were calculated after making the merge, decides "should" not have happened. The c++ decides which clusters to annotate based on what it knows about the likelihood, so the cluster-annotations.csv file can be different from the most likely line in the csv partition file.

I could be missing something in your output that's inconsistent -- I don't really understand what's going on without the full stdout and files -- but I don't view this as a problem: in most realistic data sets the most likely partition is not significantly more likely than the surrounding partitions, and I find that this likelihood-based statement is typically easy to verify by annotating the few most likely partitions and seeing that they look roughly equally plausible.
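The mismatch is easy to see by reading the partition file back and comparing the highest-logprob row against the last row written. A minimal sketch with Python's csv module, using the logprob/n_clusters rows quoted earlier in the thread (cluster memberships elided, replaced with a placeholder):

```python
import csv
import io

# rows copied from the foo.csv excerpt above; the partition column is elided
partition_csv = """logprob,n_clusters,n_procs,partition
-1119584.1008244697,35,1,x
-1119569.3345166682,34,1,x
-1127494.147480261,33,1,x
-1127403.1097603654,32,1,x
-1127611.804724871,31,1,x
"""

rows = list(csv.DictReader(io.StringIO(partition_csv)))
best = max(rows, key=lambda r: float(r['logprob']))   # most likely partition
last = rows[-1]                                       # last merge performed
print(best['n_clusters'], last['n_clusters'])         # -> 34 31
```

Note the most likely row here is actually the 34-cluster one (logprob -1119569 is larger than -1119584), which is neither the last partition written nor the 32-cluster partition the annotation file happens to correspond to.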

krdav commented 7 years ago

The 42 I referred to was from --> 42 clusters with 1 procs, but I see now that this is not necessarily the last partition; I had gotten that into my head because I had seen this number match the first line in the partition file. So this is definitely not a problem per se.

For the other two, I think I understand the reason, but I still find the output slightly inconsistent, because the expected number of clusters in the annotation file (at least in my mind) is the one from the highest-likelihood partition. That said, I totally understand that all those partitions listed with very high likelihoods are not significantly more or less likely than one another, and hence it doesn't matter if the annotations printed are for number 4 on the list rather than for the first partition.

psathyrella commented 7 years ago

ok, I think I understand everything above if the 42 isn't involved -- and yes, the 32 lines in the cluster annotation file correspond to the partition with likelihood -1127403, whereas the 34-cluster line with likelihood -1119569 is more likely, and running with --debug 1 would no doubt tell us that two of the four hfrac merges it says it made in the last step turned out to be merges that likelihood disagreed with.

And I should have said that I don't view it as a big problem -- I agree the inconsistency is confusing, and it's on the todo list to clarify. It will just involve some shenanigans to circumvent how things are set up, and it's easy to get the annotations for any cluster in any of the partitions anyway, with --queries, so...