matsengrp / linearham

A Bayesian Phylo-HMM for B cell receptor sequence analysis
http://matsengrp.github.io/linearham

light chain crash #83

Open psathyrella opened 2 years ago

psathyrella commented 2 years ago

At least on my last pull, it was crashing while parsing the linearham annotation, but unfortunately I can't find the error logs at the moment. If someone is working on this and can't replicate it, just let me know. Here's a light chain file for testing: https://github.com/psathyrella/partis/blob/dev/test/paired/ref-results/partition-new-simu/partition-igk.yaml

psathyrella commented 2 years ago

OK, I think I have tracked this down. To replicate (rename the attached cluster_yaml.log to cluster.yaml first, since GitHub won't let me upload yaml):

./scripts/write_lh_annotations.py cluster.yaml linearham_run.log --output-base path/to/out

Attachments: linearham_run.log, cluster_yaml.log

The issue seems to be that linearham doesn't write the actual VJ insertion sequence for light chains (like it does for heavy chain insertions), and instead just writes a boolean indicating whether one is there (linked examples: heavy, light),

which means that this line fails: https://github.com/matsengrp/linearham/blob/master/scripts/write_lh_annotations.py#L59 because it's resetting the deletion length without changing the insertion length:

          File "./scripts/write_lh_annotations.py", line 61, in update_partis_line_with_lh_annotation
            utils.add_implicit_info(glfo, partis_line, check_line_keys=debug)
          File "/home/dralph/work/linearham/lib/partis/python/utils.py", line 1711, in add_implicit_info
            raise Exception('naive and mature sequences different lengths %d %d for %s:\n    %s\n    %s' % (len(line['naive_seq']), len(line['seqs'][iseq]), ' '.join(line['unique_ids']), line['naive_seq'], line['seqs'][iseq]))
        Exception: naive and mature sequences different lengths 321 322 for 0901808538749971629-igk 5827556154176009883-igk 0316148055258110569-igk 0673794353012672613-igk 5234516726189794955-igk:
            GACATCCAGATGACCCAGTCTCCATCTTCTGTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGTCGGGCGAGTCAGGGTATTAGCAGCTGGTTAGCCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCTCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGATTTCACTCTCACTATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTACTATTGTCAACAGGCTAACAGTTTCCCTGCACTTTTGGCCAGGGGACCAAGCTGGAGATCAAAC                                                                                                                                                                                   
            GACATCCAAATAACCCAGTCTCCATCTTCTGCGTCTCCATCTGGAGTCGACAGAGTCACCATCACTTCTCGGGCTAGTCAGGGCATTAGCAGCTGGTTAGCTTGGTATCAGCAGCAGCCAGGGCAAGCCCCTAAGCTCCTGTTCTATGCTGCATGCAGTTTGCAAAGTGGAGTCCCATCAAGGTTCAGCGGCCGTGGATCTGGGACAGATTTCACTCTCACTATCAGCAGCCTGCCGCCTGAAAATTTTGTACCTTACTATTGTCAACAGGCTAACAGGTTCCCTTGCCCTTTTGGCTAGGGGACAAATCTGGAGATCATTC                                                                                                                                                                                  
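
A toy sketch of why a stale insertion produces this kind of length mismatch. It assumes a simplified light chain model where the naive sequence is the V gene trimmed at its 3' end, then the VJ insertion, then the J gene trimmed at its 5' end. The function name, the argument names, and the short sequences are made up for illustration; this is not how partis or linearham actually assemble the sequence.

```python
def assemble_light_chain_naive(v_gene, j_gene, v_3p_del, j_5p_del, vj_insertion):
    """Toy model (an assumption): trimmed V + VJ insertion + trimmed J."""
    v_part = v_gene[:len(v_gene) - v_3p_del]  # drop v_3p_del bases from the 3' end of V
    j_part = j_gene[j_5p_del:]                # drop j_5p_del bases from the 5' end of J
    return v_part + vj_insertion + j_part


# Made-up germline fragments, not real IGK genes.
v_gene = "GACATCCAGATG"
j_gene = "TTCGGCCAAGGG"

# A mature sequence with a one-base VJ insertion: 11 + 1 + 11 = 23 bases.
mature_seq = "GACATCCAGAT" + "C" + "TCGGCCAAGGG"

# Deletions and insertion updated together: lengths agree.
consistent_naive = assemble_light_chain_naive(v_gene, j_gene, 1, 1, "C")

# Deletions updated but the insertion left empty: naive is one base short.
stale_naive = assemble_light_chain_naive(v_gene, j_gene, 1, 1, "")

print(len(consistent_naive), len(stale_naive), len(mature_seq))
```

In this toy case the stale version comes out one base shorter than the mature sequence, which is the same shape as the 321 vs 322 mismatch in the exception above.
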
psathyrella commented 2 years ago

This seems to be written by this line: https://github.com/matsengrp/linearham/blob/master/src/PhyloHMM.cpp#L320

but I can't figure out how the value of vd_junction_insertion_samp_ gets set.

psathyrella commented 2 years ago

Well, this (and this) gets around the issue by looking for a TRUE-valued VJ insertion and figuring out what it's supposed to be, but it would be a lot better to figure out why linearham is writing TRUE to begin with.
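
For anyone reading along, here is a minimal sketch of what that kind of workaround could look like. It is not the code from the linked changes. It assumes the insertion can be read back out of the naive sequence once the end of the V segment and the start of the J segment are known. `recover_vj_insertion` and all of its arguments are hypothetical names.

```python
def recover_vj_insertion(written_value, naive_seq, v_end, j_start):
    """Return the VJ insertion as a nucleotide string.

    written_value: whatever was written for the VJ insertion, either the
        actual bases or the literal string "TRUE" (the case in this issue).
    naive_seq: the full naive sequence.
    v_end, j_start: 0-based positions where the V segment ends (exclusive)
        and the J segment starts; assumed to be known by the caller.
    """
    if written_value.strip().upper() == "TRUE":
        # Only a flag was written, so take the bases between V and J instead.
        return naive_seq[v_end:j_start]
    # Anything else is passed through unchanged (other flags are not handled here).
    return written_value


# Toy example with a made-up sequence: V = "AAA", insertion = "C", J = "GGG".
print(recover_vj_insertion("TRUE", "AAACGGG", 3, 4))
print(recover_vj_insertion("ACG", "AAACGGG", 3, 4))
```

This only hides the symptom on the parsing side, so the question of where the TRUE comes from still stands.
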