Closed jotwin closed 8 years ago
If I run on that sequence (with an unrelated parameter set, but the annotation looks the same) it comes out as productive (mutated_invariants, in_frames, and stops columns in the pasted output below). If it's unproductive that should get printed in the sw debug output, too. What's making you conclude that it's unproductive?
unique_ids,v_gene,d_gene,j_gene,cdr3_length,seqs,naive_seq,indelfos,aligned_v_seqs,aligned_d_seqs,aligned_j_seqs,v_per_gene_support,d_per_gene_support,j_per_gene_support,v_3p_del,d_5p_del,d_3p_del,j_5p_del,v_5p_del,j_3p_del,vd_insertion,dj_insertion,fv_insertion,jf_insertion,mutated_invariants,in_frames,stops
foop,IGHV1-2*04,IGHD2-8*01,IGHJ3*02,42,GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATGCTGGTGATGGTGGAATAATCTACGCACAGAAGGTCCAGGGCAGGGTCACCATGACCAAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCATGGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAAT,GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATAATGGTGTAATTGATGCTTTTGATATCTGGGGCCAAGGGACAAT,"[{'reversed_seq': 'GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATGCTGGTGATGGTGGAATAATCTACGCACAGAAGGTCCAGGGCAGGGTCACCATGACCAAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCATGGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAAT', 'indels': [{'seqstr': 'GGCAGGGTCACCATGACCA', 'type': 'deletion', 'pos': 126, 'len': 19}]}]",NNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATGCTGGT......GATGGTGGAATAATCTACGCACAGAAGGTCCAG...GGCAGGGTCACCATGACCAAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA,NNNNNNNNNNN...N.......TCATGGAGTANN......N...NNNNNN...,...............TAATGCT...T...............TTGATATCTGGGGCCAAGGGACAATNNNNNNNNNNNNNNNNN,IGHV1-2*04:0.963037400215;IGHV1-2*02:0.036962476457;IGHV1/OR15-1*04:1.23328287775e-07,IGHD2-8*01:0.944397466123;IGHD3-10*01:0.0364404755364;IGHD3-16*02:0.0104692786133;IGHD2-2*03:0.00850340032854;IGHD2-15*01:0.000189379398261,IGHJ3*02:1.0,0,12,9,0,69,17,,AT,,,False,True,False
The annotation output labeled it as out of frame and with a stop codon
unique_ids,v_gene,d_gene,j_gene,cdr3_length,seqs,naive_seq,indelfos,cyst_position,tryp_position,aligned_v_seqs,aligned_d_seqs,aligned_j_seqs,v_per_gene_support,d_per_gene_support,j_per_gene_support,v_3p_del,d_5p_del,d_3p_del,j_5p_del,v_5p_del,j_3p_del,vd_insertion,dj_insertion,fv_insertion,jf_insertion,mutated_invariants,in_frames,stops
ERR769569.18434-2,IGHV1-69-2*01,IGHD3-10*02,IGHJ3*02,178,GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATGCTGGTGATGGTGGAATAATCTACGCACAGAAGGTCCAGGGCAGAGTCACCATGACCGAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCATGGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAAT,GTTTCTGGATACACCTTCACCGACTACTACATGCACTGGGTGCAACAGGCCCCTGGAAAAGGGCTTGAGTGGATGGGACTTGTTGATCCTGAAGATGGTGAAACAATATACGCAGAGAAGTTCCAGGGCAGAGTCACCATAACCGCGGACACGTCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCGGGGAGTTATTAATGCTTTTGATATCTGGGGCCAAGGGACAAT,"[{'reversed_seq': 'GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATGCTGGTGATGGTGGAATAATCTACGCACAGAAGGTCCAGGGCAGAGTCACCATGACCGAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCATGGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAAT', 'indels': [{'seqstr': 'GGGCAGAGTCACCATGACC', 'type': 'deletion', 'pos': 125, 'len': 19}]}]",80,255,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCTTCTGG......ATACACCTTCACCGG............CTACTATATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTT.................................TGATGC......TGGTGA......TGGTGGAATAATCTACGCACA........................GAAGGTCCAGGGCAGAGTCACCATGA.........................CCGAAGACACATCTACAGACACAGCCTACATGGAGGTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA......,NNNNNNNNNNN...........NTCATGGAGTAA......T...TANNNN...,...............NNATGCT...T...............TTGATATCTGGGGCCAAGGGACAATNNNNNNNNNNNNNNNNN,IGHV1-69-2*01:0.999694680371;IGHV1-24*01:0.000305319628971,IGHD3-10*02:0.991278074782;IGHD3-3*02:0.00709343774334;IGHD2-8*01:0.0015239019153;IGHD3-10*01:7.31114781405e-05;IGHD3-3*01:3.1474080761e-05,IGHJ3*02:1.0,0,12,4,2,69,17,,,,,True,False,True
Could you send a tgz of the parameter directory so I can try to replicate this?
Ok I sent it to you. I was looking at my parameters and saw there were a lot of outliers, like very long indels and cdr3_lengths. I also noticed that for some annotated sequences the conserved cystein position was wrong, leading to bad cdr3_lengths and wrong in-frame.
[1] "GCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGAAAAGGGCTTGAGTGGATGGGAGGATTTGATGCTGAAGATGGTGAAACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[2] "GCATCTGGATACAGCTTCACCAGCTATTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGCTTTGATGCTGGAGATGGTGGCATAATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGTGGACACGGCCGTGTATTACTGTGTAACAGATCGAATGGTTATCAACGCTTTTGATATTTGGGGCCAAGGGACAAT"
[3] "GCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGCTTTGATGCTGGTGATGGTGGAATAATCTACGCACAGAAGGTCCAGGGCAGAGTCACCATGACCGAAGACATATCTATAGACACAGCCTACATGGAGGTGAGCAGCCTGCGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGATCATGGAGTCATTAAGGCTTTTAATATCTGGGGCCAAGGGACAAG"
[4] "GCTTCTGGATACACCTTCACTAGCTATGCTATGCACTGGGTGCGACAGGCTCCTGGAAAAGGACTTGAGTGGATGGGAGGTCTTGATCCTGAAGACGATAAGATAATCTACGCACAGAATTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGTTGAGCAGCCTGAGATCTGAGGACACGGCCGTATATTACTGTGTAACAGATCGAGTGGTTATAAATGCTTTTAATATTTGGGGCCAAGGGACAAT"
[5] "GCTTCTGGTTATAGGTTTACGAGTTATGGCATCAGTTGGGTGCGACAGGCCCCAGGACAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[6] "GCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[7] "GCTTCTGGATACACCTTCACCAGTTATGATATCAACTGGGTGCGACAGGCCACTGGAAAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[8] "GCATCTGGATACACCTTCACCAGCTACTATATACACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[9] "ACTTCTGGTTACACGTTTACCAATTATGGTATCAACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
[10] "GCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGAAAAGGGCTTGAGTCCATGGGAGGTTTTAATCCTGAAGATGGTGACACAATCTACGCACAGAAATTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGTCTACATGGAGCTAAGGAGCCTGAGATCTGAGGACACGGCCGTCTATTACTGTGCAACAGGTCACAGAGTAATTAATGCTTTTGATATCTGGGGCCAAGGGACAGT"
very long indels and cdr3_lengths.
hm, ok, that doesn't surprise me very much -- it's trying to make these sequences look like legitimate BCR sequences (it doesn't know that it's supposed to find a big deletion in every sequence where the reads don't quite meet each other), and I would expect there to be cases where this conflict is apparent.
I'll look at it carefully tomorrow, I want to make sure all the indel reversion stuff is legit since I haven't looked at it in a while.
the conserved cystein position was wrong
in which sense of wrong? You know from outside information that it's somewhere other than we report it?
Got it running... and ah, I see what you mean, the cyst position is blatantly ridiculous. Looks like a bug, *#*$!
yeah! Thanks for noticing! Maybe I don't adjust the cyst position in the annotation when I un-reverse the indels? Should be easy to fix.
ha, it's not anything so complicated -- it's just that the cysteine position I have for that gene is at a TGT that's nowhere near the CDR3. I think that gene's been in the file since I first got it, but last week I finished writing stuff to infer missing cysteine positions, so I'll add in a line to make sure the cysteine positions in the csv file are consistent with the imgt alignments.
It'll take me a bit (~few days) to merge into master, but if you want to fix it quickly you can just replace that 149 with 285.
almost merged in... but thought I'd mention here that you actually can't validate the cysteine positions in the genes from the germline set, because a large fraction (10-20%) of imgt F + ORF are truncated before the conserved cysteine.
Merged it.
I noticed that many sequences that were annotated as out of frame and with stop codons, that is unproductive, appear to be in frame and without stops when I translate to AA. Igblast also labeled these sequences as productive. I have some output of --debug 1 for one of these queries, which I don't know how to interpret. My sequences are joined from two non-overlapping reads, which is why partis finds a big deletion. Is that messing up the annotation of productivity?