psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
55 stars 34 forks source link

Output more S-W details for linearham #218

Closed matsen closed 6 years ago

matsen commented 7 years ago

We need max and min of the qrbounds for each matched gene, as well as the relative alignment of the read to each matched gene. According to @psathyrella , the best strategy is to put it into the infoline data structure.

For a qinfo that looks like this:

{'first_match_qrbounds': (0, 296),
 'glbounds': {'IGHD2-15*01': (3, 31),
              'IGHD2-2*03': (0, 31),
              'IGHJ6*02': (3, 62),
              'IGHJ6*03': (3, 62),
              'IGHV1-2*02': (0, 296)},
 'matches': {'d': [(140, 'IGHD2-15*01'), (99, 'IGHD2-2*03')],
             'j': [(288, 'IGHJ6*02'), (260, 'IGHJ6*03')],
             'v': [(1473, 'IGHV1-2*02')]},
'name': 'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28',
 'new_indel': False,
 'qrbounds': {'IGHD2-15*01': (302, 330),
              'IGHD2-2*03': (299, 330),
              'IGHJ6*02': (336, 395),
              'IGHJ6*03': (336, 395),
              'IGHV1-2*02': (0, 296)},
 'seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGT
ATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}

and an indelinfo like this:

{'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28': {
   'indels': [{'len': 6,
   'pos': 27,
   'seqstr': 'CCCCCC',
   'type': 'insertion'}],
   'reversed_seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGAC
ACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}}

I propose augmenting the infoline with two new items,

 'qr_relpos': {'IGHD2-15*01': 299,
              'IGHD2-2*03': 299,
              'IGHJ6*02': 333,
              'IGHJ6*03': 333,
              'IGHV1-2*02': 0}

and

 'qr_boundsbounds': {'d': (299, 330),
              'j': (336, 395),
              'v': (0, 296)}

The qr_relpos is simply the left hand qrbound minus the left hand glbound, and the boundsbounds is just the max and the min of each of the qrbounds across the genes.

There is already this:

                swfo['regional_bounds'][region] = tuple([rb + padleft for rb in swfo['regional_bounds'][region]])  # I kind of want to just use a list now, but a.t.m. don't much feel like changing it everywhere else

which I don't totally get, but which evaluated on the above gives

 'regional_bounds': {'d': (302, 330), 'j': (336, 395), 'v': (0, 296)},

so not exactly what I seem to want.

psathyrella commented 7 years ago

why don't you use 'all_matches'? then you won't have to modify the header consistency enforcement stuff. Can turn to ordered dict if you want to preserve order

On Mon, Oct 17, 2016 at 6:18 AM, Erick Matsen notifications@github.com wrote:

We need max and min of the qrbounds for each matched gene, as well as the relative alignment of the read to each matched gene. According to @psathyrella https://github.com/psathyrella , the best strategy is to put it into the infoline data structure.

For a qinfo that looks like this:

{'first_match_qrbounds': (0, 296), 'glbounds': {'IGHD2-15_01': (3, 31), 'IGHD2-2_03': (0, 31), 'IGHJ6_02': (3, 62), 'IGHJ6_03': (3, 62), 'IGHV1-2_02': (0, 296)}, 'matches': {'d': [(140, 'IGHD2-15_01'), (99, 'IGHD2-2_03')], 'j': [(288, 'IGHJ6_02'), (260, 'IGHJ6_03')], 'v': [(1473, 'IGHV1-2_02')]},'name': 'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28', 'new_indel': False, 'qrbounds': {'IGHD2-15_01': (302, 330), 'IGHD2-2_03': (299, 330), 'IGHJ6_02': (336, 395), 'IGHJ6_03': (336, 395), 'IGHV1-2*02': (0, 296)}, 'seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}

and an indelinfo like this:

{'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28': { 'indels': [{'len': 6, 'pos': 27, 'seqstr': 'CCCCCC', 'type': 'insertion'}], 'reversed_seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}}

I propose augmenting the infoline with two new items,

'qr_relpos': {'IGHD2-15_01': 299, 'IGHD2-2_03': 299, 'IGHJ6_02': 333, 'IGHJ6_03': 333, 'IGHV1-2*02': 0}

and

'qr_boundsbounds': {'d': (299, 330), 'j': (336, 395), 'v': (0, 296)}

The qr_relpos is simply the left hand qrbound minus the left hand glbound, and the boundsbounds is just the max and the min of each of the qrbounds across the genes.

There is already this:

            swfo['regional_bounds'][region] = tuple([rb + padleft for rb in swfo['regional_bounds'][region]])  # I kind of want to just use a list now, but a.t.m. don't much feel like changing it everywhere else

which I don't totally get, but which evaluated on the above gives

'regional_bounds': {'d': (302, 330), 'j': (336, 395), 'v': (0, 296)},

so not exactly what I seem to want.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rc9teFvIaO6oNpmwD5s8UKsHRxMpBks5q03WYgaJpZM4KYnJi .

matsen commented 7 years ago

Are you proposing implementing my proposed qr_relpos as a replacement for all_matches? If so that's great.

What about qr_boundsbounds?

psathyrella commented 7 years ago

Add whatever you need as values, as long as you keep the keys and order the same it won't affect anything else

On Oct 17, 2016 8:40 AM, "Erick Matsen" notifications@github.com wrote:

Are you proposing implementing my proposed qr_relpos as a replacement for all_matches? If so that's great.

What about qr_boundsbounds?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218#issuecomment-254245238, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rc67YUQFnP8cXgGg8mTRhjpkulr-Aks5q05cFgaJpZM4KYnJi .

matsen commented 7 years ago

OK, so all_matches looks like {'j': ['IGHJ6*02', 'IGHJ6*03'], 'd': ['IGHD2-15*01', 'IGHD2-2*03'], 'v': ['IGHV1-2*02']} so this could be wedged in as

{
'j': [(299, 330), [{'IGHJ6*02': 333}, {'IGHJ6*03': 333}]], 
...
}

Is that better?

psathyrella commented 7 years ago

Almost, but you should keep the iteration properties the same

On Oct 17, 2016 9:15 AM, "Erick Matsen" notifications@github.com wrote:

OK, so all_matches looks like {'j': ['IGHJ6_02', 'IGHJ6_03'], 'd': ['IGHD2-15_01', 'IGHD2-2_03'], 'v': ['IGHV1-2*02']} so this could be wedged in as

{'j': [(299, 330), [{'IGHJ6_02': 333}, {'IGHJ6_03': 333}]], ... }

Is that better?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218#issuecomment-254255221, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rcyRaxx6gX9kSbScVvU1QHtHcS6Wpks5q058qgaJpZM4KYnJi .

psathyrella commented 7 years ago

look, just grep for all_matches in the code -- it gets used in exactly one place outside waterer in partitiondriver. Your goal is to make it so that loop doesn't have to change.

matsen commented 7 years ago

After discussion, we agreed that this information doesn't fit neatly into all_matches, so it's going to go into new columns in the HMM input CSV (actually a space-delimited TSV).

The steps for this are

matsen commented 7 years ago

This is all done in draft form, as can be seen in https://github.com/psathyrella/partis/compare/27b978ff7130100b4f98a42bef0b0f70ae7ca6fe...218-linearham, using the old "fuzz" strategy. To make these definitions more clear, here's a picture:

The "flexbounds" are ranges of 0-indexed query-level coordinates of where the various gene matches could start and end. The "relpos" is the query-level coordinate of the untrimmed start of the gene. If the query read doesn't have the full V gene this will be negative.

And here is some example output (see the far right):

names k_v_min k_v_max k_d_min k_d_max mut_freq cdr3_length only_genes seqs flexbounds relpos
Dins_TTT 291 295 12 23 0.0 52 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGATTTGAGTATAGCAGCTCGTCCACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG "{""v_r"":[291,295],""d_r"":[305,316],""j_l"":[312,319],""j_r"":[363,367],""d_l"":[293,298],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":283,""IGHD6-6*01"":296,""IGHJ4*02"":317,""IGHD6-19*01"":296,""IGHJ5*02"":314}"
J5pdel_2 291 295 9 20 0.0 47 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGAGAGTATAGCAGCTCGTCCAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGNNNNN "{""v_r"":[291,295],""d_r"":[302,313],""j_l"":[308,314],""j_r"":[358,362],""d_l"":[291,300],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":280,""IGHD6-6*01"":293,""IGHJ4*02"":312,""IGHD6-19*01"":293,""IGHJ5*02"":309}"
catted 291 295 9 20 0.0 49 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGAGAGTATAGCAGCTCGTCCACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGNNN "{""v_r"":[291,295],""d_r"":[302,313],""j_l"":[309,316],""j_r"":[360,364],""d_l"":[291,300],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":280,""IGHD6-6*01"":293,""IGHJ4*02"":314,""IGHD6-19*01"":293,""IGHJ5*02"":311}"
dunleavy005 commented 7 years ago

Note: the V gene read position uncertainty in partis is determined by [k_v_min, k_v_max) (same for D genes).

matsen commented 7 years ago

We'd like to have the amounts by which reads got trimmed at the beginning and end; see https://matsengrp.slack.com/archives/bcell/p1481830831000123 and a few posts after that. Thanks!

psathyrella commented 7 years ago

ok, I think that's right. Let me know if there's a problem.

As far as the trimming, I think it makes more sense to just use --dont-remove-framework-insertions, then you don't have to worry about it.

dunleavy005 commented 7 years ago

^It's been while since I used partis for bcrham testing, but I believe this is exactly what I did.

matsen commented 7 years ago

Happy dance! Thanks, Duncan!

dunleavy005 commented 7 years ago

Just started playing with the --linearham option and things seem to work ok, but found some minor things that may need adjusting:

1) I was able to get a hmm_input.csv output file, BUT the err file said:

terminate called after throwing an instance of 'std::runtime_error'
  what():  args.cc: found unexpected header flexbounds' in input file /home/dunleavy005/partis/testing_amrit_results/hmm_input.csv

bcrham never crashed over not knowing what flexbounds/relpos headers were before, so I'm wondering why its yelling now.

2) For some context, my input FASTA file had 4 dummy sequences: --> a catted VDJ germline sequence --> catted seq. with nucleotide padding to the left of the V and the right of the J --> catted seq. with deletions in the middle --> catted seq. with deletions to the left of V and right of J (what you term "convenience" deletions)

Relative to the hmm_input.csv I obtained on old partis, the new partis does not append any framework N insertions to the left/right of the read for the DELETION sequences only. The flexbounds/relpos are still fine, but I wonder if there was anything substantial you changed since then that might cause this.

psathyrella commented 7 years ago

oh, I didn't know you were using bcrham to read it -- bcrham has a list of expected headers here that you can add it to, or if that's not clear I can do it.

I'm not sure that I understand the second question, but... without knowing what version of partis made the previous file, and even if I did there's little chance I'd remember all the changes. In any case maybe stop by irl next time you're here?

dunleavy005 commented 7 years ago

Yeah, maybe sometime tomorrow after group meeting?

psathyrella commented 7 years ago

sure

matsen commented 6 years ago

FWIW, I know it seems like this is completely dead, but in fact this is one of the next things that Amrit's going to be working on after wrapping up his current project on amino acid profiles. In any case, it's part of his thesis plan, so it's going to happen in the next 1.5 or so years.

dunleavy005 commented 6 years ago

see PR #254 and #255