Closed matsen closed 6 years ago
why don't you use 'all_matches'? then you won't have to modify the header consistency enforcement stuff. Can turn to ordered dict if you want to preserve order
On Mon, Oct 17, 2016 at 6:18 AM, Erick Matsen notifications@github.com wrote:
We need max and min of the qrbounds for each matched gene, as well as the relative alignment of the read to each matched gene. According to @psathyrella https://github.com/psathyrella , the best strategy is to put it into the infoline data structure.
For a qinfo that looks like this:
{'first_match_qrbounds': (0, 296), 'glbounds': {'IGHD2-15_01': (3, 31), 'IGHD2-2_03': (0, 31), 'IGHJ6_02': (3, 62), 'IGHJ6_03': (3, 62), 'IGHV1-2_02': (0, 296)}, 'matches': {'d': [(140, 'IGHD2-15_01'), (99, 'IGHD2-2_03')], 'j': [(288, 'IGHJ6_02'), (260, 'IGHJ6_03')], 'v': [(1473, 'IGHV1-2_02')]},'name': 'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28', 'new_indel': False, 'qrbounds': {'IGHD2-15_01': (302, 330), 'IGHD2-2_03': (299, 330), 'IGHJ6_02': (336, 395), 'IGHJ6_03': (336, 395), 'IGHV1-2*02': (0, 296)}, 'seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}
and an indelinfo like this:
{'V1-2_04-D2-15_01-J6_01_del3d_Tins296_331_vins28': { 'indels': [{'len': 6, 'pos': 27, 'seqstr': 'CCCCCC', 'type': 'insertion'}], 'reversed_seq': 'CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATTTTTTATATTGTAGTGGTGGTAGCTGCTACTCCGGGGGGACTACTACTACTACGGTATGGACGTCTGGGGGCAAGGGACCACGGTCACCGTCTCCTCAG'}}
I propose augmenting the infoline with two new items,
'qr_relpos': {'IGHD2-15_01': 299, 'IGHD2-2_03': 299, 'IGHJ6_02': 333, 'IGHJ6_03': 333, 'IGHV1-2*02': 0}
and
'qr_boundsbounds': {'d': (299, 330), 'j': (336, 395), 'v': (0, 296)}
The qr_relpos is simply the left hand qrbound minus the left hand glbound, and the boundsbounds is just the max and the min of each of the qrbounds across the genes.
There is already this:
swfo['regional_bounds'][region] = tuple([rb + padleft for rb in swfo['regional_bounds'][region]]) # I kind of want to just use a list now, but a.t.m. don't much feel like changing it everywhere else
which I don't totally get, but which evaluated on the above gives
'regional_bounds': {'d': (302, 330), 'j': (336, 395), 'v': (0, 296)},
so not exactly what I seem to want.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rc9teFvIaO6oNpmwD5s8UKsHRxMpBks5q03WYgaJpZM4KYnJi .
Are you proposing implementing my proposed qr_relpos
as a replacement for all_matches
? If so that's great.
What about qr_boundsbounds
?
Add whatever you need as values, as long as you keep the keys and order the same it won't affect anything else
On Oct 17, 2016 8:40 AM, "Erick Matsen" notifications@github.com wrote:
Are you proposing implementing my proposed qr_relpos as a replacement for all_matches? If so that's great.
What about qr_boundsbounds?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218#issuecomment-254245238, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rc67YUQFnP8cXgGg8mTRhjpkulr-Aks5q05cFgaJpZM4KYnJi .
OK, so all_matches
looks like {'j': ['IGHJ6*02', 'IGHJ6*03'], 'd': ['IGHD2-15*01', 'IGHD2-2*03'], 'v': ['IGHV1-2*02']}
so this could be wedged in as
{
'j': [(299, 330), [{'IGHJ6*02': 333}, {'IGHJ6*03': 333}]],
...
}
Is that better?
Almost, but you should keep the iteration properties the same
On Oct 17, 2016 9:15 AM, "Erick Matsen" notifications@github.com wrote:
OK, so all_matches looks like {'j': ['IGHJ6_02', 'IGHJ6_03'], 'd': ['IGHD2-15_01', 'IGHD2-2_03'], 'v': ['IGHV1-2*02']} so this could be wedged in as
{'j': [(299, 330), [{'IGHJ6_02': 333}, {'IGHJ6_03': 333}]], ... }
Is that better?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/218#issuecomment-254255221, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rcyRaxx6gX9kSbScVvU1QHtHcS6Wpks5q058qgaJpZM4KYnJi .
look, just grep for all_matches in the code -- it gets used in exactly one place outside waterer in partitiondriver. Your goal is to make it so that loop doesn't have to change.
After discussion, we agreed that this information doesn't fit neatly into all_matches
, so it's going to go into new columns in the HMM input CSV (actually a space-delimited TSV).
The steps for this are
qr_boundsbounds
and qr_relpos
to linekeys['sw']
in utils.py
get_kbounds
for qr_boundsbounds
in waterer.py
convert_qinfo
to add that information to the infoline in waterer.py
combine_queries
to combine that information in partitiondriver.py
write_to_single_input_file
to write that information out, also in partitiondriver.py
This is all done in draft form, as can be seen in https://github.com/psathyrella/partis/compare/27b978ff7130100b4f98a42bef0b0f70ae7ca6fe...218-linearham, using the old "fuzz" strategy. To make these definitions more clear, here's a picture:
The "flexbounds" are ranges of 0-indexed query-level coordinates of where the various gene matches could start and end. The "relpos" is the query-level coordinate of the untrimmed start of the gene. If the query read doesn't have the full V gene this will be negative.
And here is some example output (see the far right):
names k_v_min k_v_max k_d_min k_d_max mut_freq cdr3_length only_genes seqs flexbounds relpos
Dins_TTT 291 295 12 23 0.0 52 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGATTTGAGTATAGCAGCTCGTCCACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG "{""v_r"":[291,295],""d_r"":[305,316],""j_l"":[312,319],""j_r"":[363,367],""d_l"":[293,298],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":283,""IGHD6-6*01"":296,""IGHJ4*02"":317,""IGHD6-19*01"":296,""IGHJ5*02"":314}"
J5pdel_2 291 295 9 20 0.0 47 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGAGAGTATAGCAGCTCGTCCAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGNNNNN "{""v_r"":[291,295],""d_r"":[302,313],""j_l"":[308,314],""j_r"":[358,362],""d_l"":[291,300],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":280,""IGHD6-6*01"":293,""IGHJ4*02"":312,""IGHD6-19*01"":293,""IGHJ5*02"":309}"
catted 291 295 9 20 0.0 49 IGHV3-66*03:IGHV3-66*01:IGHJ5*02:IGHD2-15*01:IGHD6-6*01:IGHJ4*02:IGHD6-19*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGAGAGTATAGCAGCTCGTCCACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGNNN "{""v_r"":[291,295],""d_r"":[302,313],""j_l"":[309,316],""j_r"":[360,364],""d_l"":[291,300],""v_l"":[0,2]}" "{""IGHV3-66*03"":0,""IGHV3-66*01"":0,""IGHD2-15*01"":280,""IGHD6-6*01"":293,""IGHJ4*02"":314,""IGHD6-19*01"":293,""IGHJ5*02"":311}"
Note: the V gene read position uncertainty in partis is determined by [k_v_min, k_v_max)
(same for D genes).
We'd like to have the amounts by which reads got trimmed at the beginning and end; see https://matsengrp.slack.com/archives/bcell/p1481830831000123 and a few posts after that. Thanks!
ok, I think that's right. Let me know if there's a problem.
As far as the trimming, I think it makes more sense to just use --dont-remove-framework-insertions
, then you don't have to worry about it.
^It's been while since I used partis for bcrham testing, but I believe this is exactly what I did.
Happy dance! Thanks, Duncan!
Just started playing with the --linearham
option and things seem to work ok, but found some minor things that may need adjusting:
1) I was able to get a hmm_input.csv
output file, BUT the err
file said:
terminate called after throwing an instance of 'std::runtime_error'
what(): args.cc: found unexpected header flexbounds' in input file /home/dunleavy005/partis/testing_amrit_results/hmm_input.csv
bcrham never crashed over not knowing what flexbounds/relpos headers were before, so I'm wondering why its yelling now.
2) For some context, my input FASTA file had 4 dummy sequences: --> a catted VDJ germline sequence --> catted seq. with nucleotide padding to the left of the V and the right of the J --> catted seq. with deletions in the middle --> catted seq. with deletions to the left of V and right of J (what you term "convenience" deletions)
Relative to the hmm_input.csv
I obtained on old partis, the new partis does not append any framework N insertions to the left/right of the read for the DELETION sequences only. The flexbounds/relpos are still fine, but I wonder if there was anything substantial you changed since then that might cause this.
oh, I didn't know you were using bcrham to read it -- bcrham has a list of expected headers here that you can add it to, or if that's not clear I can do it.
I'm not sure that I understand the second question, but... without knowing what version of partis made the previous file, and even if I did there's little chance I'd remember all the changes. In any case maybe stop by irl next time you're here?
Yeah, maybe sometime tomorrow after group meeting?
sure
FWIW, I know it seems like this is completely dead, but in fact this is one of the next things that Amrit's going to be working on after wrapping up his current project on amino acid profiles. In any case, it's part of his thesis plan, so it's going to happen in the next 1.5 or so years.
see PR #254 and #255
We need max and min of the
qrbounds
for each matched gene, as well as the relative alignment of the read to each matched gene. According to @psathyrella , the best strategy is to put it into the infoline data structure.For a
qinfo
that looks like this:and an indelinfo like this:
I propose augmenting the
infoline
with two new items,and
The
qr_relpos
is simply the left hand qrbound minus the left hand glbound, and the boundsbounds is just the max and the min of each of the qrbounds across the genes.There is already this:
which I don't totally get, but which evaluated on the above gives
so not exactly what I seem to want.