itsjohncs / 141-assignment

Repository for an assignment for CS 141 at UCR

Potentially incorrect reference sequence being output. #17

Closed itsjohncs closed 10 years ago

itsjohncs commented 10 years ago

The following is a screenshot of one run-through. I'm not sure what's going on yet but I think the score() function isn't returning the full alignment given that the score is probably not what is shown. Will require more investigation.

@AmateurHour do you have any thoughts on this?

screenshot from 2013-12-08 02 09 44

jordanjmeyer commented 10 years ago

The alignment strings may be output incorrectly. I have had a very long day, so I will check this first thing in the morning.

The scores seem reasonable based on the string size. If the input strings are not viewable in the repo, could you make it clear what the top input string is that is instead being printed as "A" so I can thoroughly examine the algorithm's run-through?

Thanks, I'll get back to you as soon as I can.

jordanjmeyer commented 10 years ago

I am looking into the alignment output now. I'm not sure where it's going wrong.

I'll get back to you as soon as I figure out the problem.


itsjohncs commented 10 years ago

Sounds good, I'll be working on things in an hour or so as well. It might just be my output function or main logic, though I didn't see anything in there that would cause it. This might be a tricky bug to hunt down.

jordanjmeyer commented 10 years ago

Yeah, I believe it's because the reference is a substring of the query, but I'm not sure why it's messing up. I'm also having a friend who is in charge of his group's algorithm design look at the algorithm, if you don't mind.


jordanjmeyer commented 10 years ago

I believe I am running the same tests as you did in the screenshot. I used the "our_tiny_query" string as the reference and ran each of the following against it:

Panthera leo isolate 8c cytochrome b gene, complete cds
Panthera leo isolate 8a cytochrome b gene, complete cds
Panthera leo isolate 5c cytochrome b gene, complete cds

I am getting scores of 41.15, which is 0.2 off the scores you got (meaning a gap is being counted in your test and not mine), and aligned strings with leading gaps, e.g.:

(41.15, '____AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG', 'ATGACCAACATTCGAAAATCACACCCCCTTGTCAAAATTATTAACCACTCATTCATTGATCTTCCCACTCCACCCAATATCTCAGCATGATGAAACTTTGGCTCCTTATTAGGAGTATGTTTAATCCTACAAATTCTCACCGGCCTCTTTCTAGCCATACATTACACACCAGACACAATAACCGCTTTCTCATCAGTCACCCACATTTGCCGCGATGTAAACTATGGCTGAATTATCCGGTACCTACACGCCAACGGAGCCTCCATATTCTTTATCTGCCTATACATGCATGTAGGACGAGGAATATACTATGGCTCCTATACTTTCTCAGAAACATGAAATATTGGAATCATATTGTTGCTCACAGTTATAGCTACAGCCTTCATAGGATATGTCTTACCGTGGGGCCAAATATCCTTTTGAGGTGCAACTGTAATCACTAATCTCCTATCAGCAATCCCATACATCGGGGCCGACCTAGTAGAGTGGATCTGAGGAGGCTTCTCAGTAGACAAAGCCACCCTGACACGATTCTTTGCCTTCCACTTCATCCTTCCATTTATCATCTCAGCCCTAGCAGCAGTCCACCTCCTATTCCTCCATGAAACAGGATCTAATAACCCCTCAGGAATGGTATCTGACTCAGATAAAATTCCATTCCATCCATACTATACAATCAAAGATATCCTAGGCCTTCTAGTACTAATCCTAACACTCACACTACTCGTCCTATTCTCACCAGACCTATTAGGAGATCCCGATAACTATACCCCCGCCAATCCTCTAAGCACCCCTCCCCATATCAAACCTGAATGGTACTTCCTATTTGCATATGCAATCCTCCGATCTATTCCCAATAAACTAGGAGGAGTTCTAGCCCTAGTTCTATCCATCTTAATCTTAGCAATTATCCCTGCCCTCCACACTTCCAAACAGCGAGGAATAATGTTTCGACCACTAAGTCAATGCTTATTCTGATTCCTAGTAGCGGACCTTCTGACCCTGACATGAATTGGTGGCCAACCTGTAGAACACCCCTTCATCACCATCGGCCAACTAGCCTCCATCCTATACTTCTCCACTCTTCTAATCCTAATACCCATCTCAGGCATTATTGAAAACCGCCTCCTCAAATGAAGAGTCTTCGTAGTATATAGAATACTTTGGTCTTGTAAACCAAAAAAGGAGAACGCGTACCCTCCCTAAGACTTCAAGGAAGAAGCAATAGCCCCACCATCAGCACCCAAAGCTGAAATTCTTTCTTAAACTATTCCTTGCTAATACCAAAAAATAACCCCGTAACTTTCACAATTCATATATTGCATATACCCATACTGTGCTTGCCCAGTATGTCCTTATTCCCCACGAAAAGCAAGTGAAAATCCCCACCCTCCACAACACAAACGCACAATGTAAAATAACCAGTCAACTTTCTTTTTCCCACATACACTGTATCATCGACTACCCTCCCATGAATATTAAGCATGTACAGTAGTTTATATATATTACATAAGGCATACTATGTATATCGGGCATTAACTGCTT____GAATA_AGCATGTAC_CAG_AGT__ATA_ATATATTAC____CTA_ATG_ATATCG_GTG_GCA___ACT__GTC_ATG_GAATA_TAAGC____AGT_AGTTATA_ATTAC____AAG')

I believe this is the correct string alignment because the end of the query (Panthera leo isolate 5c cytochrome b gene, complete cds) matches up with the reference ("our_tiny_query") relatively well.

I will continue to examine the tests and find where the 0.2 difference is coming from.


itsjohncs commented 10 years ago

I think your code is shortening the reference sequence you return to be the same size as the query sequence, and we end up losing data. Here's some debugging output from me putzing around...

name: gi|253409428|ref|GQ227366.1| Influenza A virus (A/pika/Qinghai/BI/2007(H5N1)) segment 1 polymerase PB2 (PB2) gene, complete cds
score: 39.0

organism.sequence: ATGGAGAGAATAAAGGAATTAAGAGATCTAATGTCACAGTCCCGCACTCGCGAGATACTAACAAAGACCACTGTGGACCATATGGCCATAATCAAGAAATACACATCAGGAAGACAAGAGAAGAACCCTGCTCTCAGAATGAAATGGATGATGGCAATGAAATATCCAATCACAGCGGACAAGAGAATAATAGAGATGATTCCTGAAAGGAATGAACAAGGACAGACACTCTGGAGCAAGACAAATGATGCTGGATCGGACAGGGTGATGGTGTCTCCCCTAGCTGTAACTTGGTGGAATAGGAATGGGCCGACGACAAGTACAGTTCATTATCCAAAGGTTTACAAAACATACTTTGAGAAGGTTGAAAGGTTAAAACATGGAACCTTCGGTCCCGTTCATTTCCGAAACCAAGTTAAAATACGCCGCCGAGTTGATACAAATCCTGGCCATGCAGATCTCAGTGCTAAAGAAGCACAAGATGTCATCATGGAGGTCGTTTTCCCAAATGAAGTGGGAGCTAGAATATTGACTTCAGAGTCACAGTTGACAATAACGAAAGAGAAAAAAGAAGAGCTCCAAGATTGTAAGATTGCTCCCTTAATGGTTGCATACATGTTGGAAAGGGAACTGGTCCGCAAAACCAGATTCCTACCAGTAGCAGGCGGAACAAGCAGTGTGTACATTGAGGTATTGCATTTGACTCAAGGAACCTGCTGGGCACAGATGTACACTCCAGGCGGAGAAGTAAGAAATGACGATGTTGACCAGAGTTTGATCATTGCTGCCAGAAACATTGTTAGGAGAGCAACGGTATCAGCGGATCCACTGGCATCACTGCTGGAGATGTGTCACAGCACACAAATTGGTGGGATAAGGATGGTGGACATCCTTAGGCAAACTCCAACTGAGGAACAAGCTGTGGATATATGCAAAGCAGCAATGGGTCTGAGGATTAGTTCATCCTTTAGCTTTGGAGGCTTCACTTTCAAAAGAACAAGTGGATCATCCGCCACGAAGGAAGAGGAAGTGCTTACAGGCAACCTCCAAACATTGAAAATAAGAGTACATGAGGGGTATGAGGAGTTCACAATGGTTGGGCAGAGGGCAACAGCTATCCTGAGGAAAGCAACTAGAAGGCTGATTCAGTTGATAGTAAGTGGAAGAAACGAACAATCAATCGCTGAGGCAATCATTGTAGCAATGGTGTTCTCACAGGAGGATCGCATGATAAAAGCAGTCCGAGGCGATCTGAATTTCGTAAACAGAGCAAACCAAAGATTAAACCCCATGCATCAACTCCTGAGACATTTTCAAAAGGACGCAAAAGTGCTATTTCAGAATTGGGGAACTGAGCCAATTGATAATGTCATGGGGATGATCGGAATATTACCTGACATGACTCCCAGCACAGAAACGTCACTGAGAGGAGTGAGAGTTAGTAAAATGGGAGTAGATGAGTATTCCAGCACTGAGAGAGTAGTTGTAAGCATTGACCGCTTCTTAAGGGTTCGAGACCAGCGGGGGAACGTACTCTTATCTCCCGAAGAGGTCAGCGAAACCCAGGGAACAGAGAAGTTGACAATAACATATTCATCATCAATGATGTGGGAAATCAACGGTCCTGAGTCAGTGCTTGTTAACACTTACCAATGGATCATTAGAAACTGGGAGACCGTGAAAATTCAGTGGTCTCAGGACCCCACGATGTTGTACAATAAGATGGAGTTTGAACCGTTCCAATCCTTGGTACCTAAAGCTGCCAGAGGTCAATACAGTGGATTTGTGAGAACATTATTCCAACAAATGCGTGACGTACTGGGGACATTTGATACTGTCCAGATAATAAAGCTGCTACCATTTGCAGCAGCCCCACCGAAGCAGAGCAGAATGCAGTTTTCTTCTCTAACTGTGAATGTGAGAGGCTCAGGAATGAGAATACTCATAAGGGGCAATTCCCCTGTGTTCAACTACAATAAGG
CAACCCAAAGACTTACCGTTCTTGGAAAGGACGCAGGTGCATTAACAGAGGATCCAGATGAGGGGACAGCCGGAGTGGAATCTGCAGTACTGAGGGGGTTCCTAATTCTAGGCAAGGAGGACAAAAGATATGGACCAGCATTGAGCATCAATGAACTGAGCAATCTTGCAAAAGGGGAGAAAGCTAATGTGCTGATAGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAAACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGTCGAATTGTTTAAAAACGACCTTGTTTCTACT
reference_alignment: ________________________________________________

query: AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG

query_alignment: GCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG
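
A quick way to catch this kind of data loss is to check the returned strings against the inputs. The sketch below assumes a global alignment (every character of both inputs should show up in the output) and that "_" is the gap character, as in the debug output; `alignment_problems` is a hypothetical helper name, not something in the repo.

```python
GAP = "_"  # assumed gap character, matching the debug output


def alignment_problems(reference, query, reference_alignment, query_alignment):
    """Return a list of problems found; an empty list means the output is consistent."""
    problems = []
    # Both aligned strings describe the same columns, so lengths must match.
    if len(reference_alignment) != len(query_alignment):
        problems.append(
            "aligned strings differ in length: %d vs %d"
            % (len(reference_alignment), len(query_alignment))
        )
    # Stripping the gaps should give back the original input, character for character.
    if reference_alignment.replace(GAP, "") != reference:
        problems.append("reference_alignment does not spell out the reference")
    if query_alignment.replace(GAP, "") != query:
        problems.append("query_alignment does not spell out the query")
    return problems


if __name__ == "__main__":
    # The last "A" of the query is missing from its alignment, so one problem is reported.
    print(alignment_problems("GATTACA", "ATTA", "GATTACA", "_ATT___"))
```

Running something like this on the output above would flag the query_alignment, since it is one character shorter than the query.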
jordanjmeyer commented 10 years ago

I think you're onto something. I will try to fix the problem as soon as I get back to a computer.

jordanjmeyer commented 10 years ago

I believe that is happening because you may be treating the search string as the query and the organism's sequence as the reference, when it should be the other way around. The reference is the string we are searching for (usually it will be the shorter one), and the query is the one we are questioning against the reference. The query should be the organism's sequence, and the reference should be the string all searches are based on.

I think this was another mismatch in the spec's portrayal of query vs reference. Originally in our stub for the scoring function, the function signature was def score(query_string, other_string, sub_score, gap_score):

I had changed the signature of my implementation to match the clarification Nick made during lab last week. What I originally thought was the query is actually the reference, and what I thought was just the other string is actually the query.

If you run the tests with the inputs swapped so that what you have labeled as the query is passed in as the reference, and the organism's sequence is passed in as the second argument (as the actual query parameter), the alignments should be correct.

I hope I was clear in explaining what I believe to be the problem.
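
One way to make this kind of mix-up harder is to force callers to name each role. This is only a sketch: `score_named`, `Alignment`, and `fake_score` are made-up names, and it assumes the scoring function takes the reference first and the query second and returns (score, reference alignment, query alignment).

```python
from collections import namedtuple

# Labelled result, so nobody has to remember which string came back first.
Alignment = namedtuple("Alignment", "score reference_alignment query_alignment")


def score_named(score_fn, *, reference, query, sub_score, gap_score):
    """Call score_fn(reference, query, sub_score, gap_score) with every role named.

    The bare * makes the four roles keyword-only, so passing them
    positionally (and possibly swapped) raises a TypeError.
    """
    return Alignment(*score_fn(reference, query, sub_score, gap_score))


def fake_score(reference, query, sub_score, gap_score):
    # Stand-in for the real implementation; it only echoes its inputs.
    return (0.0, reference, query)


if __name__ == "__main__":
    result = score_named(
        fake_score,
        reference="AGCG",     # the short string we search for
        query="ATGGAGCG",     # the organism's sequence
        sub_score=None,
        gap_score=-0.2,
    )
    print(result.reference_alignment, result.query_alignment)
```

The call site then reads the same way the roles are described above, whatever order the underlying function wants.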


jordanjmeyer commented 10 years ago

Passing in the arguments in reverse order would cause the initial error you showed me in the screenshot, and it would also explain why our scores differed by 0.2, because of how the initial scoring matrix is built.
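
To illustrate how matrix initialization alone can make the score depend on argument order, here is a small sketch. It is an assumption about the mechanism, not the code in the repo: the first row is left at zero (skipping a prefix of the query is free) while the first column is charged the gap score. `asymmetric_score` and its default scores are hypothetical.

```python
def asymmetric_score(reference, query, match=1.0, mismatch=-1.0, gap=-0.2):
    """Alignment score where only one of the two border lines is penalized."""
    rows, cols = len(reference) + 1, len(query) + 1
    m = [[0.0] * cols for _ in range(rows)]
    # First row stays 0.0: leading query characters can be skipped for free.
    # First column is penalized: leading reference characters cost a gap each.
    for i in range(1, rows):
        m[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if reference[i - 1] == query[j - 1] else mismatch
            m[i][j] = max(
                m[i - 1][j - 1] + sub,  # align the two characters
                m[i - 1][j] + gap,      # gap in the query
                m[i][j - 1] + gap,      # gap in the reference
            )
    return m[-1][-1]


if __name__ == "__main__":
    print(asymmetric_score("GA", "TTGA"))  # short string as the reference
    print(asymmetric_score("TTGA", "GA"))  # same inputs, swapped
```

With the short string as the reference the two leading characters of the query cost nothing; swapped, they are charged as two gaps, so the same pair of strings scores lower.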

I think I had just spent so much time figuring out all the stupid stuff that is unclear in the spec that I completely forgot to let you know how the parameters actually function. I'm sorry for all the confusion and probably frustration.

itsjohncs commented 10 years ago

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.
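
To check whether an alignment "adds up", the score can be recomputed straight from the two aligned strings. A minimal sketch, assuming a flat per-gap penalty, "_" as the gap character, and a `sub_score` that can be called on two characters (the real `sub_score` may well be a table instead); `rescore` is a hypothetical helper.

```python
GAP = "_"  # assumed gap character


def rescore(reference_alignment, query_alignment, sub_score, gap_score):
    """Sum the column-by-column score of two already-aligned strings."""
    if len(reference_alignment) != len(query_alignment):
        raise ValueError("aligned strings must have the same length")
    total = 0.0
    for r, q in zip(reference_alignment, query_alignment):
        if r == GAP and q == GAP:
            raise ValueError("a column cannot hold two gaps")
        if r == GAP or q == GAP:
            total += gap_score        # one side is a gap
        else:
            total += sub_score(r, q)  # match or mismatch
    return total


if __name__ == "__main__":
    simple = lambda a, b: 1.0 if a == b else -1.0
    print(rescore("__GA", "TTGA", simple, -0.2))
```

If the number this produces differs from what score() reports, either the alignment strings or the score are off. Note that this counts leading gaps too, so it will disagree on purpose if leading gaps are supposed to be free.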


jordanjmeyer commented 10 years ago

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' a += '_' j -= 1

yields the proper alignment. The score makes much more sense like that. The one added line does not alter the score, and the code still gives the correct results for the original test cases.
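
For comparison, here is a self-contained sketch of a textbook global alignment with a backtrace in which every step appends exactly one character to each output, so the two aligned strings can never drift apart in length. It is not the repo's implementation: the names `bt`, `a`, `b`, and `INSERTION` mirror the snippet above, while `align`, `MATCH`, `DELETION`, and the callable `sub_score` are assumptions.

```python
MATCH, DELETION, INSERTION = 0, 1, 2
GAP = "_"


def align(reference, query, sub_score, gap_score):
    """Global alignment; returns (score, aligned reference, aligned query)."""
    n, m = len(reference), len(query)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    bt = [[MATCH] * (m + 1) for _ in range(n + 1)]
    # Borders: reaching (i, 0) or (0, j) is only possible through gaps.
    for i in range(1, n + 1):
        score[i][0], bt[i][0] = i * gap_score, DELETION
    for j in range(1, m + 1):
        score[0][j], bt[0][j] = j * gap_score, INSERTION
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sub_score(reference[i - 1], query[j - 1])
            score[i][j], bt[i][j] = max(
                (score[i - 1][j - 1] + sub, MATCH),
                (score[i - 1][j] + gap_score, DELETION),
                (score[i][j - 1] + gap_score, INSERTION),
            )
    # Backtrace: each branch adds one column to BOTH outputs.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = bt[i][j]
        if move == MATCH:
            a.append(reference[i - 1])
            b.append(query[j - 1])
            i -= 1
            j -= 1
        elif move == DELETION:
            a.append(reference[i - 1])
            b.append(GAP)
            i -= 1
        else:  # INSERTION: query character against a gap in the reference
            a.append(GAP)
            b.append(query[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    simple = lambda x, y: 1.0 if x == y else -1.0
    print(align("GA", "TTGA", simple, -0.2))
```

The INSERTION branch is the counterpart of the added `a += '_'` line: the query character goes into b and a gap goes into a in the same step.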


TAAGGCAACCCAAAGACTTACCGTTCTTGGAAAGGACGCAGGTGCATTAACAGAGGATCCAGATGAGGGGACAGCCGGAGTGGAATCTGCAGTACTGAGGGGGTTCCTAATTCTAGGCAAGGAGGACAAAAGATATGGACCAGCATTGAGCATCAATGAACTGAGCAATCTTGCAAAAGGGGAGAAAGCTAATGTGCTGATAGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAAACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGTCGAATTGTTTAAAAACGACCTTGTTTCTACT

reference_alignment:


query: AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG

query_alignment: GCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG
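Output like the above can be checked mechanically. This is a minimal sketch, assuming a global alignment in which every character of both inputs must appear in the aligned strings, and assuming `_` is the gap character. `check_alignment` is a hypothetical helper, not a function from the repository.

```python
def check_alignment(reference, query, ref_aligned, query_aligned, gap="_"):
    """Return a list of problems; an empty list means the alignment is consistent.

    Assumes a global alignment: removing the gap characters from each
    aligned string should give back the corresponding input unchanged.
    """
    problems = []
    if len(ref_aligned) != len(query_aligned):
        problems.append("aligned strings differ in length")
    if ref_aligned.replace(gap, "") != reference:
        problems.append("reference alignment drops or changes characters")
    if query_aligned.replace(gap, "") != query:
        problems.append("query alignment drops or changes characters")
    return problems
```

Run against the debugging output above, a check like this would flag both the empty reference_alignment and the missing leading A in query_alignment.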


John Sullivan (johnsullivan.name) Supplemental Instruction Mentor at UCR gpg --recv-keys 6A262D84


itsjohncs commented 10 years ago

Made the change; I can confirm that all unit tests still pass. It fails on the longer test, though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt -a -n 3 -vvv
[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710: {'num_results': '3', 'verbose': 3, 'score_file': None, 'no_colors': False, 'quiet': False, 'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt', 'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A virus (A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2) gene, complete cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A virus (A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin (HA) gene, complete cds' has score 39.55.
Traceback (most recent call last):
  File "./run.py", line 6, in <module>
    sys.exit(dnasearch.main.main())
  File "/home/john/Projects/141-assignment/dnasearch/main.py", line 137, in main
    gap_score)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 75, in score
    a,b = backtrace(ref,query,bt)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 94, in backtrace
    a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly, though. Could you send me a complete revised copy of similarity.py?

-John

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour notifications@github.com wrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' a += '_' j -= 1

yields the proper alignment. The score makes much more sense like that. The single added line does not alter the score, and the original test cases still produce the correct results.
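For comparison, here is a minimal sketch of a global (Needleman-Wunsch style) alignment whose backtrace appends exactly one character to both output strings on every step, which appears to be what the added `a += '_'` line is restoring. This is not the project's similarity.py: the flat match/mismatch/gap scores, the `MATCH` and `DELETION` names, and the `reference[i - 1]` indexing are all assumptions made for illustration.

```python
MATCH, DELETION, INSERTION = "diag", "up", "left"
GAP = "_"


def align(reference, query, match=1.0, mismatch=-1.0, gap=-1.0):
    """Return (score, aligned_reference, aligned_query) for a global alignment."""
    rows, cols = len(reference) + 1, len(query) + 1
    score = [[0.0] * cols for _ in range(rows)]
    bt = [[None] * cols for _ in range(rows)]

    # First column and row: only gaps are possible.
    for i in range(1, rows):
        score[i][0] = i * gap
        bt[i][0] = DELETION
    for j in range(1, cols):
        score[0][j] = j * gap
        bt[0][j] = INSERTION

    # Fill the table; row i corresponds to reference[i - 1].
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if reference[i - 1] == query[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, MATCH),
                (score[i - 1][j] + gap, DELETION),
                (score[i][j - 1] + gap, INSERTION),
            ]
            score[i][j], bt[i][j] = max(options, key=lambda o: o[0])

    # Backtrace: every step adds one character to BOTH strings.
    a, b = [], []
    i, j = len(reference), len(query)
    while i > 0 or j > 0:
        step = bt[i][j]
        if step == MATCH:
            a.append(reference[i - 1])
            b.append(query[j - 1])
            i -= 1
            j -= 1
        elif step == DELETION:
            a.append(reference[i - 1])
            b.append(GAP)
            i -= 1
        else:  # INSERTION
            a.append(GAP)
            b.append(query[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(a)), "".join(reversed(b))
```

Because the loop only reads `reference[i - 1]` while `i > 0` (and likewise for `query`), it cannot index past either end of the strings, and the two aligned strings always come out the same length.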

On Sun, Dec 8, 2013 at 2:58 PM, John Sullivan notifications@github.com wrote:

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.
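A quick way to test whether a printed alignment adds up to the reported score is to rescore the two aligned strings column by column. This is a sketch, assuming flat match, mismatch, and gap values (the assignment's actual sub_score and gap_score parameters may work differently) and `_` as the gap character. `rescore` is a hypothetical helper.

```python
def rescore(ref_aligned, query_aligned, match=1.0, mismatch=-1.0, gap=-1.0,
            gap_char="_"):
    """Recompute an alignment score directly from the two aligned strings."""
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned strings must have the same length")
    total = 0.0
    for r, q in zip(ref_aligned, query_aligned):
        if r == gap_char or q == gap_char:
            total += gap        # a gap in either string
        elif r == q:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total
```

If the value this returns for a printed alignment differs from the score the search reported, either the alignment strings or the scoring table is wrong.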

On Sun, Dec 8, 2013 at 2:51 PM, AmateurHour notifications@github.com wrote:

Passing in the arguments in reverse order would cause the initial error you showed me in the screenshot, and it would also explain why our scores were off by 0.2, because of how the initial scoring matrix is built.


itsjohncs commented 10 years ago

Any progress on this?


jordanjmeyer commented 10 years ago

Did the file I sent you also not work?

itsjohncs commented 10 years ago

I did not receive an email. Where did you send it? My email is jsull003@ucr.edu

On Sun, Dec 8, 2013 at 7:17 PM, AmateurHour notifications@github.comwrote:

Did the file I sent you also not work? On Dec 8, 2013 7:02 PM, "John Sullivan" notifications@github.com wrote:

Any progress on this?

On Sun, Dec 8, 2013 at 5:05 PM, John Sullivan jsull003@ucr.edu wrote:

Made the change, I can confirm that all unit tests still pass. Fails on the longer test though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt -a -n 3
-vvv[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710:
{'num_results':
'3', 'verbose': 3, 'score_file': None, 'no_colors': False, 'quiet':
False,
'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt',
'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A virus
(A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2) gene,
complete
cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A virus
(A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin (HA) gene,
complete
cds' has score 39.55.
Traceback (most recent call last):
File "./run.py", line 6, in <module>
sys.exit(dnasearch.main.main())
File "/home/john/Projects/141-assignment/dnasearch/main.py", line 137,
in main
gap_score)
File "/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
75, in score
a,b = backtrace(ref,query,bt)
File "/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
94, in backtrace
a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly through. Could you send me a complete revised copy of similarity.py?

-John

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour notifications@github.comwrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' a += '_' j -= 1 yields the proper alignment. The score makes much more sense like that. The single line of code different will not alter the score, and still results in the correct results for the original test cases.

On Sun, Dec 8, 2013 at 2:58 PM, John Sullivan < notifications@github.com>wrote:

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.

On Sun, Dec 8, 2013 at 2:51 PM, AmateurHour < notifications@github.com>wrote:

Passing in the arguments in reverse order would cause the initial error you showed me with the screenshot, as well as why our scores suffered by 0.2, because of how the initial scoring matrix is built.

I think I had just spent so much time figuring out all the stupid stuff that is unclear in the spec that I completely forgot to let you know how the parameters actually function. I'm sorry for all the confusion and probably frustration. On Dec 8, 2013 2:36 PM, "Jordan Meyer" jmeye006@ucr.edu wrote:

I believe that is happening because you may be using the query as the reference and the organism's sequence as the reference, when it should be the other way around. The reference is the string we are searching for (usually it will be the shorter one), and the query is the one we are questioning against the reference. The query is should be the organism's sequence, and the reference should be the string all searches are based on.

I think this was another mismatch in the spec's portrayal of query vs reference. Originally in our stub for the scoring function, the function signature was def score(query_string, other_string, sub_score, gap_score):

I had changed the signature of my implementation to match the clarification Nick made during lab last week. What I originally thought was the query is actually the reference, and what I thought was just the other string is actually the query.

If you run the tests with the inputs swapped, so that what you have labeled as the query is passed in as the reference and the organism's sequence is passed in as the second argument (the actual query parameter), the alignments should be correct.

I hope I was clear in explaining what I believe to be the problem.
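To illustrate how argument order can change a score, here is a small sketch of a scorer whose first row and first column are initialized differently. It is only an assumption that the project's scoring matrix is asymmetric in this particular way; `score_free_prefix` and its scoring values are hypothetical and are not the `score()` function discussed above.

```python
def score_free_prefix(ref, query, match=1.0, mismatch=-1.0, gap=-1.0):
    """Alignment score where skipping a prefix of the query costs nothing."""
    rows, cols = len(ref) + 1, len(query) + 1
    m = [[0.0] * cols for _ in range(rows)]
    # First column: skipping reference characters is penalized.
    for i in range(1, rows):
        m[i][0] = i * gap
    # First row is left at 0.0: a skipped query prefix is free (assumption).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            m[i][j] = max(
                m[i - 1][j - 1] + sub,
                m[i - 1][j] + gap,
                m[i][j - 1] + gap,
            )
    return m[-1][-1]


print(score_free_prefix("AC", "GGAC"))  # short search string as the reference
print(score_free_prefix("GGAC", "AC"))  # same inputs, swapped
```

With these values the first call scores 2.0 and the swapped call scores 0.0. Calling such a function with keyword arguments, for example `score_free_prefix(ref=..., query=...)`, makes an accidental swap easier to spot.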

On Sun, Dec 8, 2013 at 2:15 PM, Jordan Meyer jmeye006@ucr.edu wrote:

I think you're onto something. I will try to fix the problem as soon as I get back to a computer.

On Dec 8, 2013 2:10 PM, "John Sullivan" <notifications@github.com> wrote:

I think your code is shortening the reference sequence you return to be the same size as the query sequence, and we end up losing data. Here's some debugging output from me putzing around...

name: gi|253409428|ref|GQ227366.1| Influenza A virus (A/pika/Qinghai/BI/2007(H5N1)) segment 1 polymerase PB2 (PB2) gene, complete cds

score: 39.0

organism.sequence:

ATGGAGAGAATAAAGGAATTAAGAGATCTAATGTCACAGTCCCGCACTCGCGAGATACTAACAAAGACCACTGTGGACCATATGGCCATAATCAAGAAATACACATCAGGAAGACAAGAGAAGAACCCTGCTCTCAGAATGAAATGGATGATGGCAATGAAATATCCAATCACAGCGGACAAGAGAATAATAGAGATGATTCCTGAAAGGAATGAACAAGGACAGACACTCTGGAGCAAGACAAATGATGCTGGATCGGACAGGGTGATGGTGTCTCCCCTAGCTGTAACTTGGTGGAATAGGAATGGGCCGACGACAAGTACAGTTCATTATCCAAAGGTTTACAAAACATACTTTGAGAAGGTTGAAAGGTTAAAACATGGAACCTTCGGTCCCGTTCATTTCCGAAACCAAGTTAAAATACGCCGCCGAGTTGATACAAATCCTGGCCATGCAGATCTCAGTGCTAAAGAAGCACAAGATGTCATCATGGAGGTCGTTTTCCCAAATGAAGTGGGAGCTAGAATATTGACTTCAGAGTCACAGTTGACAATAACGAAAGAGAAAAAAGAAGAGCTCCAAGATTGTAAGATTGCTCCCTTAATGGTTGCATACATGTTGGAAAGGGAACTGGTCCGCAAAACCAGATTCCTACCAGTAGCAGGCGGAACAAGCAGTGTGTACATTGAGGTATTGCATTTGACTCAAGGAACCTGCTGGGCACAGATGTACACTCCAGGCGGAGAAGTAAGAAATGACGATGTTGACCAGAGTTTGATCATTGCTGCCAGAAACATTGTTAGGAGAGCAACGGTATCAGCGGATCCACTGGCATCACTGCTGGAGATGTGTCACAGCACACAAATTGGTGGGATAAGGATGGTGGACATCCTTAGGCAAACTCCAACTGAGGAACAAGCTGTGGATATATGCAAAGCAGCAATGGGTCTGAGGATTAGTTCATCCTTTAGCTTTGGAG

GCTTCACTTTCAAAAGAACAAGTGGATCATCCGCCACGAAGGAAGAGGAAGTGCTTACAGGCAACCTCCAAACATTGAAAATAAGAGTACATGAGGGGTATGAGGAGTTCACAATGGTTGGGCAGAGGGCAACAGCTATCCTGAGGAAAGCAACTAGAAGGCTGATTCAGTTGATAGTAAGTGGAAGAAACGAACAATCAATCGCTGAGGCAATCATTGTAGCAATGGTGTTCTCACAGGAGGATCGCATGATAAAAGCAGTCCGAGGCGATCTGAATTTCGTAAACAGAGCAAACCAAAGATTAAACCCCATGCATCAACTCCTGAGACATTTTCAAAAGGACGCAAAAGTGCTATTTCAGAATTGGGGAACTGAGCCAATTGATAATGTCATGGGGATGATCGGAATATTACCTGACATGACTCCCAGCACAGAAACGTCACTGAGAGGAGTGAGAGTTAGTAAAATGGGAGTAGATGAGTATTCCAGCACTGAGAGAGTAGTTGTAAGCATTGACCGCTTCTTAAGGGTTCGAGACCAGCGGGGGAACGTACTCTTATCTCCCGAAGAGGTCAGCGAAACCCAGGGAACAGAGAAGTTGACAATAACATATTCATCATCAATGATGTGGGAAATCAACGGTCCTGAGTCAGTGCTTGTTAACACTTACCAATGGATCATTAGAAACTGGGAGACCGTGAAAATTCAGTGGTCTCAGGACCCCACGATGTTGTACAATAAGATGGAGTTTGAACCGTTCCAATCCTTGGTACCTAAAGCTGCCAGAGGTCAATACAGTGGATTTGTGAGAACATTATTCCAACAAATGCGTGACGTACTGGGGACATTTGATACTGTCCAGATAATAAAGCTGCTACCATTTGCAGCAGCCCCACCGAAGCAGAGCAGAATGCAGTTTTCTTCTCTAACTGTGAATGTGAGAGGCTCAGGAATGAGAATACTCATAAGGGGCAATTCCCCTGTGTTCAACTACAA

TAAGGCAACCCAAAGACTTACCGTTCTTGGAAAGGACGCAGGTGCATTAACAGAGGATCCAGATGAGGGGACAGCCGGAGTGGAATCTGCAGTACTGAGGGGGTTCCTAATTCTAGGCAAGGAGGACAAAAGATATGGACCAGCATTGAGCATCAATGAACTGAGCAATCTTGCAAAAGGGGAGAAAGCTAATGTGCTGATAGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAAACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGTCGAATTGTTTAAAAACGACCTTGTTTCTACT

reference_alignment:


query: AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG

query_alignment: GCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG
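A quick sanity check for output like the above is to confirm that the two alignment strings have the same length and that removing the gaps gives back the original sequences. This sketch assumes a global alignment (every character of both inputs should appear) and `_` as the gap character; `alignment_problems` is a hypothetical helper, not part of the project.

```python
GAP = "_"


def alignment_problems(ref, query, ref_alignment, query_alignment):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(ref_alignment) != len(query_alignment):
        problems.append("alignment strings differ in length")
    if ref_alignment.replace(GAP, "") != ref:
        problems.append("reference alignment does not spell out the reference")
    if query_alignment.replace(GAP, "") != query:
        problems.append("query alignment does not spell out the query")
    return problems
```

Run against the debugging output above, an empty reference alignment and a query alignment missing its first character would both be reported.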


John Sullivan (johnsullivan.name) Supplemental Instruction Mentor at UCR gpg --recv-keys 6A262D84


jordanjmeyer commented 10 years ago

I sent it in this string of emails we have going. If you cannot find it, I will send it again as soon as I get home. Should be about 10-15 minutes.

On Dec 8, 2013 7:20 PM, "John Sullivan" notifications@github.com wrote:

I did not receive an email. Where did you send it? My email is jsull003@ucr.edu

On Sun, Dec 8, 2013 at 7:17 PM, AmateurHour notifications@github.com wrote:

Did the file I sent you also not work?

On Dec 8, 2013 7:02 PM, "John Sullivan" notifications@github.com wrote:

Any progress on this?

On Sun, Dec 8, 2013 at 5:05 PM, John Sullivan jsull003@ucr.edu wrote:

Made the change, I can confirm that all unit tests still pass. Fails on the longer test though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt -a -n 3 -vvv
[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710: {'num_results': '3', 'verbose': 3, 'score_file': None, 'no_colors': False, 'quiet': False, 'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt', 'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A virus (A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2) gene, complete cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A virus (A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin (HA) gene, complete cds' has score 39.55.
Traceback (most recent call last):
  File "./run.py", line 6, in <module>
    sys.exit(dnasearch.main.main())
  File "/home/john/Projects/141-assignment/dnasearch/main.py", line 137, in main
    gap_score)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 75, in score
    a,b = backtrace(ref,query,bt)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 94, in backtrace
    a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly, though. Could you send me a complete revised copy of similarity.py?

-John

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour <notifications@github.com> wrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' a += '_' j -= 1

yields the proper alignment. The score makes much more sense like that. The one-line difference does not alter the score, and it still produces the correct results for the original test cases.


itsjohncs commented 10 years ago

I see, the emails are going through the GitHub issue tracker. We're not actually communicating directly but rather placing messages into a forum of sorts. The GitHub issue tracker does not support general file attachments, so you'll have to send it to me directly.


jordanjmeyer commented 10 years ago

Will do. Expect it in 10 minutes. On Dec 8, 2013 7:25 PM, "John Sullivan" notifications@github.com wrote:

I see, the emails are going through the GitHub issue tracker. We're not actually communicating directly but rather placing messages into a forum of sorts. The GitHub issue tracker does not support general file attachments, you'll have to send it to me directly.

On Sun, Dec 8, 2013 at 7:23 PM, AmateurHour notifications@github.comwrote:

I sent it in this string of emails we have going. If you cannot find it, I will send it again as soon as I get home. Should be about 10-15 minutes On Dec 8, 2013 7:20 PM, "John Sullivan" notifications@github.com wrote:

I did not receive an email. Where did you send it? My email is jsull003@ucr.edu

On Sun, Dec 8, 2013 at 7:17 PM, AmateurHour notifications@github.comwrote:

Did the file I sent you also not work? On Dec 8, 2013 7:02 PM, "John Sullivan" notifications@github.com wrote:

Any progress on this?

On Sun, Dec 8, 2013 at 5:05 PM, John Sullivan jsull003@ucr.edu wrote:

Made the change, I can confirm that all unit tests still pass. Fails on the longer test though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt -a -n
3
-vvv[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710:
{'num_results':
'3', 'verbose': 3, 'score_file': None, 'no_colors': False,
'quiet':
False,
'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt',
'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A virus
(A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2) gene,
complete
cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A virus
(A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin (HA)
gene,
complete
cds' has score 39.55.
Traceback (most recent call last):
File "./run.py", line 6, in <module>
sys.exit(dnasearch.main.main())
File "/home/john/Projects/141-assignment/dnasearch/main.py",
line
137,
in main
gap_score)
File
"/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
75, in score
a,b = backtrace(ref,query,bt)
File
"/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
94, in backtrace
a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly through. Could you send me a complete revised copy of similarity.py?

-John

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour < notifications@github.com>wrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' a += '_' j -= 1 yields the proper alignment. The score makes much more sense like that. The single line of code different will not alter the score, and still results in the correct results for the original test cases.

On Sun, Dec 8, 2013 at 2:58 PM, John Sullivan < notifications@github.com>wrote:

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.

On Sun, Dec 8, 2013 at 2:51 PM, AmateurHour < notifications@github.com>wrote:

Passing in the arguments in reverse order would cause the initial error you showed me with the screenshot, as well as why our scores suffered by 0.2, because of how the initial scoring matrix is built.

I think I had just spent so much time figuring out all the stupid stuff that is unclear in the spec that I completely forgot to let you know how the parameters actually function. I'm sorry for all the confusion and probably frustration. On Dec 8, 2013 2:36 PM, "Jordan Meyer" jmeye006@ucr.edu wrote:

I believe that is happening because you may be using the query as the reference and the organism's sequence as the reference, when it should be the other way around. The reference is the string we are searching for (usually it will be the shorter one), and the query is the one we are questioning against the reference. The query is should be the organism's sequence, and the reference should be the string all searches are based on.

I think this was another mismatch in the spec's portrayal of query vs reference. Originally in our stub for the scoring function, the function signature was def score(query_string, other_string, sub_score, gap_score):

I had changed the signature of my implementation to match the clarification Nick made during lab last week. What I originally thought was the query is actually the reference, and what I thought was just the other string is actually the query.

If you run the tests with the inputs swapped so that what you have labeled as the query is passed in as the reference, and the organism's sequence is passed in as the second argument(as the actual query parameter), the alignments should be correct.

I hope I was clear in explaining what I believe to be the problem.

On Sun, Dec 8, 2013 at 2:15 PM, Jordan Meyer < jmeye006@ucr.edu>

wrote:

I think you're onto something. I will try to fix the problem as soon as I get back to a computer. On Dec 8, 2013 2:10 PM, "John Sullivan" < notifications@github.com>

wrote:

I think your code is shortening the reference sequence you return to be the same size as the query sequence, and we end up losing data. Here's some debugging output from me putzing around...

name: gi|253409428|ref|GQ227366.1| Influenza A virus (A/pika/Qinghai/BI/2007(H5N1)) segment 1 polymerase PB2 (PB2) gene, complete cds score: 39.0


itsjohncs commented 10 years ago

The algorithm still doesn't seem to be giving the right results. The smaller sequence is padded heavily with gaps on the left, and although few symbols match, the score seems fairly high. It is no longer erroring, though.


jordanjmeyer commented 10 years ago

Nick said there would be a lot of padding because leading insertions don't count towards the final score. So the algorithm pads up to the point where the strings reach their highest similarity.

I'm not sure what else could be modified in the algorithm. The characters should line up so that most of them are perfect matches, with gaps filling in the rest, because gaps have a decreasing cost per additional space and so cost much less than substitutions (mismatches).
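For illustration, here is a minimal sketch of that scoring idea. It is not the code in similarity.py. The gap symbol `_` comes from the backtrace snippet in this thread; the function names and the penalty values (`gap_open`, `gap_extend`, `sub_penalty`) are assumptions, since the spec's actual numbers aren't quoted here. It computes only the penalty side of an alignment, with leading gap padding treated as free and each gap run charged less per additional space.

```python
GAP = "_"  # gap symbol, as used in the backtrace snippet in this thread


def gap_run_penalty(length, gap_open=1.0, gap_extend=0.25):
    """Penalty for a run of `length` consecutive gaps.

    Hypothetical affine scheme: the first gap costs `gap_open`, each
    additional gap in the same run costs only `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def alignment_penalty(ref_aligned, query_aligned, sub_penalty=1.0):
    """Total penalty for two aligned strings of equal length.

    Leading columns that contain a gap are skipped, mirroring the rule
    that leading insertions don't count towards the final score.
    Simplification: any column containing a gap extends the current run,
    regardless of which string the gap is in.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned strings must have equal length")

    columns = list(zip(ref_aligned, query_aligned))

    # Skip the free leading padding.
    start = 0
    while start < len(columns) and GAP in columns[start]:
        start += 1

    total = 0.0
    run = 0  # length of the gap run currently being read
    for r, q in columns[start:]:
        if r == GAP or q == GAP:
            run += 1
            continue
        total += gap_run_penalty(run)  # close any run that just ended
        run = 0
        if r != q:
            total += sub_penalty
    total += gap_run_penalty(run)  # close a run at the very end
    return total
```

With these assumed values a run of four gaps costs 1.75, while four substitutions cost 4.0, which is why an optimal alignment would tend to prefer one long gap over several mismatches.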


itsjohncs commented 10 years ago

I counted up manually and got 15 matches in total, but the score is listed as >40. I think you might be counting two gaps lining up as a match. Is a gap aligned with a gap supposed to count as one? I noticed the output function was mistakenly marking those columns green, so I pushed a fix for that.
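To make the manual count repeatable, something like the following could tally the columns of an alignment. This is only a sketch: `count_columns` is a hypothetical helper, not a function in the repository, and it assumes the aligned strings use `_` for gaps as in the backtrace snippet. A column where both strings have a gap is counted as a gap column, never as a match.

```python
GAP = "_"  # gap symbol, as used in the backtrace snippet in this thread


def count_columns(ref_aligned, query_aligned):
    """Return (matches, mismatches, gap_columns) for two aligned strings."""
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned strings must have equal length")

    matches = mismatches = gap_columns = 0
    for r, q in zip(ref_aligned, query_aligned):
        if r == GAP or q == GAP:
            # Covers gap-vs-symbol and gap-vs-gap columns alike.
            gap_columns += 1
        elif r == q:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gap_columns
```

A plain `r == q` comparison would count the gap-vs-gap column in `count_columns("AC_T", "AG_T")` as a third match; the check above reports two matches, one mismatch, and one gap column instead.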

On Sun, Dec 8, 2013 at 8:05 PM, AmateurHour notifications@github.comwrote:

Nick said there would be a lot of padding because leading insertions don't count towards the final score. So the algorithm pads up to the point where the strings begin the highest similarity.

I'm not sure what else could be modified with the algorithm. But the characters should match up correctly so that most of them are perfect matches, and then gaps fill in, because gaps have decreasing cost per additional space, so they cost much less than substitutions (mismatches). On Dec 8, 2013 7:47 PM, "John Sullivan" notifications@github.com wrote:

The algorithm still doesn't seem to be giving the right results. The smaller sequence is padding with gaps to the left a lot and despite the not a lot of symbols matching the score seems to be fairly high. It is no longer erroring though.

On Sun, Dec 8, 2013 at 7:28 PM, AmateurHour notifications@github.comwrote:

Will do. Expect it in 10 minutes. On Dec 8, 2013 7:25 PM, "John Sullivan" notifications@github.com wrote:

I see, the emails are going through the GitHub issue tracker. We're not actually communicating directly but rather placing messages into a forum of sorts. The GitHub issue tracker does not support general file attachments, you'll have to send it to me directly.

On Sun, Dec 8, 2013 at 7:23 PM, AmateurHour < notifications@github.com>wrote:

I sent it in this string of emails we have going. If you cannot find it, I will send it again as soon as I get home. Should be about 10-15 minutes On Dec 8, 2013 7:20 PM, "John Sullivan" notifications@github.com

wrote:

I did not receive an email. Where did you send it? My email is jsull003@ucr.edu

On Sun, Dec 8, 2013 at 7:17 PM, AmateurHour < notifications@github.com>wrote:

Did the file I sent you also not work? On Dec 8, 2013 7:02 PM, "John Sullivan" < notifications@github.com>

wrote:

Any progress on this?

On Sun, Dec 8, 2013 at 5:05 PM, John Sullivan < jsull003@ucr.edu>

wrote:

Made the change, I can confirm that all unit tests still pass. Fails on the longer test though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt
-a
-n
3
-vvv[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710:
{'num_results':
'3', 'verbose': 3, 'score_file': None, 'no_colors': False,
'quiet':
False,
'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt',
'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A
virus
(A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2)
gene,
complete
cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A
virus
(A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin
(HA)
gene,
complete
cds' has score 39.55.
Traceback (most recent call last):
File "./run.py", line 6, in <module>
sys.exit(dnasearch.main.main())
File
"/home/john/Projects/141-assignment/dnasearch/main.py",
line
137,
in main
gap_score)
File
"/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
75, in score
a,b = backtrace(ref,query,bt)
File
"/home/john/Projects/141-assignment/dnasearch/similarity.py",
line
94, in backtrace
a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly through. Could you send me a complete revised copy of similarity.py?

-John

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour < notifications@github.com>wrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '' or len(b) == 0: b += query[j] b += '' a += '_' j -= 1 yields the proper alignment. The score makes much more sense like that. The single line of code different will not alter the score, and still results in the correct results for the original test cases.

On Sun, Dec 8, 2013 at 2:58 PM, John Sullivan < notifications@github.com>wrote:

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.

On Sun, Dec 8, 2013 at 2:51 PM, AmateurHour < notifications@github.com>wrote:

Passing in the arguments in reverse order would cause the initial error you showed me with the screenshot, as well as why our scores suffered by 0.2, because of how the initial scoring matrix is built.

I think I had just spent so much time figuring out all the stupid stuff that is unclear in the spec that I completely forgot to let you know how the parameters actually function. I'm sorry for all the confusion and probably frustration. On Dec 8, 2013 2:36 PM, "Jordan Meyer" < jmeye006@ucr.edu>

wrote:

I believe that is happening because you may be using the query as the reference and the organism's sequence as the reference, when it should be the other way around. The reference is the string we are searching for (usually it will be the shorter one), and the query is the one we are questioning against the reference. The query is should be the organism's sequence, and the reference should be the string all searches are based on.

I think this was another mismatch in the spec's portrayal of query vs reference. Originally in our stub for the scoring function, the function signature was def score(query_string, other_string, sub_score, gap_score):

I had changed the signature of my implementation to match the clarification Nick made during lab last week. What I originally thought was the query is actually the reference, and what I thought was just the other string is actually the query.

If you run the tests with the inputs swapped so that what you have labeled as the query is passed in as the reference, and the organism's sequence is passed in as the second argument(as the actual query parameter), the alignments should be correct.

I hope I was clear in explaining what I believe to be the problem.

On Sun, Dec 8, 2013 at 2:15 PM, Jordan Meyer < jmeye006@ucr.edu>

wrote:

I think you're onto something. I will try to fix the problem as soon as I get back to a computer. On Dec 8, 2013 2:10 PM, "John Sullivan" < notifications@github.com>

wrote:

I think your code is shortening the reference sequence you return to be the same size as the query sequence, and we end up losing data. Here's some debugging output from me putzing around...

name: gi|253409428|ref|GQ227366.1| Influenza A virus (A/pika/Qinghai/BI/2007(H5N1)) segment 1 polymerase PB2 (PB2) gene, complete cds score: 39.0

organism.sequence:

ATGGAGAGAATAAAGGAATTAAGAGATCTAATGTCACAGTCCCGCACTCGCGAGATACTAACAAAGACCACTGTGGACCATATGGCCATAATCAAGAAATACACATCAGGAAGACAAGAGAAGAACCCTGCTCTCAGAATGAAATGGATGATGGCAATGAAATATCCAATCACAGCGGACAAGAGAATAATAGAGATGATTCCTGAAAGGAATGAACAAGGACAGACACTCTGGAGCAAGACAAATGATGCTGGATCGGACAGGGTGATGGTGTCTCCCCTAGCTGTAACTTGGTGGAATAGGAATGGGCCGACGACAAGTACAGTTCATTATCCAAAGGTTTACAAAACATACTTTGAGAAGGTTGAAAGGTTAAAACATGGAACCTTCGGTCCCGTTCATTTCCGAAACCAAGTTAAAATACGCCGCCGAGTTGATACAAATCCTGGCCATGCAGATCTCAGTGCTAAAGAAGCACAAGATGTCATCATGGAGGTCGTTTTCCCAAATGAAGTGGGAGCTAGAATATTGACTTCAGAGTCACAGTTGACAATAACGAAAGAGAAAAAAGAAGAGCTCCAAGATTGTAAGATTGCTCCCTTAATGGTTGCATACATGTTGGAAAGGGAACTGGTCCGCAAAACCAGATTCCTACCAGTAGCAGGCGGAACAAGCAGTGTGTACATTGAGGTATTGCATTTGACTCAAGGAACCTGCTGGGCACAGATGTACACTCCAGGCGGAGAAGTAAGAAATGACGATGTTGACCAGAGTTTGATCATTGCTGCCAGAAACATTGTTAGGAGAGCAACGGTATCAGCGGATCCACTGGCATCACTGCTGGAGATGTGTCACAGCACACAAATTGGTGGGATAAGGATGGTGGACATCCTTAGGCAAACTCCAACTGAGGAACAAGCTGTGGATATATGCAAAGCAGCAATGGGTCTGAGGATTAGTTCATCCTTTAGCTTTGGAG

GCTTCACTTTCAAAAGAACAAGTGGATCATCCGCCACGAAGGAAGAGGAAGTGCTTACAGGCAACCTCCAAACATTGAAAATAAGAGTACATGAGGGGTATGAGGAGTTCACAATGGTTGGGCAGAGGGCAACAGCTATCCTGAGGAAAGCAACTAGAAGGCTGATTCAGTTGATAGTAAGTGGAAGAAACGAACAATCAATCGCTGAGGCAATCATTGTAGCAATGGTGTTCTCACAGGAGGATCGCATGATAAAAGCAGTCCGAGGCGATCTGAATTTCGTAAACAGAGCAAACCAAAGATTAAACCCCATGCATCAACTCCTGAGACATTTTCAAAAGGACGCAAAAGTGCTATTTCAGAATTGGGGAACTGAGCCAATTGATAATGTCATGGGGATGATCGGAATATTACCTGACATGACTCCCAGCACAGAAACGTCACTGAGAGGAGTGAGAGTTAGTAAAATGGGAGTAGATGAGTATTCCAGCACTGAGAGAGTAGTTGTAAGCATTGACCGCTTCTTAAGGGTTCGAGACCAGCGGGGGAACGTACTCTTATCTCCCGAAGAGGTCAGCGAAACCCAGGGAACAGAGAAGTTGACAATAACATATTCATCATCAATGATGTGGGAAATCAACGGTCCTGAGTCAGTGCTTGTTAACACTTACCAATGGATCATTAGAAACTGGGAGACCGTGAAAATTCAGTGGTCTCAGGACCCCACGATGTTGTACAATAAGATGGAGTTTGAACCGTTCCAATCCTTGGTACCTAAAGCTGCCAGAGGTCAATACAGTGGATTTGTGAGAACATTATTCCAACAAATGCGTGACGTACTGGGGACATTTGATACTGTCCAGATAATAAAGCTGCTACCATTTGCAGCAGCCCCACCGAAGCAGAGCAGAATGCAGTTTTCTTCTCTAACTGTGAATGTGAGAGGCTCAGGAATGAGAATACTCATAAGGGGCAATTCCCCTGTGTTCAACTACAA

TAAGGCAACCCAAAGACTTACCGTTCTTGGAAAGGACGCAGGTGCATTAACAGAGGATCCAGATGAGGGGACAGCCGGAGTGGAATCTGCAGTACTGAGGGGGTTCCTAATTCTAGGCAAGGAGGACAAAAGATATGGACCAGCATTGAGCATCAATGAACTGAGCAATCTTGCAAAAGGGGAGAAAGCTAATGTGCTGATAGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAAACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGTCGAATTGTTTAAAAACGACCTTGTTTCTACT

reference_alignment:


query: AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG

query_alignment: GCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAATTAAG


John Sullivan (johnsullivan.name) Supplemental Instruction Mentor at UCR gpg --recv-keys 6A262D84


jordanjmeyer commented 10 years ago

Alright, the problem is definitely in the backtracing function then.

Luckily, both Nick and Izbicki said the dominant portion of grading would be scoring.

I am trying to find the problem, but I am unsure if I will be able to fix it.

On Sun, Dec 8, 2013 at 8:12 PM, John Sullivan notifications@github.com wrote:

I counted up manually and got 15 matches in total, but the score is listed as >40. I think you might be counting two gaps lining up as a match. Are two gaps lining up supposed to count as such? I noticed the output function mistakenly marked them green, so I pushed a fix for that.
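The manual count described above can be automated. Here is a minimal sketch, assuming the two alignment strings have equal length and use `_` as the gap character (as in the `backtrace` snippets in this thread). The function name `count_columns` is made up for illustration and is not part of the repository. It tallies columns where both strings have a gap separately, so they can't be mistaken for matches.

```python
def count_columns(ref_alignment, query_alignment, gap="_"):
    """Tally the column types of two equal-length aligned strings."""
    if len(ref_alignment) != len(query_alignment):
        raise ValueError("aligned strings must have the same length")

    counts = {"match": 0, "mismatch": 0, "gap": 0, "double_gap": 0}
    for r, q in zip(ref_alignment, query_alignment):
        if r == gap and q == gap:
            # Two gaps facing each other carry no information; never a match.
            counts["double_gap"] += 1
        elif r == gap or q == gap:
            counts["gap"] += 1
        elif r == q:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


if __name__ == "__main__":
    print(count_columns("AC__GT", "A___GA"))
```

If the `double_gap` count is nonzero for an alignment, that would be consistent with the backtrace emitting a gap into both strings on the same step.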

On Sun, Dec 8, 2013 at 8:05 PM, AmateurHour notifications@github.com wrote:

Nick said there would be a lot of padding because leading insertions don't count towards the final score. So the algorithm pads up to the point where the strings begin the highest similarity.
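One way to make leading padding free is to initialise the first row of the scoring matrix to zero. The sketch below shows only that idea and is not the assignment's actual scoring scheme: the function name `semi_global_score`, the `match`/`mismatch`/`gap` values, and the flat per-space gap cost are all assumptions chosen for illustration.

```python
def semi_global_score(short, long, match=1.0, mismatch=-1.0, gap=-1.0):
    """Score `short` against `long`, charging nothing for skipping a prefix of `long`.

    The scoring values are hypothetical placeholders.
    """
    rows, cols = len(short) + 1, len(long) + 1
    # dp[i][j] is the best score aligning short[:i] with long[:j].
    dp = [[0.0] * cols for _ in range(rows)]

    # Row 0 stays at zero: padding `short` on the left is free.
    # Column 0 is still charged: gaps in `long` always cost something.
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = short[i - 1] == long[j - 1]
            diag = dp[i - 1][j - 1] + (match if same else mismatch)
            up = dp[i - 1][j] + gap      # gap in `long`
            left = dp[i][j - 1] + gap    # gap in `short`
            dp[i][j] = max(diag, up, left)

    return dp[rows - 1][cols - 1]


if __name__ == "__main__":
    print(semi_global_score("GAT", "CCCGAT"))
```

With these placeholder values, `GAT` against `CCCGAT` scores 3.0, because the three leading pad columns cost nothing. A fully global alignment would charge for them.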

I'm not sure what else could be modified in the algorithm. But the characters should line up so that most of them are perfect matches, with gaps filling in the rest, because gaps have a decreasing cost per additional space and so cost much less than substitutions (mismatches).

On Dec 8, 2013 7:47 PM, "John Sullivan" notifications@github.com wrote:

The algorithm still doesn't seem to be giving the right results. The smaller sequence is being padded with a lot of gaps on the left, and despite few symbols matching, the score seems fairly high. It is no longer erroring, though.

On Sun, Dec 8, 2013 at 7:28 PM, AmateurHour notifications@github.com wrote:

Will do. Expect it in 10 minutes.

On Dec 8, 2013 7:25 PM, "John Sullivan" notifications@github.com wrote:

I see, the emails are going through the GitHub issue tracker. We're not actually communicating directly but rather placing messages into a forum of sorts. The GitHub issue tracker does not support general file attachments, you'll have to send it to me directly.

On Sun, Dec 8, 2013 at 7:23 PM, AmateurHour <notifications@github.com> wrote:

I sent it in this string of emails we have going. If you cannot find it, I will send it again as soon as I get home. Should be about 10-15 minutes.

On Dec 8, 2013 7:20 PM, "John Sullivan" <notifications@github.com> wrote:

I did not receive an email. Where did you send it? My email is jsull003@ucr.edu

On Sun, Dec 8, 2013 at 7:17 PM, AmateurHour <notifications@github.com> wrote:

Did the file I sent you also not work?

On Dec 8, 2013 7:02 PM, "John Sullivan" <notifications@github.com> wrote:

Any progress on this?

On Sun, Dec 8, 2013 at 5:05 PM, John Sullivan <jsull003@ucr.edu> wrote:

Made the change, I can confirm that all unit tests still pass. Fails on the longer test though.

~/Projects/141-assignment - master(+3/-2)*
: ± ./run.py data/our_database.txt data/our_tiny_query.txt -a -n 3 -vvv
[DEBUG] Logging initialized.
[DEBUG] Contents of options: <Values at 0x7f314c620710: {'num_results': '3', 'verbose': 3, 'score_file': None, 'no_colors': False, 'quiet': False, 'print_alignment': True}>
[DEBUG] Contents of args: ['data/our_database.txt', 'data/our_tiny_query.txt']
[DEBUG] Organism 'gi|392583980|ref|CY116651.1| Influenza A virus (A/ferret/Indonesia/5-F1/2005(H5N1)) polymerase PB2 (PB2) gene, complete cds' has score 39.2.
[DEBUG] Organism 'gi|291280817|ref|HM012479.1| Influenza A virus (A/cheetah/CA/30954/2009(H1N1)) segment 4 hemagglutinin (HA) gene, complete cds' has score 39.55.
Traceback (most recent call last):
  File "./run.py", line 6, in <module>
    sys.exit(dnasearch.main.main())
  File "/home/john/Projects/141-assignment/dnasearch/main.py", line 137, in main
    gap_score)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 75, in score
    a,b = backtrace(ref,query,bt)
  File "/home/john/Projects/141-assignment/dnasearch/similarity.py", line 94, in backtrace
    a += ref[i+1]
IndexError: string index out of range

I could have just swapped out the code incorrectly though. Could you send me a complete revised copy of similarity.py?

-John
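For reference, here is a self-contained sketch of a backtrace that cannot index past the end of either string. It is not the code in similarity.py. The direction codes `MATCH` and `DELETION`, the helper `align`, and the flat scoring values are assumptions for illustration; only the names `backtrace`, `ref`, `query`, `bt`, `a`, `b`, and `INSERTION` come from this thread. It relies on cell `(i, j)` of a `(len(ref)+1) x (len(query)+1)` matrix describing `ref[i-1]` and `query[j-1]`, so the largest index ever read is `len(ref) - 1`.

```python
MATCH, DELETION, INSERTION = 0, 1, 2  # hypothetical direction codes


def align(ref, query, match=1, mismatch=-1, gap=-1):
    """Fill a global-alignment score matrix and a direction matrix."""
    rows, cols = len(ref) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]
    bt = [[MATCH] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0], bt[i][0] = i * gap, DELETION
    for j in range(1, cols):
        score[0][j], bt[0][j] = j * gap, INSERTION
    for i in range(1, rows):
        for j in range(1, cols):
            same = ref[i - 1] == query[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            bt[i][j] = MATCH if best == diag else DELETION if best == up else INSERTION
    return score[-1][-1], bt


def backtrace(ref, query, bt, gap_char="_"):
    """Walk bt from the bottom-right corner back to (0, 0)."""
    a, b = [], []
    i, j = len(ref), len(query)
    while i > 0 or j > 0:
        step = bt[i][j]
        if step == MATCH:          # consume one character from each string
            a.append(ref[i - 1])
            b.append(query[j - 1])
            i, j = i - 1, j - 1
        elif step == DELETION:     # ref character faces a gap in query
            a.append(ref[i - 1])
            b.append(gap_char)
            i -= 1
        else:                      # INSERTION: query character faces a gap in ref
            a.append(gap_char)
            b.append(query[j - 1])
            j -= 1
    # Both lists were built back to front.
    return "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    total, bt = align("GATTACA", "GATACA")
    print(total, *backtrace("GATTACA", "GATACA", bt), sep="\n")
```

Every branch appends exactly one character to both `a` and `b`, so the two alignment strings always come out the same length.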

On Sun, Dec 8, 2013 at 5:00 PM, AmateurHour <notifications@github.com> wrote:

Alright, I think changing this part of backtrace:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' j -= 1

to this:

elif bt[i][j] == INSERTION: if len(b) > 0 and b[len(b)-1] != '_' or len(b) == 0: b += query[j] b += '_' a += '_' j -= 1

yields the proper alignment. The score makes much more sense that way. The one added line does not alter the score, and the original test cases still produce the correct results.

On Sun, Dec 8, 2013 at 2:58 PM, John Sullivan <notifications@github.com> wrote:

No problem. Looks like there might be another issue though. I don't think we're getting the ideal alignment, and I'm still not sure the score is being calculated correctly.

It looks like the smaller string is always padded with gaps on the left to excess. And I'm looking at one of the results right now and it definitely wouldn't add up to the score that we're getting. Try running the tiny query string against gi|544618446|ref|GU131178.2| Panthera leo isolate 5c cytochrome b gene, complete cds. I can send a screenshot if necessary.

On Sun, Dec 8, 2013 at 2:51 PM, AmateurHour <notifications@github.com> wrote:

Passing in the arguments in reverse order would cause the initial error you showed me in the screenshot. It would also explain why our scores were off by 0.2, because of how the initial scoring matrix is built.

I think I had just spent so much time figuring out all the stupid stuff that is unclear in the spec that I completely forgot to let you know how the parameters actually function. I'm sorry for all the confusion and probably frustration.

On Dec 8, 2013 2:36 PM, "Jordan Meyer" <jmeye006@ucr.edu> wrote:

I believe that is happening because you may be using the query as the reference and the organism's sequence as the query, when it should be the other way around. The reference is the string we are searching for (usually it will be the shorter one), and the query is the one we are questioning against the reference. The query should be the organism's sequence, and the reference should be the string all searches are based on.

I think this was another mismatch in the spec's portrayal of query vs reference. Originally in our stub for the scoring function, the function signature was def score(query_string, other_string, sub_score, gap_score):

I had changed the signature of my implementation to match the clarification Nick made during lab last week. What I originally thought was the query is actually the reference, and what I thought was just the other string is actually the query.

If you run the tests with the inputs swapped, so that what you have labeled as the query is passed in as the reference and the organism's sequence is passed in as the second argument (the actual query parameter), the alignments should be correct.

I hope I was clear in explaining what I believe to be the problem.
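One way to keep this kind of swap from recurring is to make the two sequences keyword-only at the call site. This is a sketch under assumptions: `call_score` and `fake_score` are hypothetical helpers, and the positional order `(ref, query, sub_score, gap_score)` is inferred from the description above rather than confirmed against similarity.py.

```python
def call_score(score_fn, *, ref, query, sub_score, gap_score):
    """Forward to score_fn with the sequences bound by name, not position.

    Assumes score_fn takes (ref, query, sub_score, gap_score) positionally.
    """
    return score_fn(ref, query, sub_score, gap_score)


def fake_score(ref, query, sub_score, gap_score):
    """Stand-in for the real scoring function; just reports what it received."""
    return {"ref": ref, "query": query}


if __name__ == "__main__":
    search_string = "AGCGAAAGCAGG"          # the string every search is based on
    organism_sequence = "ATGGAGAGAATAAAGG"  # one database entry
    # The keyword order at the call site no longer matters.
    result = call_score(
        fake_score,
        query=organism_sequence,
        ref=search_string,
        sub_score={},     # placeholder value
        gap_score=-1.0,   # placeholder value
    )
    print(result)
```

A caller who tries `call_score(fake_score, search_string, organism_sequence, ...)` gets a `TypeError` immediately instead of a quietly wrong alignment.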
