paul-shannon / slexil

Software Linking Elan Xml to Illuminated Language
MIT License

test_lokono failure #15

Closed davidjamesbeck closed 5 years ago

davidjamesbeck commented 5 years ago

running make from slexil/tests

--- test_lokono_line_3
Traceback (most recent call last):
  File "test_IjalLine.py", line 402, in <module>
    runTests()
  File "test_IjalLine.py", line 16, in runTests
    test_lokono_line_3() # each morpheme and gloss are separate xml tier elements
  File "test_IjalLine.py", line 84, in test_lokono_line_3
    assert(x3.getTranslation() == "‘[a] child, a woman as well.'")
AssertionError
make: *** [ijalLine] Error 1

paul-shannon commented 5 years ago

Thanks, @davidjamesbeck. I think that my life-long neglect - indeed, avoidance - of curly quotes now must end! I inserted pdb.set_trace() just before the error in test_IjalLine.py. Maybe you know this technique? It pauses execution, drops you into the Python debugger ("pdb"), and gives you a prompt from which you can explore things in their actual run-time context.

Here is what I see:

x3.getTranslation()
"‘[a] child, a woman as well.'’"

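testData/lokono/LOKONO_IJAL_2.eaf has this: [screenshot] https://user-images.githubusercontent.com/2480712/54754698-59356480-4ba1-11e9-99d6-b6c168ea852d.png

which seems to open with an opening curly single quote and close with a straight quote. What should our strategy be? Maybe something like this?

  • quoted speech, at the end of our processing, should always use opening and closing curly quotes
  • any combination of paired curly or straight single quotes in the eaf input is accepted

However, I am confused by the use of quotation marks - of any sort - in the speech and translation lines. In the lokono eaf, they appear to be present everywhere, thereby - it seems to me - adding no information.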

Can you formulate a policy? Set me straight? I'll be glad to implement what you propose.
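
For readers who have not used the technique described above, here is a minimal sketch of pausing in pdb only when a comparison fails. The helper name `check_translation` and its `debug` flag are hypothetical, chosen for illustration; they are not part of slexil.

```python
import pdb


def check_translation(actual, expected, debug=False):
    """Compare two strings; optionally pause in pdb when they differ.

    Hypothetical helper for illustration only, not part of slexil.
    """
    if actual != expected and debug:
        # Execution pauses on the next line. At the (Pdb) prompt,
        # `p actual` and `p expected` print the values and `c`
        # continues the run.
        pdb.set_trace()
    return actual == expected
```

With `debug=False` (the default) the function never pauses, so it is safe to leave in an automated test run. From Python 3.7 on, the built-in `breakpoint()` calls `pdb.set_trace()` by default and can be used in its place.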

davidjamesbeck commented 5 years ago

Hi, Paul

I have an IDE that I can use to set traces; I didn’t know (though I am not surprised) that you could do it programmatically as well. Nice to find out.

So, the basic rule for quotes in interlinearization:

1) all free translations begin with ‘ and end with ’

Seems simple, but humans being what they are, it doesn’t always work that way, so we need to catch the following errors:

2) extraneous whitespace on either end of the gloss, or following ‘/preceding ’
3) lack of quotes on the translation at all
4) inconsistent use (people forget them)
5) use of straight apostrophes instead of quotes

Of these, only the last is hard to deal with because, in some rare cases, the straight apostrophe could be a symbol from a practical orthography, and there could be a line that includes a non-English word (e.g., a name) in final position in a text that doesn’t use the single quotes. So automatically replacing straight apostrophes on either end of the gloss with single quotes has a marginal probability of introducing errors. I’m inclined not to worry about this: the chances are low that all three conditions will co-occur (1. the practical orthography uses the straight apostrophe ', 2. a word ending in ' is in final position of the translation, 3. the author isn’t using single quotes in translations in the first place), and the introduced error is both minor and something the author should catch when proofreading the final HTML. Lokono fails because of #5, which is ultimately the author’s error.

Things might be more complicated in cases of direct speech. The usual practice is to use double quotes (in addition to the single quotes surrounding the whole translation), and many authors use the straight double quote ("). Ideally, we would replace each straight double quote with “ at the start of the quoted speech and ” at the end, but that becomes a tad complicated because:

1) we need to distinguish the environments for ‘ from ’ (not too hard with a regex)
2) we need to ensure that there is a space between a double quote and an adjacent single quote (ideally a thin space, but I don’t know if that exists in HTML)
3) we need to make sure punctuation is used correctly around the quotes (as far as this is possible programmatically)

I would be in favour of leaving most of that to the authors, though 1 and 2 might be relatively painless, depending on where and how it could be implemented.

David
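
The policy in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the TranslationLine class from the repository: the function name `normalize_translation` is hypothetical, only one outer quote is removed per side, and a straight double quote is treated as opening when it starts the line or follows whitespace or an opening bracket. As noted above, a trailing straight apostrophe that belongs to a practical orthography would be replaced too.

```python
import re

LSQ, RSQ = "\u2018", "\u2019"  # curly single quotes
LDQ, RDQ = "\u201c", "\u201d"  # curly double quotes
THIN = "\u2009"                # thin space


def normalize_translation(text):
    """Hypothetical sketch of the quote policy; not slexil's own code."""
    # Error 2: drop whitespace at either end of the line.
    core = text.strip()
    # Errors 3-5: remove at most one outer single quote per side,
    # curly or straight, so that a consistent pair can be added back.
    if core[:1] in (LSQ, "'"):
        core = core[1:]
    if core[-1:] in (RSQ, "'"):
        core = core[:-1]
    # Error 2 again: whitespace just inside the outer quotes.
    core = core.strip()
    # Direct speech: a straight double quote at the start, or after
    # whitespace or an opening bracket, is taken to be an opening quote.
    core = re.sub(r'(?:^|(?<=[\s(\[]))"', LDQ, core)
    # Any straight double quote that remains is taken to be closing.
    core = core.replace('"', RDQ)
    wrapped = LSQ + core + RSQ
    # Separate adjacent single and double quotes with a thin space.
    wrapped = wrapped.replace(LSQ + LDQ, LSQ + THIN + LDQ)
    wrapped = wrapped.replace(RDQ + RSQ, RDQ + THIN + RSQ)
    return wrapped
```

On the thin-space question: the character exists as U+2009 and can be written in HTML as the named entity `&thinsp;`. Punctuation placement around the quotes (point 3) is not attempted here.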

paul-shannon commented 5 years ago

Hi David,

Maybe we could hand-craft an eaf file, a mini-text, which contains all of the hard cases we need to solve, separately and in combination? Extra white space, mixed use of curly and straight single and double quotes, no quotes, a line with direct speech, a straight apostrophe from a practical orthography?

davidjamesbeck commented 5 years ago

Okay, I can do that (over the next couple of days).

David

davidjamesbeck commented 5 years ago

I have the quotes and double quotes tests done and working with the new TranslationLine class, and the Lokono translation line that breaks make now passes (I pasted the line into the Chatino_FaultyAuthorExamples EAF). I actually don't understand the error raised by test_IjalLine.py; it looks to me as though the assertion that throws the error is actually correct (that is, the line is what the assertion says it is). I added a raise Exception to the test file and get the following output:

iMac-2:tests David$ make
python test_MorphemeGloss.py
--- test_inferno
--- test_toHTML_sampleLine_0
--- test_toHTML_sampleLine_1
--- test_toHTML_sampleLine_2
--- test_toHTML_sampleLine_3
--- test_toHTML_sampleLine_4
--- test_toHTML_sampleLine_5
python test_IjalLine.py
--- test_buildTable
--- test_lokono_line_3
Traceback (most recent call last):
  File "test_IjalLine.py", line 85, in test_lokono_line_3
    assert(x3.getTranslation() == "‘[a] child, a woman as well.'")
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_IjalLine.py", line 405, in <module>
    runTests()
  File "test_IjalLine.py", line 16, in runTests
    test_lokono_line_3() # each morpheme and gloss are separate xml tier elements
  File "test_IjalLine.py", line 87, in test_lokono_line_3
    raise Exception(x3.getTranslation()) from e
Exception: ‘[a] child, a woman as well.'’
make: *** [ijalLine] Error 1

The assertion says that the translation line is ‘[a] child, a woman as well.' and the raised exception says the line is ‘[a] child, a woman as well.' They look the same to me.

Anyway, this might disappear if we start using the TranslationLine class?
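
Strings that look alike on screen can be compared code point by code point. The sketch below assumes the two values are exactly as they appear in the output above: the string in the assertion, and the string printed by the raised exception. The helper name `describe` is made up for this example.

```python
import unicodedata

# String literal from the assertion in test_IjalLine.py.
expected = "\u2018[a] child, a woman as well.'"
# Value printed by the raised exception.
actual = "\u2018[a] child, a woman as well.'\u2019"


def describe(s):
    """List each character as 'U+XXXX NAME' (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]


print(expected == actual)      # False
print(describe(expected)[-2:])
# ['U+002E FULL STOP', 'U+0027 APOSTROPHE']
print(describe(actual)[-2:])
# ['U+0027 APOSTROPHE', 'U+2019 RIGHT SINGLE QUOTATION MARK']
```

Under that assumption the returned value carries one extra character: a curly closing quote (U+2019) after the straight apostrophe (U+0027), which is easy to miss when the two sit next to each other.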

paul-shannon commented 5 years ago

@davidjamesbeck

Very nice work on the TranslationLine class and its tests. You are now a test-driven developer! Which requires more than a few conceptual leaps. Thank you for hanging on throughout the learning curve.

I just added an additional test to tests/test_TranslationLine.py to figure out the lokono line 3 problem reported by test_IjalLine.py.

I don't know if we will need to keep this test. I noticed that your version of this line in ChatinoFaulty does not include the troublesome straight single quote. But I think this test, reading directly from the Lokono text, is useful for:

paul-shannon commented 5 years ago

@davidjamesbeck I am not sure I understand your use of try/except in your test_TranslationLine.py. My own minimalist approach is to use only assert, and to see the test fail when an assertion is False.

But maybe there is virtue in your approach. When you have a moment could you explain?

davidjamesbeck commented 5 years ago

Hi, Paul

I was using the exceptions so I could compare the output of the processes I was writing to the desired forms spelled out by the exceptions. For instance, when I was working on the thin spaces between quotes, I got AssertionErrors, but all that was telling me was that I was messing up. By raising the exception and printing out the output from the process being tested by the assertion, I was able to see that I wasn’t escaping the thin space Unicode characters correctly. Likewise, I wasn’t aware that string.strip() applied to thin space characters as well as normal space characters until I saw it in operation.

So sometimes it is useful—maybe more so for me than for someone who is better at figuring out what is going wrong from just working through the code. If I’d been using an IDE this wouldn’t have been necessary, either, but I’ve just been using IDLE for this.

David
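
Both points in this comment can be illustrated briefly. The sketch below is one possible way to combine the two approaches, not the code in test_TranslationLine.py: an assert statement accepts an optional message, so the offending value can be shown without a separate try/except. The helper name `check` is hypothetical.

```python
THIN = "\u2009"  # thin space


def check(actual, expected):
    """Assert equality and show both values on failure (illustrative)."""
    # ascii() escapes every non-ASCII character, so a thin space shows
    # up as \u2009 in the message instead of looking like a plain space.
    assert actual == expected, f"got {ascii(actual)}, expected {ascii(expected)}"


padded = THIN + "text" + THIN
# str.strip() with no arguments removes all Unicode whitespace,
# and U+2009 counts as whitespace.
print(padded.strip() == "text")  # True
print(ascii(padded))             # '\u2009text\u2009'
```

A failing call such as `check("a" + THIN, "a")` raises an AssertionError whose message reads `got 'a\u2009', expected 'a'`.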
