translatable-exegetical-tools / Abbott-Smith

Abbott-Smith's Manual Greek Lexicon

Scan omitted Grammar tagging in many instances #60

Closed destatez closed 7 years ago

destatez commented 7 years ago

We should identify below all of the grammar abbreviations that occur which should have the grammar tagging around them, e.g. adv. for an adverb. It should be possible to develop a script that does a global replace (insertion of the tagging) for each instance that is not already tagged. The list of these abbreviations can be extracted from section "I. GENERAL." at the beginning of the XML file.

Most of the current instances of this tagging occur after the `<orth>` tag-pair and the `<etym>` tag-pair and before the first `<sense>` tag-pair, but there are also current instances that are a part of the contents of a `<sense>` tag-pair. A decision will need to be made, when developing and running this script, whether the "replacements" should be made only before the first `<sense>` tag-pair or wherever the abbreviations occur.
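As a rough illustration of the kind of global replace described above (not an agreed implementation), here is a minimal Python sketch. It assumes the abbreviation list from section "I. GENERAL." has already been extracted to a plain-text file (here called abbrevs.txt, one abbreviation per line) and that the target element is `<pos>`; the file names are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: wrap untagged grammar abbreviations in <pos> tags.

Assumes abbrevs.txt holds the abbreviations extracted from section
"I. GENERAL." (one per line, e.g. "adv.", "adj."). File names are
placeholders, and the "already tagged" check is deliberately naive:
it only looks at the markup immediately around each match.
"""
import re

with open("abbrevs.txt", encoding="utf-8") as f:
    abbrevs = [line.strip() for line in f if line.strip()]

with open("abbott-smith.tei.xml", encoding="utf-8") as f:
    text = f.read()

for abbr in abbrevs:
    # Match the abbreviation only when it is not already wrapped in <pos>...</pos>.
    pattern = re.compile(r"(?<!<pos>)\b" + re.escape(abbr) + r"(?!</pos>)")
    text = pattern.sub(r"<pos>\g<0></pos>", text)

with open("abbott-smith.tagged.xml", "w", encoding="utf-8") as f:
    f.write(text)
```

A pass like this would still need manual review afterward, since a plain-text replace knows nothing about which element it is writing into.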

cbearden commented 7 years ago

Hi David,

This sounds good. I have a pair of scripts ('scripts/find_foreign.py' and 'scripts/fix_foreign.py') that I used to find segments of Greek & Hebrew text that weren't enclosed in `<foreign>` tags and to add the tags. Possibly they could be adapted to this purpose as well. I may not get to that immediately, so others may beat me to the punch with a different approach.

I think there are a number of ways we could use scripts or XQuery to make the analysis and fixing of the markup faster. My immediate focus is on making the document valid TEI/OSIS again.
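For what it's worth, a very rough sketch of the "find" half of that idea follows. This is not the actual scripts/find_foreign.py, just an illustration of the approach under the assumption that Greek text can be spotted by Unicode range; the input path is a placeholder.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: report runs of Greek characters that do not
appear to be wrapped in <foreign> tags. Not the project's find_foreign.py;
the input path is a placeholder and the "already tagged" test is crude.
"""
import re

# One or more characters from the Greek and Greek Extended blocks.
GREEK_RUN = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")
OPEN_TAG = re.compile(r"<foreign\b[^>]*>\s*$")

with open("abbott-smith.tei.xml", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        for m in GREEK_RUN.finditer(line):
            before = line[: m.start()]
            after = line[m.end():]
            # Treat the run as tagged if it sits directly inside <foreign>...</foreign>.
            if OPEN_TAG.search(before) and after.lstrip().startswith("</foreign>"):
                continue
            print(f"{lineno}: possibly untagged Greek: {m.group(0)}")
```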

All the best, Chuck


destatez commented 7 years ago

Charles

That sounds like a plan. I have been using Perl to do all sorts of global replacements for the ULB, UDB, Notes, tW, etc. Either tool can do the job. My thoughts on this particular topic and Issue 59 were to wait until all manual editing is complete and then use the scripts to "catch" any instances that were missed by the editors.
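A sketch of that "catch what was missed" pass, report-only rather than replacing anything, might look like the following. It reuses the same assumptions as the earlier sketch (an abbrevs.txt list and placeholder file names).

```python
#!/usr/bin/env python3
"""Sketch of a post-editing check: count grammar abbreviations that still
appear outside <pos> tags, so editors can review what was missed.
Assumes abbrevs.txt and a placeholder XML path, as in the earlier sketch.
"""
import re
from collections import Counter

with open("abbrevs.txt", encoding="utf-8") as f:
    abbrevs = [line.strip() for line in f if line.strip()]

with open("abbott-smith.tei.xml", encoding="utf-8") as f:
    text = f.read()

missed = Counter()
for abbr in abbrevs:
    # Same naive "not already tagged" test as in the replacement sketch.
    pattern = re.compile(r"(?<!<pos>)\b" + re.escape(abbr) + r"(?!</pos>)")
    missed[abbr] = len(pattern.findall(text))

for abbr, count in missed.most_common():
    if count:
        print(f"{abbr}\t{count} untagged occurrence(s)")
```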

Dave


cbearden commented 7 years ago

Hi Dave,

Would it be good to have a channel for general communications about the project, so as not to overload the GitHub 'issues' feature with more general topics? I don't know any way to contact you other than responding to this issue.

There is a Google Group ("TExT: Abbott-Smith Project"), but the last posts in it were from me, about my efforts to tag Greek & Hebrew with `<foreign>`, about a year ago. For instance, I don't know anything about the work of manual review that is evidently going on (which is great news!).

I'd like to get the XML file into valid shape, but I don't want to make life harder for those trying to merge my work with the results of their manual review. Also, I think we'll need to discuss some markup choices.

Would it make sense to use the Google Group for general coordination and discussion, or is there another, better channel?

All the best, Chuck


destatez commented 7 years ago

Charles

I just got connected to that Google Group. That sounds like a good means of communication. We really need to get Chapel and possibly Todd connected to it, since they are the leads. I cc'd them on this reply. I am just an editor and tool-guy.

Dave


destatez commented 7 years ago

Charles

It's taking some time to get approved for that Google Group, though I thought I had received a message that I was. So I can't answer you via a post against your latest topic. You can either wait until I get approved, or you can pass me your email address and I can send you a message about what the editors are doing. Your pick.

Dave


dowens76 commented 7 years ago

Dave, you've already been approved for the group using your Gmail address. I approved you almost immediately. Try sending an email to text-abbott-smith-project@googlegroups.com.

cbearden commented 7 years ago

Hi Dave,

I was able to see your post to the group with the subject "Group Acceptance". Looks like you are able to post now. If you didn't get a copy of the reply in your email inbox, perhaps you just need to edit your email preference settings for the group.

I'm looking forward to hearing about what's going on with the dictionary. I see you're with Wycliffe, which is very cool.

All the best, Chuck


toddlprice commented 7 years ago

Re: par. 2 of the 1st post: Yes, I do think that the grammar abbreviations even in the Sense sections should be tagged. This might be a bit beyond the original scope of making a digital representation of A-S, so perhaps this should wait until Stage 2 and be considered part of the UGL. What I mean is that I see use for it where the grammar tags in UGL can be linked to UGG so that these grammatical concepts are explained in our Grammar. That is beyond the Stage 1 goal.

toddlprice commented 7 years ago

Just to clarify: as part of digitizing A-S, we do want the grammar abbreviations to have `<pos>` tagging around them. This is valid and needed for Stage 1. But linking those tags to UGG needs to wait until Stage 2.

destatez commented 7 years ago

I have run across an issue on this topic. I have done searches of the XML looking for the POS "keywords" and have found instances of these that are part of a description, as well as what I would call viable instances. I have attached some examples of the search output and need a little clarification on what should and shouldn't be tagged. The keywords that I used were as follows; the search would find any word that started with the keyword, which is why I had to qualify some of them to keep others from appearing in the results: adj, adv, article, conj, interj, num, part, prep, pron, subst, art. (and NOT article), super (and NOT superscript), noun (and NOT pron), verb (and NOT adv).

Non-tagged-POS.txt
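For reference, one way to express that search and its "and NOT" qualifications in a script is sketched below. This is only an interpretation of the list above (each exclusion is read as a longer word that contains the keyword but belongs to another POS, e.g. "adverb" should not count as a hit for "verb"); the XML path is a placeholder.

```python
#!/usr/bin/env python3
"""Sketch of the keyword search described above. Each keyword maps to
substrings whose presence in the containing word disqualifies the hit.
This is only one reading of the "and NOT" qualifications; the XML path
is a placeholder.
"""
import re

KEYWORDS = {
    "adj": [], "adv": [], "article": [], "conj": [], "interj": [],
    "num": [], "part": [], "prep": [], "pron": [], "subst": [],
    "art": ["article"], "super": ["superscript"],
    "noun": ["pron"], "verb": ["adv"],
}

WORD = re.compile(r"[A-Za-z]+\.?")  # a word, optionally with a trailing period

with open("abbott-smith.tei.xml", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        for m in WORD.finditer(line):
            word = m.group(0)
            for keyword, excluded in KEYWORDS.items():
                if keyword in word and not any(ex in word for ex in excluded):
                    print(f"{lineno}\t{keyword}\t{word}")
```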

toddlprice commented 7 years ago

I think the examples in your txt file (verb, part and art) should not be tagged. It looks like ptcp. should be tagged since it is used in lexical entries rather than in 'running text'.

destatez commented 7 years ago

I am concerned about the current state of the pos tags in A-S. There are currently 53 different "values" that are tagged in the XML (see A_S_XML_pos_instance_text.txt). {I combined instances that were abbreviations or variations of abbreviations for those listed.} There are a total of 357 instances where these are tagged, with 29 of them being within the sense data (see A_S_pos_sense_Instances.txt). The remainder are within the orth data or etym data, which is where I would have expected them. My questions, as they relate to automating the tagging of the XML file, are:

  1. Should I tag only instances that are within the orth or etym data, or should I also include the instances in the sense data?
  2. What text should I search for to do this tagging? {I put the list from the Issue in the file Possible_pos_values.txt, where I moved article, part, and verb to the DO NOT include list.} Could you review this list and move to the DO NOT include list any other values that you believe I should "ignore" for this tagging?

A_S_XML_pos_instance_text.txt

A_S_pos_sense_Instances.txt

Possible_pos_values.txt
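To make question 1 concrete, here is a sketch of the kind of count behind those numbers: tallying `<pos>` elements by the nearest enclosing orth, etym, or sense element. The element names follow this thread's discussion, the file path is a placeholder, and a namespace-stripping helper is used so the sketch works whether or not the TEI namespace is declared.

```python
#!/usr/bin/env python3
"""Sketch: tally <pos> elements by nearest orth/etym/sense ancestor.
Element names and the file path are assumptions based on this thread.
"""
from collections import Counter
import xml.etree.ElementTree as ET

tree = ET.parse("abbott-smith.tei.xml")
root = tree.getroot()

# Build a child -> parent map so we can walk upward from each <pos>.
parent_of = {child: parent for parent in root.iter() for child in parent}

def local(tag):
    """Strip any namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

counts = Counter()
for elem in root.iter():
    if local(elem.tag) != "pos":
        continue
    # Record the nearest ancestor that is orth, etym, or sense (if any).
    node, context = parent_of.get(elem), "other"
    while node is not None:
        if local(node.tag) in ("orth", "etym", "sense"):
            context = local(node.tag)
            break
        node = parent_of.get(node)
    counts[context] += 1

for context, n in counts.most_common():
    print(f"{context}\t{n}")
```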

toddlprice commented 7 years ago

Only tag what is in the orth and etym data.
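Given that answer, the replacement sketch from earlier could be scoped so that `<sense>` content is left untouched, for example along the lines below. The same assumptions apply (abbrevs.txt, placeholder file names), and the split makes no attempt to handle nested sense elements.

```python
#!/usr/bin/env python3
"""Sketch: apply the <pos> wrapping only outside <sense>...</sense> spans.
Assumes abbrevs.txt and placeholder file names, as in the earlier sketch;
nested <sense> elements are not handled by this simple split.
"""
import re

with open("abbrevs.txt", encoding="utf-8") as f:
    abbrevs = [line.strip() for line in f if line.strip()]

with open("abbott-smith.tei.xml", encoding="utf-8") as f:
    text = f.read()

def wrap_abbrevs(segment):
    """Wrap not-yet-tagged abbreviations in <pos> within one text segment."""
    for abbr in abbrevs:
        pattern = re.compile(r"(?<!<pos>)\b" + re.escape(abbr) + r"(?!</pos>)")
        segment = pattern.sub(r"<pos>\g<0></pos>", segment)
    return segment

# Split the document so <sense>...</sense> spans are kept as-is and only
# the text between them is rewritten.
parts = re.split(r"(<sense\b.*?</sense>)", text, flags=re.DOTALL)
out = "".join(p if p.startswith("<sense") else wrap_abbrevs(p) for p in parts)

with open("abbott-smith.tagged.xml", "w", encoding="utf-8") as f:
    f.write(out)
```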

destatez commented 7 years ago

Updated the XML; only 13 changes were needed once the scope was reduced to orth & etym.