openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Distinguishing morphology and grammar #10

Open peterdanielmyers opened 10 years ago

peterdanielmyers commented 10 years ago

Just parsed שגיא, which is, I believe, a masculine form, though grammatically it modifies a feminine noun.

Rosenthal states "Adjectives are placed after the nouns to which they belong and to which they conform grammatically as closely as possible." (§41).

Is there any way to express "X in form but Y in meaning"?

dowens76 commented 10 years ago

As far as I know, there is not yet a way to distinguish form and meaning (I know WHM does this). This would require a change in the parsing scheme.
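To make the request concrete, here is a minimal sketch of what "X in form but Y in meaning" could look like as data. It is not the project's parsing scheme: the `Features` and `Parsing` classes, their field names, and the idea of an optional `function` override are all assumptions made for illustration. The two sample values come from the cases raised in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Features:
    """Gender and number values; None means 'not specified here'."""
    gender: Optional[str] = None
    number: Optional[str] = None


@dataclass(frozen=True)
class Parsing:
    """Hypothetical record: 'form' is what the morphology shows,
    'function' optionally overrides individual features in context."""
    word: str
    form: Features
    function: Optional[Features] = None

    def effective(self) -> Features:
        # Fall back to the formal value wherever no override is given.
        if self.function is None:
            return self.form
        return Features(
            gender=self.function.gender or self.form.gender,
            number=self.function.number or self.form.number,
        )

    def is_mismatch(self) -> bool:
        # True when the contextual reading differs from the form.
        return self.effective() != self.form


# Plural in form, singular referent (as described in this thread).
panim = Parsing("פנים", form=Features(number="plural"),
                function=Features(number="singular"))

# Masculine in form, read as modifying a feminine noun.
saggi = Parsing("שגיא", form=Features(gender="masculine"),
                function=Features(gender="feminine"))

print(panim.effective().number, panim.is_mismatch())
print(saggi.effective().gender, saggi.is_mismatch())
```

Keeping the formal value and the contextual value in separate fields means neither one overwrites the other, so a search can target either layer.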

peterdanielmyers commented 10 years ago

Noticed the same problem again with פנים. Always a plural morphologically, but always denotes a singular referent.

All the more reason, I think, for me to make those adjustments to the parser so that we can have a more flexible database. The whole idea of morph codes is pretty 20th century, and rather than just emulating WHM badly, I think we should be designing the tool from the ground up to do the job it needs to do.

I'll be getting to that work next week, if not before.

yaaqov commented 10 years ago

For the morphological tagging to have academic merit (our target audience), it would need to simply reflect the word as it "is", not as it "ought" to be in the eyes of the data worker. The idea is that a scholar could find a fairly objective measure of seemingly "anomalous" orthographies by querying this data, and that a competent Hebrew scholar could assess whether there are indications in other texts of similar phenomena. For the data itself to try to jump to these conclusions I think is asking for more than the original scope requests.

For example, if a plural noun ('Elohim) is the subject of a singular verb (Bara'), or a typically feminine noun (ruaH [wind/spirit]) is described by a masculine adjective (ra` [evil/bad]), the tagging should say precisely that, and leave the conclusions of "plural of majesty" or "gender neutrality" to others.
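As a rough sketch of the kind of query this would allow: if each word carries only its formal tags, apparent disagreements can be found mechanically and left for a scholar to interpret. Everything here is hypothetical, including the `Token` and `Link` types and the assumption that subject-verb and noun-adjective pairs are already known from some other source; the morphology alone does not supply those links. The sample values are the two cases named above.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Token(NamedTuple):
    """A word tagged strictly by its form (hypothetical layout)."""
    text: str
    gender: str
    number: str


class Link(NamedTuple):
    """A pair of words expected to agree (assumed to be given)."""
    head: Token
    dependent: Token
    relation: str


def disagreements(
    links: Iterable[Link],
    features: Tuple[str, ...] = ("gender", "number"),
) -> Iterator[Tuple[Link, Tuple[str, ...]]]:
    """Yield each link whose two words differ in any listed feature,
    together with the names of the differing features."""
    for link in links:
        diffs = tuple(
            f for f in features
            if getattr(link.head, f) != getattr(link.dependent, f)
        )
        if diffs:
            yield link, diffs


elohim = Token("'Elohim", "masculine", "plural")
bara = Token("Bara'", "masculine", "singular")
ruah = Token("ruaH", "feminine", "singular")
ra = Token("ra`", "masculine", "singular")

links = [
    Link(elohim, bara, "subject-verb"),
    Link(ruah, ra, "noun-adjective"),
]

for link, diffs in disagreements(links):
    print(link.head.text, link.dependent.text, link.relation, diffs)
```

The query reports only that the forms differ and in which feature; it draws no conclusion about why.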

peterdanielmyers commented 10 years ago

Ya'aqov,

I think there's some discussion over the "target audience". But you've put your finger on precisely the reason why I'm raising these things as I parse: to flag them up so that a decision on consistency can be reached.

I'd say it's not quite as simple as you suggest. I don't think it's massively helpful to have אלהים always parsed as a plural, for example. I think that someone searching the database might legitimately search on that word to try to find all the places where it is used of pagan gods, and expect the tagging to reflect this.

There is no automatic mechanisation of this process. Morphological tagging is inherently subjective. That's why I've pushed for us to unite around a common standard of grammar, and one that reflects recent academic discussion on Hebrew.

yaaqov commented 10 years ago

I appreciate the focus on consistency, and since we are not speaking face to face, I hope my tone conveys respect for what is being discussed. In the other semantic datasets I work with professionally, big-picture judgments are kept out of the tagging process.

In tagging Congressional bills, for example, the name of a Representative will have a district and a party, but we wouldn't trust a system that added fields for "moderate/leftist/right wing extremist". Political scientists would be able to judge the centrism of a bill by examining the more objective data points in such digitized primary sources.

Once the entire corpus has a baseline edition, future versions could have more advanced renderings, especially if we could be funded by a grant program, like many other digital humanities projects nowadays.

Perhaps we could schedule an online meeting to hash these issues out?

I wish you all the best, and will be offline for PesaH/Hag MaSSot.

Ya'aqov

peterdanielmyers commented 10 years ago

Hi Ya'aqov,

Nothing wrong with tone! Hope mine is ok too.

I'm quite new to this project too, so I'm not really in a senior position. I guess what I'd reflect back is that linguistics is subjective, even in the "basics" as you say. So it's not quite the same as keeping track of the details of candidates.

For example: RMW Dixon in his Basic Linguistic Theory argues, I think rightly, that linguistic typological categories like "noun", "adjective", "verb", etc. don't really have a consistent definition across languages (I am oversimplifying him here somewhat). Even with something as simple as "number" of nouns, languages are different. Take the linguistic concept of "plural". In some languages, "plural" means "not singular" because the language only has two categories of number: singular and plural. But another language might have "dual" forms, and yet another "trial" forms. Between these languages, the definition of "plural" is different. It might mean "not singular" or "not singular or dual", etc.
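A toy illustration of the point about "plural", offered only as a sketch: the same count lands in a different category depending on which number categories a language has. The system names, the `QUANTITY` table, and the `category_for` function are invented for this example, the three-item category is labeled "trial" (the usual linguistic term), and every category is treated as strict, which real languages often do not do.

```python
from typing import Tuple

# Hypothetical number systems, listed from most to least specific.
TWO_WAY: Tuple[str, ...] = ("singular", "plural")
THREE_WAY: Tuple[str, ...] = ("singular", "dual", "plural")
FOUR_WAY: Tuple[str, ...] = ("singular", "dual", "trial", "plural")

# Exact quantity each non-plural category stands for.
QUANTITY = {"singular": 1, "dual": 2, "trial": 3}


def category_for(count: int, system: Tuple[str, ...]) -> str:
    """Return the category a count of items falls into in the given
    system, treating 'plural' as 'everything not otherwise covered'."""
    if count < 1:
        raise ValueError("count must be at least 1")
    for category in system:
        if QUANTITY.get(category) == count:
            return category
    return "plural"


def plural_means_not(system: Tuple[str, ...]) -> Tuple[str, ...]:
    """List the categories that 'plural' excludes in this system."""
    return tuple(c for c in system if c != "plural")


for system in (TWO_WAY, THREE_WAY, FOUR_WAY):
    print(system, category_for(2, system), plural_means_not(system))
```

Two items count as "plural" in the two-way system but as "dual" in the other two, so a bare "plural" tag means something different in each.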

To extend the observation further: a category like "dual" functions differently in different languages—and even at different stages within the same language. It might be a strong category where two items or pairs of items are consistently expressed using the dual number. Or, it might be a weak category, where there is lots of overlap between the dual and plural. In Biblical Hebrew we see the situation where some words belonging to particular semantic domains seem to be particularly attracted to the dual (body parts), whereas other words from other semantic domains will pretty much only use the singular or plural forms.

There are no concrete, hermetically sealed categories in linguistics. If you read JM on many of the issues we've discussed, you'll see, for example, that Muraoka is often careful to talk about morphemes displaying different grammatical features simultaneously. Sometimes an infinitive displays its character as a noun; other times its character as a verb is more prominent. When parsing, this is particularly relevant for the supposed category of "adjectives". What is an "adjective" in Biblical Hebrew? Many that have derived from stative verbs still sometimes look like stative verbs. What about adverbs too? The word מאד: we would translate its force into English using an adverb, but it's really a noun. Occasionally it modifies another substantive, in which case it looks more like an adjective.

That's what I mean by linguistics being subjective. You can't parse with mathematical precision, and I'm not convinced that leaving all these questions open for discussion is massively helpful to the project. Over the last few days I've parsed Genesis 1; Obadiah; the Aramaic bits in Genesis 31:47 and Jer 10:11; and some parts of Daniel 2 and Gen 2. Doing that parsing, I come across all sorts of subjective questions like the ones I've been raising on GitHub. They are unresolved, nobody has replied to most of them, and I just have to move on and carry on parsing. I'd much rather just get on with parsing than talk about it! It's a little bit of a waste of time, really, because I'm not sure I've been consistent in what I've done, or always remembered my decisions, so I'm spending my time creating work that has to be corrected.

Because this is a subjective process, we can't just mechanically process the words, and that is also why I think we need some kind of reference text. If I have a question on Hebrew, I pretty much know that Muraoka has a considered opinion, and similarly Rosenthal on Aramaic. By using them as a reference, hopefully the decisions made will have a consistency to them.