Strong's Numbers, Strong's Plus, etc.

jonathanrobie commented 6 years ago

The original Abbott-Smith did not use Strong's numbers. This edition uses Strong's numbers and includes an extended form of Strong's numbers created by Alan Bunning. This is one of several approaches to extending Strong's numbers.

James Tauber's morphological lexicon collates the most common schemes for lemmatization here:

https://raw.githubusercontent.com/morphgnt/morphological-lexicon/master/lexemes.yaml

biblicalhumanities.org would like to see one scheme for lemmatization adopted widely, and I think James Tauber has probably done the most work on this particular issue.

destatez commented 6 years ago

In that file, what does the gk field stand for, and where does it get its values? It starts out matching Strong's, but then diverges.

jonathanrobie commented 6 years ago

I assume these are Goodrick-Kohlenberger numbers. I'll point James at this issue too.

jonathanrobie commented 6 years ago

A comment on what you said here:

https://github.com/translatable-exegetical-tools/Abbott-Smith/issues/91#issuecomment-366950449

If it's called a Strong's number in the markup, I think it should be a Strong's number. The world is littered with things that claim to be Strong's numbers and are not because individuals decided to "improve" the scheme and did not always even document what they were changing. That means that if you find a Strong's number in one resource and look it up in another resource, it does not work.

There are several "extended" Strong's numbers, but most of these are not designed to be extended to a larger corpus that might include, for instance, the Church Fathers. That's a problem if you want Catenae (a resource that shows where each passage is quoted and commented on in the Church Fathers) or similar resources. And for linguistic work, we also want the papyri and contemporary Hellenistic literature.

There is clearly a need for identifiers that cover all known lexemes, and that's what James is working on. That's a need that spans resources and projects.

destatez commented 6 years ago

Check out https://hermeneutics.stackexchange.com/questions/20024/what-do-the-goodrick-kohlenberger-numbers-represent-what-features-does-this-s for more on the Goodrick-Kohlenberger numbers. It does not look like this numbering scheme has a big following. Since most folk are on board with Strong's, Alan's idea may be a better option.

destatez commented 6 years ago

Alan's scheme IS bridging other Greek resources. I believe that was his main driver.

jag3773 commented 6 years ago

@jonathanrobie What exactly are you proposing? Are you suggesting that we should put the original Strong's numbers into A-S?

The link you posted doesn't appear to provide a distinct scheme; it provides a disambiguation file that would allow you to cross-reference multiple resources. That would include Alan's Strong's numbering scheme, since it doesn't break compatibility with the original Strong's numbers.
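To illustrate how a disambiguation file can act as a cross-reference rather than a scheme of its own, here is a minimal Python sketch. It assumes each lemma maps to a set of per-scheme numbers; the `strongs` and `gk` field names follow the discussion above, but the lemmas and numbers are hypothetical placeholders, not values taken from lexemes.yaml.

```python
# Hypothetical records: lemma -> identifiers in each numbering scheme.
# The lemmas and numbers are placeholders, not real lexicon data.
LEXEMES = {
    "lemma_a": {"strongs": 101, "gk": 101},
    "lemma_b": {"strongs": 102, "gk": 103},  # the two schemes diverge here
    "lemma_c": {"gk": 104},                  # no Strong's number at all
}


def index_by(scheme, lexemes):
    """Build a reverse index: number in `scheme` -> list of lemmas."""
    index = {}
    for lemma, ids in lexemes.items():
        if scheme in ids:
            index.setdefault(ids[scheme], []).append(lemma)
    return index


def convert(number, source, target, lexemes):
    """Translate a number from one scheme to another by going through the lemma."""
    lemmas = index_by(source, lexemes).get(number, [])
    return sorted({lexemes[l][target] for l in lemmas if target in lexemes[l]})


print(convert(102, "strongs", "gk", LEXEMES))  # [103]
print(convert(104, "gk", "strongs", LEXEMES))  # []
```

Because the lookup goes through the lemma, any additional scheme would just be one more field on each record.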

jonathanrobie commented 6 years ago

@jag3773 There are multiple possible solutions, but I would like to satisfy these requirements:

1. If A-S has numbers called Strong's numbers, they should not break compatibility.
2. For identifiers not covered by Strong, we should use the same solution so that the things biblicalhumanities.org is working on are compatible with the things you are working on. Especially since there are some resources, like this one, that we both work on. On our side, James Tauber is the morphology lead.

These numbers actually occur inside the lexicon, and I want cross-reference to work across all source language resources.

jonathanrobie commented 6 years ago

> Since most folk are on board with Strong's, Alan's idea may be a better option.

I don't think we should use GK numbers: they are copyrighted and we cannot extend them. But currently, nobody is using Alan's system, and we have two groups that are each investing in different approaches that are not compatible. So what I want is to carefully consider how best to move forward, and I would like James, Ulrik, and Alan to be part of the conversation so that we can get the stakeholders on the same page.

I want one solution, but I'm not invested in any particular solution.

destatez commented 6 years ago

I ran the following proposal by Todd and he gives it a thumbs up. Jonathan, what is your opinion on this proposal? And you, Jesse @jag3773 ?

I'm proposing that we rewrite the baseline A-S XML with Strong's Plus IDs and manually update the undefined G??? IDs to what Daniel has determined, but using the Strong's Plus numbers, not the standard Strong's numbers followed by a letter. We can then re-baseline that for historical reasons. I can then re-run my latest script to reformat the XML with the 2 new attributes, save that back into the original filename, and baseline that as the reference moving forward. A slight spin on this would be to use the attribute name "strongsplus" instead of "strong" to make sure that users of the XML understand this modified numbering scheme.

jonathanrobie commented 6 years ago

I would like to hear from James and Ulrik and Alan before making a decision. Making the right decision is probably more important than making a quick decision, though we shouldn't dawdle either.

I see several areas of incompatibility with Strong's numbers in the enhanced numbering scheme described in Alan's project description here: https://greekcntr.org/downloads/project.pdf. I would like to think carefully about how that affects compatibility. Most resources will be indexed to traditional Strong's numbers.

Would it be helpful to use the prefix to clearly flag a number that differs from Strong's? For instance, we could use a B to indicate a Bunning number and the traditional G or H to indicate a Strong's number?

In the long run, I think we need something that can handle any Greek or Hebrew word.

jtauber commented 6 years ago

Ulrik and I argued in our 2006 paper[1] (and I argued again in my SBL 2017 talk[2]) that a single number-to-lexeme mapping is insufficient and problematic. That is not to say we can't secondarily reference Strong's numbers (or Bunning numbers/ESN) even with their limitations, but that's just part of the story and shouldn't be the ultimate identifier. Other perfectly valid schemes to "reference" are A-S headwords, BDAG headwords, or LSJ headwords, but each by itself has problems. My proposal has always been a system that recognises that all these resources differ in how they lump and split lexemes, and that avoids the myth that there is only ONE RIGHT WAY of doing this.

One thing we absolutely should avoid is conflating multiple words under a single Strong's number and still claiming they are Strong's numbers. (This alone renders many resources that "use Strong's numbers" partially useless.)

As part of the work I'm doing with the Perseus project, I'm working on this issue for the entire Perseus corpus and I'm working with others to extend it to Greek papyri as well.

To the extent that people have linked together lexemes that have been lumped by some and split by others, this information is useful (if made freely available in machine-actionable form).

Claiming ONE TRUE LUMPING/SPLITTING is a lot less helpful.

[1] https://www.academia.edu/19660777/A_New_Numbering_System_for_Greek_New_Testament_Lexemes_2006_
[2] https://vimeo.com/243936959

destatez commented 6 years ago

The idea of having unique prefixes for non-standard Strong's numbers appeals to me. From a tooling standpoint, particularly with our Greek lexicon, it raises some significant implementation and configuration issues. It means that we should intelligently drop the 5-digit number back down to 4, but that number is used for folder names and is embedded in every lemma file, both for its own reference and for any link to other lemma files (folders). I would want to get the Greek New Testament tooling person in the loop to determine the impact there. Even with Alan's spreadsheet being the driver for that, it should not be that difficult to reformat his 5-digit numbers to 4 digits where the ones digit is zero, and to use a different prefix for cases where it is not zero.

(Can we add Todd Price as a participant and get his input on this aspect?)

@jag3773 Even with my work on the tool to create the initial Hebrew and Aramaic lexicon, what are your thoughts on this numbering scheme topic? There was some earlier discussion about the "standard" Strong's numbers not fitting both the Hebrew and the Aramaic. Would the standard 4-digit Strong's numbers fit for the Hebrew side, specifically? Could you use a different prefix for the Aramaic when it does fit the standard Strong's numbers? (My exposure to Aramaic is close to nil.)

jonathanrobie commented 6 years ago

To me, the most important immediate issue is that many resources use Strong's numbers, not just Alan's GNT, so we want to make sure that bog standard Strong's numbers are supported.

But over time, we also need a better way of doing lemmas in addition to this. Consider entries like <entry n="λέγω|G2036|G2046|G3004|G4483">, <entry n="εἰ|G1487|G1489|G1490|G1499|G1508|G1509|G1512|G1513">, <entry n="αὐτός|G846|G4571|G4671|G4675|G5209|G5210|G5213"> and <entry n="ἐγώ|G1473|G1691|G1698|G1700|G2248|G2249|G2254|G2257|G3165|G3427|G3450">. I'm not sure there's an easy short-term fix, but these clearly indicate that what we have is not an identifier for a lexeme.
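A minimal Python sketch of how such entries could be flagged mechanically. It assumes only what the examples above show, namely that the `n` attribute is the lemma followed by `|`-separated numbers; the function name `parse_entry_key` is hypothetical.

```python
def parse_entry_key(n):
    """Split an entry's n attribute into (lemma, list of Strong's numbers)."""
    lemma, *numbers = n.split("|")
    return lemma, numbers


# Two of the n attributes quoted above.
keys = [
    "λέγω|G2036|G2046|G3004|G4483",
    "αὐτός|G846|G4571|G4671|G4675|G5209|G5210|G5213",
]

for key in keys:
    lemma, numbers = parse_entry_key(key)
    if len(numbers) > 1:
        # More than one number per headword: the number is not a lexeme identifier.
        print(f"{lemma}: {len(numbers)} Strong's numbers")
```

A report like this would at least show how many headwords are affected before anyone decides on a fix.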

destatez commented 6 years ago

I agree with point one. That was why I was proposing that we use only standard Strong's numbers with the G (or H) prefix, and then for any of Alan's "long" Strong's numbers we would use a 5-digit number to match his, but preface that with the letter B (Bunning).

I had forgotten about the second point. For those cases, when I was generating the ugl files, I treated them as undefined Strong's numbers so that the ugl team could determine what the correct answer was. I was assuming that there was only one Strong's ID for each Greek word.
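A rough Python sketch of that relabelling rule, under the assumption (taken from this thread, not checked against Alan's spreadsheet) that an extended number is the Strong's number with one extra trailing digit, where a trailing zero means "unchanged from Strong's". The function name `relabel` is hypothetical, and whether B numbers should be zero-padded is left open.

```python
def relabel(esn):
    """Map an extended (up to 5-digit) Greek number to a prefixed identifier.

    Assumption: esn == strongs * 10 + extra_digit, with extra_digit == 0
    meaning the entry is a plain Strong's number.
    """
    if not 0 < esn <= 99999:
        raise ValueError(f"expected a number of at most 5 digits, got {esn}")
    strongs, extra = divmod(esn, 10)
    if extra == 0:
        # Standard Strong's number: drop the trailing zero, keep the G prefix.
        return f"G{strongs}"
    # Not a standard Strong's number: keep all digits, flag it with B.
    return f"B{esn}"


print(relabel(24240))  # G2424
print(relabel(24241))  # B24241
```

The same function could be applied to folder names and to links inside the lemma files, which is where the tooling impact mentioned earlier would show up.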

translatable-exegetical-tools / Abbott-Smith

Strong's Numbers, Strong's Plus, etc. #93