translatable-exegetical-tools / Abbott-Smith

Abbott-Smith's Manual Greek Lexicon

This is no longer a valid XML file #101

Closed jonathanrobie closed 3 years ago

jonathanrobie commented 3 years ago

This used to be valid TEI P5 XML. It no longer validates. It should be fixed.

destatez commented 3 years ago

Jonathanrobie

Do you know the date of the last successful validation? We can then have GitHub give us the changes since then to try to locate the problem.

jonathanrobie commented 3 years ago

I don't actually know. I had my own fork, which is still valid: https://github.com/biblicalhumanities/Abbott-Smith

destatez commented 3 years ago

I will work on getting a git diff between your fork and the baseline. I’m not that close to XML. What tool do I use to validate it, and against what template/schema? Please provide a full link to the latter so I’m sure I’m validating against what you are using.

destatez commented 3 years ago

I cloned the tet master to a local fork. I then did a zip clone of your fork. Since we are dealing with a single text file, I used a text compare program, UltraCompare, on the abbott-smith.tei.xml files in the two folders. I created an HTML difference report for differences only, done at the word level and ignoring line terminators. There were ONLY 16,753 lines that were different out of the 58,409 total. I'm not sure where to go from here, since a lot of the differences were added tags, Hebrew words, and places where the n= was added. Maybe the best next step is to run the master through the validation and see where the errors come in. From your discussion, it would seem that you had done this to show that it wouldn't validate. Is there a way to get me that validation report so I can assess what would have to be altered in the master? You can get my email from my profile if that's the easiest way to get that to me.

jonathanrobie commented 3 years ago

Could we have a pre-commit hook that validates when someone checks in changes and rejects anything that does not validate?

The easiest way to fix it now might be to add the changes to the valid file, rather than try to make the invalid file valid. And I think you will need an XML editor like oXygen or an XML editing environment like the one in Emacs or whatever editor you use. An XML editor supports validation, and an editor like oXygen can help you visualize the structure, which is especially helpful if you haven't done lots of XML editing. For instance, this is easier to visualize than a bunch of tags if you haven't done a lot of XML:

image

The README.md tells you where to find the schema:

All text from the lexicon is marked up using CrossWire.org's iteration of TEI XML, which supports several features of OSIS XML that are relevant to biblical studies (especially biblical references). For helpful documentation on this iteration of TEI, see http://www.crosswire.org/wiki/TEI_Dictionaries. For the schema definition, see http://www.crosswire.org/OSIS/teiP5osis.1.4.xsd. For detailed documentation on TEI dictionaries, see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html.

You can also see the schema information in the first few lines of the XML file itself:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="./releases/abbott-smith.xsl"?>
<TEI xmlns="http://www.crosswire.org/2013/TEIOSIS/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crosswire.org/2013/TEIOSIS/namespace
        http://www.crosswire.org/OSIS/teiP5osis.2.5.0.xsd">

cbearden commented 3 years ago

Hi all,

The last commit in master that validates is from 2016/03/11: c06886cea495a3dc34ee3d17bce1499e61eca1d0. Since that time there have been 172 further commits by my count.

On 2018/02/19 (74413a6d93930576e5307d8971d21ad48f58fe1c), a second, parallel XML file was introduced into the repo, namely abbott-smith.tei_lemma.xml. It mirrors abbott-smith.tei.xml exactly apart from using the attributes @lemma and @strong in the entry element but not @n. For that reason, it will never validate. However, as long as the values for those two new attributes remain consistent, it would be a simple matter to store those values in an XML file and use an XSLT to generate the second "lemma" file from the first file. This approach would also solve the problem of the two files getting out of sync. There were a few such discrepancies, but I was able to find and fix them earlier this year.
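The idea of generating the "lemma" file from the main file plus a stored table of values can be sketched roughly as follows. The comment above proposes XSLT; this sketch swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The function name, the shape of the lookup table (keyed by the existing `n` value), and the sample values are all assumptions, not anything taken from the repository.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of abbott-smith.tei.xml.
TEI_NS = "http://www.crosswire.org/2013/TEIOSIS/namespace"


def derive_lemma_tree(root, mapping):
    """Replace @n on each <entry> with @lemma and @strong, in place.

    `mapping` is a hypothetical lookup table: n value -> (lemma, strong).
    Entries whose n value is not in the table are left unchanged.
    """
    for entry in root.iter(f"{{{TEI_NS}}}entry"):
        n = entry.get("n")
        if n in mapping:
            lemma, strong = mapping[n]
            del entry.attrib["n"]
            entry.set("lemma", lemma)
            entry.set("strong", strong)
    return root


# Illustrative input only; the real n values may look different.
SAMPLE = (
    f'<TEI xmlns="{TEI_NS}">'
    '<entry n="ἀγαθοποιέω"><form>ἀγαθοποιέω</form></entry>'
    "</TEI>"
)
root = ET.fromstring(SAMPLE)
derive_lemma_tree(root, {"ἀγαθοποιέω": ("ἀγαθοποιέω", "G15")})
```

With something along these lines, only the main file and the lookup table would need to be maintained by hand, so the two XML files could not drift apart.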

I'm not certain what the reason for the second file was, but I seem to recall there was some specific application those changes were meant to support. Dave might know.

Are we content with the way the text was marked up back when it was valid? One thing I recall having questions about was that sometimes LXX usage was marked up as etym/seg, when it wasn't an etymological note. See for example the entry for ἀγαθοποιέω. There is also a usg element, and like etym it may be contained by entry and it may contain seg. I realize it is possible to pursue such matters well into the weeds, but this one seemed worth talking about.

TEI P5 docs on dictionaries are here:

https://tei-c.org/release/doc/tei-p5-doc/en/html/DI.html

I have been interested in making the file valid TEI/OSIS again, but it has seemed like a daunting prospect.

All the best, Chuck

jonathanrobie commented 3 years ago

> On 2018/02/19 (74413a6), a second, parallel XML file was introduced into the repo, namely abbott-smith.tei_lemma.xml. It mirrors abbott-smith.tei.xml exactly apart from using the attributes @lemma and @strong in the entry element but not @n. For that reason, it will never validate. However, as long as the values for those two new attributes remain consistent, it would be a simple matter to store those values in an XML file and use an XSLT to generate the second "lemma" file from the first file.

We shouldn't ever check in an XML file that does not validate.

I think your proposed changes are good ones. People can generate various things from the main XML.

> Are we content with the way the text was marked up back when it was valid? One thing I recall having questions about was that sometimes LXX usage was marked up as etym/seg, when it wasn't an etymological note. See for example the entry for ἀγαθοποιέω. There is also a usg element, and like etym it may be contained by entry and it may contain seg. I realize it is possible to pursue such matters well into the weeds, but this one seemed worth talking about.

From my perspective, we really should revert to the last valid instance. Any changes to the structure need to validate; we can restructure AFTER we get it to validate again.

> I have been interested in making the file valid TEI/OSIS again, but it has seemed like a daunting prospect.

Yeah, I hear you. But I think it might be manageable if we do it this way:

cbearden commented 3 years ago

I created a fresh clone of the repo, generated a backwards patch from the current head of master back to the last valid commit (git diff 370ca9c2a851531c29bb2d9312fc3919fc35eea9 c06886cea495a3dc34ee3d17bce1499e61eca1d0 abbott-smith.tei.xml > backwards.patch), applied the patch (patch -p1 < backwards.patch), verified that its md5sum matched that of the last valid commit, and used meld to compare the head of master with the last valid state. I have attached a screenshot of the first screen of Α (I find meld a lot easier to make sense of than diff output). Here are the kinds of changes in that small sample:

One error introduced:

(v. Swete on Mk, I.c.)

in the last valid file is correct vs

(v. Swete on <ref osisRef="Mark.14.36">Mk 14:36</ref>)

in HEAD, which may have been introduced by a script, since the following text is also a reference to Mk 14:36.

The substitution of <emph> for <gloss> is an editorial change, and there may be others, so the going may be slow if we consider each of these. Using BaseX and XQuery, I count 2,557 <gloss> elements and 2,130 <emph> elements in the last valid file, while in the head of master I count 12,536 <gloss> elements and 2,512 <emph> elements, so a good number of further typographical and semantic details have been encoded.
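For anyone without BaseX at hand, a comparable count can be approximated with Python's standard library. This is a minimal sketch, not the XQuery that produced the figures above: the function name and the sample document are made up, and it matches elements by local name regardless of namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text, names):
    """Count elements whose local (namespace-stripped) name is in `names`."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if not isinstance(element.tag, str):
            continue  # skip comments or processing instructions, if present
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in names:
            counts[local_name] += 1
    return counts


# Tiny illustrative document, not taken from the lexicon.
SAMPLE = (
    '<TEI xmlns="http://www.crosswire.org/2013/TEIOSIS/namespace">'
    "<entry><sense><gloss>a</gloss>, <gloss>b</gloss> <emph>c</emph></sense></entry>"
    "</TEI>"
)
counts = count_elements(SAMPLE, {"gloss", "emph"})
```

Running the same function over the last valid file and over the head of master would give the two pairs of counts to compare.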

[Attached screenshot: backwards_abba]

jonathanrobie commented 3 years ago

Most of the errors seem to involve:

  1. <seg/> elements where they are not allowed.
  2. <gramGrp/> elements where they are not allowed.
  3. Character data directly inside an <entry/> element, which is element-only.
  4. Some <foreign/> elements where they are not allowed.
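Of the four error classes above, the third (character data directly inside an element-only `<entry/>`) can be located without a schema. The sketch below is one assumed way to do that with Python's standard `xml.etree.ElementTree`; the function name and sample data are hypothetical, and the other three classes would still need a real schema validator.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of abbott-smith.tei.xml.
TEI_NS = "http://www.crosswire.org/2013/TEIOSIS/namespace"


def stray_text_in_entries(root):
    """Yield (n attribute, text) for non-whitespace text directly inside <entry>."""
    for entry in root.iter(f"{{{TEI_NS}}}entry"):
        # Text directly inside an element is its own .text (before the first
        # child) plus the .tail of each child (text following that child).
        chunks = [entry.text] + [child.tail for child in entry]
        for chunk in chunks:
            if chunk and chunk.strip():
                yield entry.get("n"), chunk.strip()


# Illustrative input: the first entry has stray text, the second does not.
SAMPLE = (
    f'<TEI xmlns="{TEI_NS}">'
    '<entry n="x">stray<form>a</form> more<sense>ok</sense></entry>'
    '<entry n="y"><form>b</form></entry>'
    "</TEI>"
)
problems = list(stray_text_in_entries(ET.fromstring(SAMPLE)))
```

Text inside child elements such as `<sense>` is not reported, since only text sitting directly under `<entry>` violates the element-only content model.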

It looks like this restructuring was done without validation.

> The substitution of <emph> for <gloss> is an editorial change

It changes the results of a query that looks for the glosses associated with a sense. That's a common operation. In some cases, this change seems to have taken genuine glosses and demoted them.

Would a Zoom call be helpful?

jonathanrobie commented 3 years ago

@dowens76 Chuck has submitted a pull request that fixes all but 8 validation errors:

https://github.com/translatable-exegetical-tools/Abbott-Smith/pull/103

After we get the last 8 fixed, I recommend adding a Git hook to validate before accepting a commit.
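As a starting point for such a hook, here is a minimal sketch of the checking logic in Python. It is an assumption about how a hook might be built, not an existing script in the repo, and it only tests well-formedness using the standard library. Validating against the TEI/OSIS schema would additionally need a schema-aware tool (for example `xmllint` with its `--schema` option, or a library such as lxml), which is not shown here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_bytes):
    """Return the parser's message if the document is not well-formed, else None."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return str(exc)
    return None


def failing_files(paths):
    """Map each path that fails the well-formedness check to its error message."""
    failures = {}
    for path in paths:
        with open(path, "rb") as handle:
            error = well_formedness_error(handle.read())
        if error is not None:
            failures[path] = error
    return failures
```

A pre-commit script could call `failing_files` on the XML files being committed, print the messages, and exit with a non-zero status whenever the result is non-empty, which causes Git to reject the commit.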

jonathanrobie commented 3 years ago

Chuck has now submitted a pull request that fixes all errors.

If nobody objects before tomorrow, either Chuck or I will accept the pull request. If nobody objects before tomorrow, per email discussion, I will also consider the GitHub validation hook idea to be accepted and start scheming about how to get that done.

destatez commented 3 years ago

Jonathan

You've got it right on as far as I am concerned. I was going to work with uW on hooks for the Unlocked Greek Lexicon, but I wanted a hook that would work off a user's fork and work on only their one change out of the 4K+ elements in the main. It seems that uW can develop hooks that only work against the main and against ALL of its elements. That should be no problem with A-S.

Dave

On Tue, Apr 13, 2021 at 4:32 PM Jonathan Robie @.***> wrote:

Chuck has now submitted a pull request that fixes all errors.

If nobody objects before tomorrow, either Chuck or I will accept the pull request. If nobody objects before tomorrow, per email discussion, I will also consider the GitHub validation hook idea to be accepted and start scheming about how to get that done.


jonathanrobie commented 3 years ago

I just merged Chuck's pull request.

A hook that validates the entire lexicon against TEI P5 would be perfect. Is that something you can take on, or should someone else do that?