morphgnt / sblgnt

morphological tagging of the SBL Greek New Testament

Make sblgnt-morph data XML-friendly #47

Closed: Arithmeticus closed this pull request 8 years ago

Arithmeticus commented 8 years ago

I've created a workflow that converts the morphological data in its current form to the TAN format (see http://textalign.net). The most human-intensive step is converting the plain-text files to XML. I propose a simple change to the current data: wrap each file in an XML shell (prolog + one root element). Obviously, the data has been written to be processed by non-XML tools. But the change I propose would open up the data to XML applications without compromising the current workflow, provided that existing tools were adjusted to modify or drop the first two lines and the last line.
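As a rough illustration of what "drop the first two lines and the last line" could look like on the consuming side, here is a minimal Python sketch. The function name `strip_xml_shell`, the root element name `morph`, and the placeholder rows are assumptions made for the example; none of them come from the pull request itself.

```python
def strip_xml_shell(lines):
    """Return the data rows of a file, without the proposed XML shell.

    Assumes the shell is exactly: an XML prolog on line 1, the opening
    root tag on line 2, and the closing root tag on the last line.
    Files without a prolog are returned unchanged.
    """
    if len(lines) >= 3 and lines[0].lstrip().startswith("<?xml"):
        return lines[2:-1]
    return lines


# Placeholder rows stand in for real data lines (hypothetical content).
wrapped = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<morph>",
    "row one",
    "row two",
    "</morph>",
]
data_rows = strip_xml_shell(wrapped)
```

A check on the prolog rather than an unconditional slice means the same tool would keep working on files that have not been wrapped.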

In this pull request I have also included the stylesheet I have written that transforms the data to XML. The fruits of this labor can be seen at https://github.com/Arithmeticus/TAN-bible/tree/master/TAN-LM.

Note that the stylesheet converts the CCAT-inspired schema to the Perseus one, because the sblgnt-morph project notes that the current scheme will be deprecated. TAN includes a format (TAN-mor) that allows people to define their own morphological rules and codes. See https://github.com/Arithmeticus/TAN-class-3/tree/master/TAN-R-mor for the latest definition of the Perseus tagging scheme, and a few other examples.

jtauber commented 8 years ago

I think a much better way of handling the XML wrapping is to just pre-process the files. Not sure why it would be human-intensive; a few lines of code would do it. I think this code (both the pre-processing and the transform) would better live in a separate repository, although you've gotten me keen to look into TAN more.
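For reference, a pre-processing step of the kind described here might look roughly like the following Python sketch. It is not code from either repository: the function name `wrap_in_xml_shell` and the default root element `morph` are hypothetical, and escaping the text is a precaution added for the example rather than something the thread discusses.

```python
from xml.sax.saxutils import escape


def wrap_in_xml_shell(text, root="morph"):
    """Wrap plain text in an XML prolog plus a single root element.

    `root` is a hypothetical element name chosen for this sketch.
    """
    # Escape &, < and > so the result is well-formed whatever the text holds.
    body = escape(text)
    if not body.endswith("\n"):
        body += "\n"
    return f'<?xml version="1.0" encoding="UTF-8"?>\n<{root}>\n{body}</{root}>\n'


# Placeholder text stands in for the contents of one data file.
xml_text = wrap_in_xml_shell("row one\nrow two")
```

Doing the wrapping at read time leaves the plain-text files untouched, so tools that expect the existing format need no changes.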

Arithmeticus commented 8 years ago

Thanks, James.

I found out after I posted this that there is a way to pre-process the raw files in XSLT 3.0. I'll probably take that route in any future revision.

When you're ready to revise the morphological codes and taxonomy, I'd appreciate hearing from you off list.

Best wishes,

jk

Joel Kalvesmaki kalvesmaki.com