gucorpling / gitdox

Repository for GitDOX, a GitHub Data-storage Online XML editor
Apache License 2.0

TEI conversion for scriptorium instance #54

Closed ctschroeder closed 5 years ago

ctschroeder commented 6 years ago

I kept finding more things, so I created an issue. I think these are all reasonable and doable. Let me know if there are any problems. I will try to have the Johannes 1-doc corpus ready for you to try today. Maybe also Dirt.

I have been playing around with the following changes by modifying existing docs and seeing if my changes validate. So far, they all should validate.

  1. CTS URNs: add document_cts_urn in document metadata as ref attribute:

<title>Apophthegmata Patrum Sahidic 6: anonymous Nau 196</title>

becomes

<title ref="urn:cts:copticLit:ap.6.monbeg">Apophthegmata Patrum Sahidic 6: anonymous Nau 196</title>

  2. lemma converts from lemma (not norm)

  3. add translation to respStmt after annotation, before source:

    <respStmt> <resp>annotation</resp> <name>Christine Luckritz Marquis, Caroline T. Schroeder</name> </respStmt>
    <respStmt> <resp>translation</resp> <name>Christine Luckritz Marquis</name> </respStmt>

  4. chapter_n, verse_n, vid_n added as divs:

    <div type="chapter" n="1">
    <div type="verse" n="1">
    <div type="vid" n="urn:cts:copticLit:johannes.canons.monbfa:1.1">

  NOTE: in Sahidica, the verse layer is named "verse"; in our meeting we discussed new document versification getting the layer name "verse_n". Please let @ctschroeder know if the chapter_n, verse_n, vid_n annotation layer names need to change. It looks like the current scriptorium-flavored TEI converter in GitDox converts the "verse" layer to <div n="1">.

  5. p must nest inside the div tags listed above in 4. This is a TEI converter issue (p must nest) and a GitDox validation issue (span of p = verse length if verses exist).
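As an illustration of the nesting requested in items 4 and 5, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `build_verse` and the placeholder text are hypothetical; this is not the GitDox converter, only a picture of the target structure (chapter div, then verse div, then vid div, with the p innermost).

```python
import xml.etree.ElementTree as ET


def build_verse(chapter_n, verse_n, vid, text):
    """Build chapter > verse > vid divs with the paragraph nested innermost.

    Hypothetical helper for illustration only; not part of GitDox.
    """
    chapter = ET.Element("div", {"type": "chapter", "n": chapter_n})
    verse = ET.SubElement(chapter, "div", {"type": "verse", "n": verse_n})
    vid_div = ET.SubElement(verse, "div", {"type": "vid", "n": vid})
    # The p element sits inside all three divs, so phr and s elements
    # placed in it later would still have a p ancestor.
    paragraph = ET.SubElement(vid_div, "p")
    paragraph.text = text
    return chapter


chapter = build_verse(
    "1",
    "1",
    "urn:cts:copticLit:johannes.canons.monbfa:1.1",
    "placeholder verse text",  # assumption: real content would be tokens, phr, s, etc.
)
print(ET.tostring(chapter, encoding="unicode"))
```

Whether the p should sit inside the vid div or directly inside the verse div is an assumption here; the thread only requires that p nest inside the divs.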

ctschroeder commented 6 years ago
  6. Version information is not converting. I think the problem is that the converter expects version_n and version_date, but GitDox uses version@n and version@date as metadata field names. See this output from Apa Johannes:
    <revisionDesc>
    <change n="%%version_n%%" when="%%version_date%%"></change>
    </revisionDesc>
ctschroeder commented 6 years ago
  7. source is not outputting in the respStmt (see previous bible files for examples)

  8. license is not correctly converting; Apa Johannes has a particular license (CC BY-SA 3.0) in a license field, but the license in the TEI export is CC BY 4.0

amir-zeldes commented 6 years ago

OK, lots of things here:

  1. This is trivial, but I should point out ref's data type should be URI, and I'm not sure if cts:urn is an XML URI. We can either prefix the resolver's domain so it's a URL (which is a URI), or just not worry about this. I'd be for the latter option.
  2. Already done.
  3. Easy to do
  4. Not quite trivial, since chapter is a document level metadatum and not a span annotation (at least in Sahidica). If chapter is a span, it can be done right now. If not, the current architecture does not support this.
  5. Sounds like this can be done fine, but I'm not sure why we want both verse divs and paragraphs if they're always the same. It's not a technical problem, but what is the benefit?
  6. I would be against using "@" in metadata names in GitDox. I can either auto-convert them to "_" or we can just not use them.
  7. We may need a separate stylesheet for Bible, if the source metadatum behaves differently.
  8. This is a problem - the license field currently contains something other than what we want to appear in the TEI document. I remember we hard wired this in Excel at some point, but GitDox has no such facility. I would like to reopen the discussion on what this metadatum says and how it relates to what the TEI should say...
amir-zeldes commented 6 years ago

OK, 1., 2. and 3. are now done. Note that if there is no translation etc., and in general to avoid embarrassing %%xyz%% in the output, missing metadata is now replaced with 'none'. Does that work?
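The placeholder behaviour described here can be sketched roughly as follows. This is a minimal illustration, not GitDox's actual code: the function name `fill_template`, the regular expression, and the sample version value are assumptions. It fills `%%name%%` slots from a metadata dictionary, falls back to 'none' for missing fields, and treats "@" in field names as "_" (the auto-conversion option raised in the previous comment).

```python
import re

# Matches placeholders such as %%version_n%% (assumed syntax, based on the
# unfilled output quoted earlier in this thread).
PLACEHOLDER = re.compile(r"%%(\w+)%%")


def fill_template(template, metadata, default="none"):
    """Replace %%key%% placeholders with metadata values.

    Keys like "version@n" are normalized to "version_n" before lookup;
    placeholders with no matching metadata become `default`.
    Note: values are inserted literally, without XML escaping.
    """
    normalized = {key.replace("@", "_"): str(value) for key, value in metadata.items()}
    return PLACEHOLDER.sub(lambda match: normalized.get(match.group(1), default), template)


template = '<change n="%%version_n%%" when="%%version_date%%"></change>'
# "1.0" is a made-up version value; version@date is deliberately missing.
filled = fill_template(template, {"version@n": "1.0"})
print(filled)  # <change n="1.0" when="none"></change>
```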

ctschroeder commented 6 years ago

Thanks for all this!! Comments:

  1. I think this solution is something another project uses. It came up when I was meeting with Matt Munson here in Leipzig. A URN is basically a URI so yes, I vote for not worrying. ...
  2. I'm not fully understanding the problem. We are introducing chapters as spans (layer name chapter_n). Yes bible corpora are split up into documents by chapter, but that is a matter of convenience not data architecture. Is the way the johannes doc is structured a prob?
  3. both verse divs and paragraphs: I played around with the TEI a LOT. I spent over an hour on this. Basically, it comes down to TEI validation. We can't have phr and s elements without them nested inside a p. If we get rid of the p's, all the phr's become invalid. If you have a better solution, please suggest. I originally was going to post "don't convert p's" but then saw that we needed them. (Plus don't you use the p's in visualizations?)
  4. against using "@" in metadata names in GitDox: yes this is fine not to use @'s in Gitdox; I have never understood why these 2 metadata field names have an @ to begin with. The @ is currently hardwired into GitDox (in the dropdown menu). I'm fine with _ instead of @, but it means we need to change them all, right?
  5. source metadatum: source is not outputting anywhere for any corpus, I think; I mentioned Bible as an example, but we also use source metadata for David Brakke's material, and we will use it for Diliana's and Alin's. So if we can add it to the converter as something not required but produced if it exists...? We definitely don't need a separate converter just for bible; we will have "source" for other corpora.
  6. license: sure we can reopen the conversation. Can we not just convert what is in the metadatum field though? We have at least two corpora with different licenses (Bible, johannes). I think I am not understanding the challenge with this one.
amir-zeldes commented 6 years ago

OK, I auto replaced all @ in metadata with _

So that leaves:

  1. divs: this turns out to be a problem, since you want multiple elements with the same name and the same token span. If I tell GitDox that chapter_n maps to div@n and that verse also maps to div@n, and they have the same span, then the exporter collapses them inside the document body. There's no easy way around this either; collapsing is often really the desired result... I think TEI actually has the option of numbered divisions (<div1>, <div2>, <div3>) - would that be OK for you here? Also, there's no mechanism to auto-generate the 'type' attribute of the divs you want: currently each annotation spawns either an element, an attribute or both, and the attribute value always comes from the annotation itself.
  2. Adding source is fine, as long as it can be 'none' when it's missing, and the value is identical to the contents of the metadatum. I just added it to the stylesheet, so if this is fine we can cross this out too.
  3. License: the problem is that I can only inject the value literally in some position. To generate a hyperlink of the form <licence target="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</licence>, I need to insert two variable parts: the hyperlink and the literal name of the license. However, the actual value in the data model is as in ANNIS: <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a>. So moving from one to the other is not currently possible...
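For illustration, the transformation described in the license item (splitting the stored anchor into a link and a name) could look like the sketch below. It is not something the GitDox exporter did at this point in the thread; the function name `anchor_to_licence` is hypothetical, and it assumes the stored metadata value is a single well-formed `<a>` element.

```python
import xml.etree.ElementTree as ET


def anchor_to_licence(anchor_markup):
    """Convert '<a href="URL">NAME</a>' into '<licence target="URL">NAME</licence>'.

    Hypothetical helper; assumes the input is one well-formed <a> element.
    """
    anchor = ET.fromstring(anchor_markup)
    licence = ET.Element("licence", {"target": anchor.get("href", "")})
    licence.text = anchor.text
    return ET.tostring(licence, encoding="unicode")


stored = '<a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a>'
converted = anchor_to_licence(stored)
print(converted)
```

A value that is not well-formed XML (for example, plain text such as "CC BY-SA 3.0") would raise `xml.etree.ElementTree.ParseError` here, so a real converter would need a fallback.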
amir-zeldes commented 5 years ago

Fixed values are now supported and the addition of chapter+verse to the schema makes valid TEI generation with divs possible.