DTS endpoint generation

blms commented 2 years ago

In this PR

Change UTF8 to utf-8 in all manuscript TEI (needed for simple-tei2dtsflat)
Scripts for generating a DTSFlat file structure out of the lemma and manuscript texts (generateDtsData.js), and instructions for serving it with Nginx

Questions

Does it make sense to have generateDtsData.js as a separate script from generateAllData.js, or should it all be part of generateAllData.js since we want to run both every time the lemma is updated?
In generateLemmaTei.js, there's a section dealing with opening and closing tags that I'm not sure about and would especially love feedback on. (I'm not even sure the overall approach of createSectionTei/createNodeTei is robust, so I would greatly appreciate any and all feedback.) https://github.com/performant-software/chronicleME/blob/b2bbe3ea92da3d5c2a9486b44b034fff82eb6a0a/script/generateLemmaTei.js#L284-L292
My nginx skill isn't terribly strong, so I'm not sure if there's a better way to do this—for example, would it be possible to just serve the DTS endpoints from http://hostname/dts/ instead of http://hostname:3333/dts/ when there's already another service running on :80? I think this would be better but since they both seem to need a root directive, I'm not sure how to do it. The full config is on the server at /etc/nginx/sites-available/default.

NickLaiacona commented 2 years ago

Regarding nginx, I think the website itself is also served on nginx, so if you could use the same config for both the main site and the DTS path, then you wouldn't need to rewrite the port or serve it on a separate port, if I understand what you are looking at here.

Regarding the lemma generation code, there is an issue here because the graph db does not have the same validation requirements as XML. It is possible the annotations could overlap at the phrase level, with start and end tags crossing the bounds of the parent tag. There are ways to deal with this. One way, which might work best in this case, is to use milestone tags to mark the beginning and ending of annotations. Since milestone elements have no inner XML, they can't overlap. Another way is to compute new elements that don't overlap, so the number of elements you generate may be greater than the number of annotations. We might ask Tara what she thinks before going further on this.

blms commented 2 years ago

@NickLaiacona thanks for taking a look!

Regarding nginx, I think the website itself is also served on nginx, so if you could use the same config for both the main site and the DTS path, then you wouldn't need to rewrite the port or serve it on a separate port, if I understand what you are looking at here.

I didn't realize one could so easily redefine the root in a location block! I should have just looked this up. That part's all set now: http://157.245.255.111/ works as normal, and DTS paths like http://157.245.255.111/dts/navigation/?id=lemma and http://157.245.255.111/dts/documents/?id=lemma&ref=section_1019321 are still working as before, but now on port 80. I've also included a sample nginx configuration to simplify the readme.

Regarding the lemma generation code, there is an issue here because the graph db does not have the same validation requirements as XML. It is possible the annotations could overlap at the phrase level, with start and end tags crossing the bounds of the parent tag. There are ways to deal with this. One way, which might work best in this case, is to use milestone tags to mark the beginning and ending of annotations. Since milestone elements have no inner XML, they can't overlap. Another way is to compute new elements that don't overlap, so the number of elements you generate may be greater than the number of annotations. We might ask Tara what she thinks before going further on this.

That sounds good. Could you give me a quick example of what you're thinking with the milestone? I'm not sure I can envision it.

I think your second suggestion is close to what I was working on, but I think I don't handle all cases and the framing of "elements that don't overlap" could help me simplify.
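To make the "elements that don't overlap" framing concrete, here is a minimal sketch, not the code in generateLemmaTei.js. It assumes each annotation is a hypothetical `{ id, start, end }` object with `[start, end)` character offsets into the plain text, and that all offsets fall within the text. The idea is to cut the text at every annotation boundary, so each resulting segment carries a fixed set of annotations and could be wrapped in its own element.

```javascript
// Hypothetical input shape: each annotation covers [start, end) in character
// offsets of the plain text. The real graph data may be keyed differently.
function splitIntoSegments(textLength, annotations) {
  // Collect every offset where the set of active annotations can change.
  const bounds = new Set([0, textLength]);
  for (const a of annotations) {
    bounds.add(a.start);
    bounds.add(a.end);
  }
  const sorted = [...bounds].sort((x, y) => x - y);

  const segments = [];
  for (let i = 0; i < sorted.length - 1; i++) {
    const start = sorted[i];
    const end = sorted[i + 1];
    // An annotation applies to a segment only if it covers the whole segment.
    const ids = annotations
      .filter((a) => a.start <= start && a.end >= end)
      .map((a) => a.id);
    segments.push({ start, end, ids });
  }
  return segments;
}

// Two overlapping annotations produce three non-overlapping segments.
const segments = splitIntoSegments(10, [
  { id: "a", start: 0, end: 6 },
  { id: "b", start: 4, end: 10 },
]);
```

With two overlapping annotations the result has three segments, which is why the number of generated elements can exceed the number of annotations.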

Yes, I'll be curious what Tara thinks! We'll have to bring it up tomorrow.

NickLaiacona commented 2 years ago

Regarding the milestone approach, the first sentence in year 401 might look like: Իսկ ընդ աւուրսն ընդ այնոսիկ և ի ամին ՆԱ եղև սով սաստիկ ի բազում տեղիս. բայց յաշխարհն հարաւոյ ի <milestone unit="start" ana="URI"/>երկրին<milestone unit="end" ana="URI"/> տաճկաց եղև նեղութիւն մեծ, և առաւել քան զամենայն ի Միջագետս.

blms commented 8 months ago

@tla Glad to hear you're in touch with Nick! Just had a chance to take another look at this. I can't remember why I left it in a draft state, but my best guess is that it had to do with the overlapping annotations issue Nick mentioned above.

As far as I can tell, we never encountered that issue with the existing data (manuscripts and lemma edition). But maybe there are more texts where annotations do behave in that way, in which case we'd have to make the decision about how to represent them in TEI.

Looks like we have the DTS navigation and document endpoints up and running on a development server now. Here are some example queries:

http://157.245.255.111/dts/navigation/?id=lemma
http://157.245.255.111/dts/documents/?id=lemma
http://157.245.255.111/dts/documents/?id=lemma&ref=section_1019321
http://157.245.255.111/dts/navigation/?id=M6605
http://157.245.255.111/dts/documents/?id=M6605&ref=M6605_pb210
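For anyone scripting against these endpoints, the query URLs can be assembled along these lines. This is only a sketch: the helper name `dtsUrl` is hypothetical, the base URL is the development server from the examples, and nothing here makes a network request.

```javascript
// Base URL taken from the example queries; the helper name is hypothetical.
const DTS_BASE = "http://157.245.255.111/dts/";

// endpoint: "navigation" or "documents"; params: e.g. { id, ref }.
function dtsUrl(endpoint, params) {
  // Resolve the endpoint relative to the base, keeping the trailing slash.
  const url = new URL(`${endpoint}/`, DTS_BASE);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

const navigationUrl = dtsUrl("navigation", { id: "lemma" });
const sectionUrl = dtsUrl("documents", { id: "lemma", ref: "section_1019321" });
```

The resulting strings can be passed to `fetch` or pasted into a browser.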

If you're happy with the TEI here, and the DTS endpoints are responding as expected, we might just be good to merge. We could also meet to discuss further if you'd like.

tla commented 8 months ago

Hi Ben, thanks for getting back to this! I would prefer the standoff / milestone approach to adding the annotations, as it is entirely possible that they could overlap (even if none of them do so far). This is also because, in the software on our side, an annotation scheme can be entirely user-defined, and I'd like this code base to be usable for editions of other Stemmarest texts with minimal adaptations.

What I would propose is something like this (using the example above): Իսկ ընդ աւուրսն ընդ այնոսիկ և ի ամին ՆԱ եղև սով սաստիկ ի բազում տեղիս. բայց յաշխարհն հարաւոյ ի <milestone type="comment" unit="start" ana="URI"/>երկրին<milestone type="comment" unit="end" ana="URI"/> տաճկաց եղև նեղութիւն մեծ, և առաւել քան զամենայն ի Միջագետս.

where the "type" attribute is set to the annotation type (e.g. "comment", "translation"), and the "unit" attribute is set to the label for the link between the annotation and the reading (e.g. "start", "end").
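A rough sketch of how such milestones could be emitted, assuming a hypothetical annotation shape of `{ type, ana, start, end }` with `[start, end)` offsets into the plain text (the actual Stemmarest data links annotations to readings rather than to character offsets, so a real implementation would need a translation step first). Because each milestone is an empty element, overlapping annotations still yield well-formed XML.

```javascript
// Escape characters that are unsafe in XML text and attribute values.
function escapeXml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical annotation shape: { type, ana, start, end }, with [start, end)
// offsets into `text` counted in UTF-16 code units.
function addMilestones(text, annotations) {
  const marks = [];
  for (const a of annotations) {
    marks.push({ pos: a.start, unit: "start", type: a.type, ana: a.ana });
    marks.push({ pos: a.end, unit: "end", type: a.type, ana: a.ana });
  }
  // Sort by position; at the same position, emit "end" marks before "start".
  const unitOrder = { end: 0, start: 1 };
  marks.sort((x, y) => x.pos - y.pos || unitOrder[x.unit] - unitOrder[y.unit]);

  let out = "";
  let cursor = 0;
  for (const m of marks) {
    out += escapeXml(text.slice(cursor, m.pos));
    out += `<milestone type="${escapeXml(m.type)}" unit="${m.unit}" ana="${escapeXml(m.ana)}"/>`;
    cursor = m.pos;
  }
  return out + escapeXml(text.slice(cursor));
}

// Shortened version of the example sentence; "URI" is a placeholder.
const text = "ի երկրին տաճկաց";
const word = "երկրին";
const start = text.indexOf(word);
const tagged = addMilestones(text, [
  { type: "comment", ana: "URI", start, end: start + word.length },
]);
```

Sorting "end" before "start" at the same offset is a design choice that keeps adjacent annotations from appearing to nest; other orderings would also be valid.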

This leaves the question of what to do about the person / place / date references. I feel a little squeamish about my ad-hoc nomenclature, but it indeed wouldn't make much sense to ignore the mapping between the tagging we did and the corresponding TEI tags. Maybe these could be lightly special-cased in the configuration file, with a mapping like

personref -> persName
person -> person
placeref -> placeName
place -> place
date -> date

and then the Stemmarest annotation labels could be updated to match their TEI equivalents whenever necessary.
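In a JavaScript configuration, that special-casing might look roughly like the following. The object name, the lookup helper, and the fallback behaviour are assumptions for illustration; only the five label pairs come from the list above.

```javascript
// Mapping as proposed in the discussion; where it would live in the
// configuration file, and the names used here, are assumptions.
const ANNOTATION_TO_TEI = {
  personref: "persName",
  person: "person",
  placeref: "placeName",
  place: "place",
  date: "date",
};

// Returns the TEI element name for a Stemmarest annotation label, or null
// when the label is not special-cased (e.g. a user-defined annotation type
// that would be represented with milestones instead).
function teiElementFor(label) {
  return Object.prototype.hasOwnProperty.call(ANNOTATION_TO_TEI, label)
    ? ANNOTATION_TO_TEI[label]
    : null;
}

const mapped = teiElementFor("placeref");
const unmapped = teiElementFor("comment");
```

Keeping the table in configuration rather than in the generation code would let other Stemmarest editions supply their own labels without touching the scripts.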

The date references are even trickier though, and we might have to think about how custom elements are expressed. We have two distinct types of references: the dateref and the dating. The dateref marks where a date is given in the text (basically this is the TEI <date> tag), and the dating associates the content of a text passage with the date on which it occurred. The dating also has an attribute saying whether it is an internal dating (i.e. we are going by the date given in the text and marked with a dateref) or an external dating (i.e. some other scholar says the thing happened on this date). Both of these sorts of references point to date objects, which have the usual notBefore / notAfter specifications. As far as I know the dating concept doesn't exist as such in TEI.

I'd be happy to hear thoughts about how to handle that last case!

blms commented 7 months ago

@tla Thanks for the detailed response, that all makes sense to me!

Looks like Stemmarest is down, or maybe the URL has changed? I'm not able to access it at https://api.editions.byzantini.st/ChronicleME/stemmarest.

As for the dating concept, would the event tag be applicable?

performant-software / chronicleME

DTS endpoint generation #129
