usaybia / usaybia-data

Data for interreligious interaction in Near Eastern texts
MIT License

Determine how to link factoids with text #4

Closed nathangibson closed 3 years ago

nathangibson commented 5 years ago

@wsalesky @dlschwartz I'm trying to figure out how best to link factoids to the text passage they relate to. In SPEAR you do this with CTS URNs. But the text I'm dealing with has no standard divisions smaller than biographical entries, which are typically multiple pages. Anything smaller than that I would have to create on a purely arbitrary basis.

I would like to be able to associate factoids with a paragraph or even a sentence. And when the primary source is displayed as a running text, I would like to display the factoids alongside it (as well as displaying the text passage on the factoid detail page, as SPEAR does).

I'm going to suggest something wild that could make sense with my workflow, but feel free to reel me back. The use of the rs element came up recently. What if I were to use that to "highlight" a short passage, give it an xml:id, and then link to that in the factoid bibl (instead of to a CTS URN)? Or, if there should be a two-way link, this could be done with linkGrp/link and @targFunc.

Also, is there any reason the factoids need to be in a separate document from the text itself? Or would it be OK to place them immediately after the relevant paragraph, for ease of maintenance, and label them with @type?

Thanks!

See the code example below or in context at https://github.com/usaybia/usaybia-data/blob/eff4877dd7cc3b35b8c3f657ed098c55b1632eaf/data/texts/tei/iu-sample-kopf-en-tan.xml#L111.


<head><persName ref="https://usaybia.net/person/1">Muhadhdhab al-Dīn ʿAbd al-Raḥīm b. ʿAlī</persName></head>
                  <div n="1" type="par"><anchor xml:id="kopf9" corresp="iu-sample-mueller-ar-tan.xml#mueller9"/><p> was our shaikh, the chief of ministers, the learned and worthy 
                     <persName ref="https://usaybia.net/person/1">Muhadhdhab al-Dīn ʿAbd
                     al-Raḥīm b. ʿAlī b. Hāmid, known as al-Dikhwār</persName>. He was — may Allāh have mercy upon him —
                     unique among his generation, peerless in his time, the most learned scholar of his
                     epoch, the apogee of medical skill and knowledge, both general and specialized. No one
                     matched him in diligence or learning. He tired himself out with work, straining his mind
                     in order to gain knowledge, until he surpassed all contemporary physicians and won more
                     remuneration and honor from kings than any doctor had ever before. <rs  xml:id="rs-1">He was born in
                     <placeName>Damascus</placeName> and brought up there by his father, <persName>ʿAlī b. Hamid</persName>, a famous oculist, whose
                        other <pb n="902" ed="kopf"/> son, <persName ref="https://usaybia.net/person/2">Hamid b. ʿAlī</persName>, took up the same profession.</rs> At first,
                     Muhadhadhab al-Dīn was also an oculist, but at the same time he worked as a copyist. His
                     calligraphy was outstanding. He transcribed many books, of which I have seen a hundred
                     or more, dealing with medicine and other sciences. He worked for the Shaikh Tāj al-Dīn
                     al-Kindī Abū al-Yamān, but constantly endeavored to increase his knowledge by reading
                     and memorizing — a habit he kept up until old age.</p>
                     <div type="factoid" uri="https://usaybia.net/factoid/1-1" resp="#ngibson">
                        <listPerson>
                           <person>
                              <persName ref="https://usaybia.net/person/2">Hamid b. ʿAlī</persName>
                           </person>
                        </listPerson>
                        <bibl type="primary">
                           <ptr target="#rs-1"/>
                        </bibl>
                     </div>
                     <div type="factoid" uri="https://usaybia.net/factoid/1-2" resp="#ngibson">
                        <listEvent>
                           <event>
                              <desc>
                                 <persName ref="https://usaybia.net/person/1"> Muhadhdhab al-Dīn ʿAbd al-Raḥīm b. ʿAlī </persName> was born
                                 in <placeName ref="https://usaybia.net/place/1">Damascus</placeName>.</desc>
                              <ptr target="http://syriaca.org/keyword/birth"/>
                           </event>
                        </listEvent>
                        <bibl type="primary">
                           <ptr target="#rs-1"/>
                        </bibl>
                     </div>
                  </div>
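
The two-way link mentioned above (linkGrp/link with @targFunc) might look roughly like the sketch below. It assumes each factoid div also carries an xml:id; the values `factoid-1-1` and `factoid-1-2` are hypothetical, since the sample only has `uri`. The `type` and `targFunc` values are illustrative names, not an established vocabulary.

```xml
<!-- Sketch only: the factoid xml:id values and the type/targFunc names are assumptions -->
<linkGrp type="factoidSource" targFunc="source factoid">
   <!-- each link pairs the highlighted passage with one factoid drawn from it -->
   <link target="#rs-1 #factoid-1-1"/>
   <link target="#rs-1 #factoid-1-2"/>
</linkGrp>
```

With a group like this, the passage-to-factoid relation lives in one place and can be read in either direction, at the cost of one more structure to keep in sync with the factoids themselves.
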
wsalesky commented 5 years ago

@nathangibson, @dlschwartz is probably the best person to answer this one. It makes sense to me, as does keeping the factoids in the same record as the text, but Dan may have other ideas, as he has done more thinking on the factoid model than I have.

dlschwartz commented 5 years ago

This is a good question and a bit of a sticky one. I've thought about this some because of texts like the letters of Severus and the Lives of John of Ephesus. In neither case do we (yet) use our URN system. Not only do we not have text in the Corpus to point to, but those texts aren't broken down into canonical divisions. In many cases the Letters, and in rare cases the Lives, are really short. In these cases, pointing to the whole thing works fine, just as pointing to the shorter annals of the Chronicles does. However, when it comes to the longer ones, things get a bit more complicated. For that matter, the reference to the first account of the flooding of Edessa is a bit awkward: http://wwwb.library.vanderbilt.edu/exist/apps/srophe/spear/factoid.html?id=http://syriaca.org/spear/8559-12 .

But first things first. A lot of people aren't crazy about the CTS/DTS URN system, mainly because a browser can't resolve it. Correct me if I'm wrong, Winona, but Srophe does what it does with these URNs by converting the Corpus URI-like portion of the URN into the actual URI/URL for the text, and then combining that with the final number in order to treat the whole thing like a link to an anchor in the Corpus text. The connection between the URN and its constituent URIs (both Syriaca and Corpus) makes this system somewhat respectable, but it's not really ideal. Especially since URNs aren't resolvable URIs, I think looking for other options is a fine idea.

The idea you suggest here will frequently work very well. You get the benefit of being able to point to very specific pieces of text. The downside, however, is that you will likely run into problems with the containing structure of the XML, especially when dealing with events. Off the top of my head, I can think of two tricky scenarios when dealing with complicated events. First, I would guess that at some point you'll run into a situation where the source text for event B starts in the middle of the source text for event A and extends beyond it. Second, you might want to source an event to the last sentence of one paragraph and the first sentence of the next. You couldn't actually do that. You would have to wrap both of the paragraphs in a single element. If you go down the road you're suggesting, you'll probably find it works just fine the overwhelming majority of the time. Just know that at some point your interpretive decisions will butt up against and be controlled by the constraints of XML. I would guess that the compromises you would have to make would rarely be that significant, but they could be.
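
One common way around both scenarios is to mark passage boundaries with empty anchor elements instead of a wrapping element, since empty elements cannot collide with the containing structure. The following is only a sketch: the anchor ids are hypothetical, and reading a two-target ptr as "start, end" is an assumed project convention, not something TEI prescribes.

```xml
<!-- Sketch only: anchor ids and the start/end reading of the two targets are assumptions -->
<p>... <anchor xml:id="evA-start"/>First sentence, relevant to event A only.
   <anchor xml:id="evB-start"/>Second sentence, relevant to both events.<anchor xml:id="evA-end"/>
   Third sentence, relevant to event B only.<anchor xml:id="evB-end"/> ...</p>

<!-- in the factoid for event A -->
<bibl type="primary">
   <!-- assumed convention: first target = start of passage, second = end -->
   <ptr target="#evA-start #evA-end"/>
</bibl>
```

An end anchor can just as well sit in the following paragraph, which covers the cross-paragraph case. The trade-off is that display code has to reconstruct the text between the two anchors rather than simply fetching one element.
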

An alternative would be to use the paragraph numbers to point to the text. This isn't as precise, of course. Also, it solves the former containing problem but leaves you in the same boat regarding the latter (you would still have to point to two whole paragraphs). I've been thinking along these lines for when we actually get the Letters and the Lives into the Corpus. This would essentially turn the paragraph breaks in the printed editions into a canonical division of the text and establish URNs based on those numbers.

Technically, you can store factoids either in the same file or in a separate file. You could use @xml:id attributes to source factoids either way. I obviously don't have this option since factoids aren't part of the Corpus data model. Any other use of the text you produce would just ignore or strip out the factoids. We'll have to talk about this in more detail, but I've been thinking quite a bit about factoids lately and how they actually work. SPEAR TEI models factoids, not persons or sources. The factoid is, in the lingo of the TEI, "stand-off markup" (as is all Syriaca.org data actually). This suggests that perhaps storing them separately from the text is appropriate. That said, I kind of want you to store them in the same file. It might be good for the article we are planning to be able to show different use cases and different workflows. We'll have to discuss this further.

Storing the factoids in the text would complicate multi-lingual display of sources on the factoid page. We've now got a TEI encoding of the Chronicle even though it's not available anywhere yet. When it is, I hope to be able to pull both English and Syriac onto the factoid page. If you wanted similar functionality while storing factoids in the text, you would need to duplicate your elements in both texts. That might get a bit tricky to maintain. Also, you would need a second pointer on each factoid, one pointing to a local element and another pointing to an external one.

Well, I didn't expect to write such a long email. I hope this is helpful. Mull this over, Nathan, and perhaps we can chat about these or other issues in person.

dlschwartz commented 5 years ago

By the way, Georg Vogeler strongly recommended that I add a @type="factoid" to my div elements, which I notice you've done, Nathan. It's clear to us internally what these things are, but we need to be more explicit for users. I'll be doing this for SPEAR shortly. I'm still trying to think through whether I should add @subtype attributes for "nameVariant", "birthDate", "gender", "event", "relation". Perhaps you and I can discuss this, Nathan.

nathangibson commented 5 years ago

@dlschwartz Thanks so much for these thoughts. I wonder if we should link to paragraphs as sources rather than to rs elements. Although it's not as granular, the workload may be more realistic. We need to align English and Arabic paragraphs anyway. The paragraph divisions are ones we decide arbitrarily; they are not in any sense canonical. So although we could assign them CTS/DTS URNs, I'm not sure I would really see any point in that.

For aligning English and Arabic, we have to decide whether to simply use anchors or whether to use the more elaborate TAN system. @wsalesky finds having a div per paragraph overly verbose (see https://github.com/usaybia/srophe-eXist-app/issues/1#issuecomment-488687378). If we were using TAN, creating the CTS/DTS URNs would not be difficult. But I would still wonder, given our limited resources, why not simply give each paragraph (div or p) an xml:id and link to that? See https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/iu-sample-kopf-en-tan.xml#L96 as an example.
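
As a rough illustration of the simpler option, a paragraph could carry its own xml:id and a factoid could point at it directly. The ids `kopf-p9` and `mueller-p9` are hypothetical, and the relative file path assumes the factoid document sits in the same directory as the text.

```xml
<!-- English text (sketch; the ids "kopf-p9" and "mueller-p9" are hypothetical) -->
<p xml:id="kopf-p9" corresp="iu-sample-mueller-ar-tan.xml#mueller-p9">was our shaikh, the chief of ministers ...</p>

<!-- source reference for a factoid stored in the same file -->
<bibl type="primary">
   <ptr target="#kopf-p9"/>
</bibl>

<!-- source reference for a factoid stored in a separate file (path is an assumption) -->
<bibl type="primary">
   <ptr target="iu-sample-kopf-en-tan.xml#kopf-p9"/>
</bibl>
```

Because the pointer is an ordinary URI reference with a fragment, the same factoid markup works whether the factoids end up inside the text file or outside it.
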

In any case, attaching the factoid to the paragraph rather than a smaller text chunk would make it easy to display with either language, since the paragraphs will be aligned. It would also be unlikely that an event would span more than one paragraph, but when it does we could link to more than one paragraph (rather than having to use multiple rs elements linked together with @prev and @next). I would think we would want to tag names, etc. in both English and Arabic (so that we can make them into links and grab spellings), but maintain factoids only in the Arabic text. What do you think, @wsalesky?

Placing the factoid immediately in/after the relevant paragraph might be an easier workflow than maintaining it in a separate doc. But I could see doing either one.

One thing I'm envisioning is that we could pre-populate factoids if we adequately tag the text. In the example paragraph I gave, a script could take the persName and placeName elements and create factoids at the end of that paragraph for name variants and events. We would just have to fill in the missing info instead of adding the entire factoid div.
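
To make the idea concrete, a generated stub for the example paragraph might look something like the following. This is a sketch of possible script output, not an existing tool: the `uri` value and the TODO comments are hypothetical, and the names are simply copied from the persName and placeName tags in the paragraph.

```xml
<!-- Sketch of a generated stub; the uri value and TODO markers are hypothetical -->
<div type="factoid" uri="https://usaybia.net/factoid/1-3">
   <listEvent>
      <event>
         <desc>
            <!-- copied from the tagged paragraph by the script -->
            <persName ref="https://usaybia.net/person/2">Hamid b. ʿAlī</persName>
            <!-- TODO (editor): describe the event -->
            <placeName>Damascus</placeName>
         </desc>
         <!-- TODO (editor): add a ptr to the relevant event keyword -->
      </event>
   </listEvent>
   <bibl type="primary">
      <ptr target="#rs-1"/>
   </bibl>
</div>
```

The editor would then only complete or delete stubs rather than typing the whole structure each time.
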

In this use case inside a source text, a @type='factoid' would be important to distinguish factoid divs from regular text divs. It could make sense to even put factoids inside note elements to make it really clear they're not part of the source text, but unfortunately notes can't contain divs.

nathangibson commented 5 years ago

PS @subtype could be helpful but not essential--yes, we can discuss when we meet.

nathangibson commented 4 years ago

Note to self: In the interim we've decided to put factoids in separate docs, one per biographical unit (e.g., 14.21). However, now that we use ab instead of div for factoids, it is more feasible to incorporate them into the text. We need to discuss.
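
For reference, an in-text factoid encoded with ab might look roughly like this. It is a sketch under assumptions: the paragraph id `kopf-p9` is hypothetical, and whether the project schema permits this exact arrangement has not been checked here.

```xml
<!-- Sketch only: the paragraph id is hypothetical and schema validity is assumed -->
<div n="1" type="par">
   <p xml:id="kopf-p9">... He was born in <placeName>Damascus</placeName> ...</p>
   <!-- ab is paragraph-level, so it can follow the p it annotates -->
   <ab type="factoid" uri="https://usaybia.net/factoid/1-2" resp="#ngibson">
      <listEvent>
         <!-- event encoded as in the div-based version -->
      </listEvent>
      <bibl type="primary">
         <ptr target="#kopf-p9"/>
      </bibl>
   </ab>
</div>
```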