usaybia / srophe-eXist-app

eXist code for Syriaca.org: The Syriac Reference Portal
GNU General Public License v3.0

Research text display options #1

Closed nathangibson closed 4 years ago

nathangibson commented 5 years ago

As part of the usaybia app, I will want to display the entire text I'm working on (500+ pages) and link the factoids to this text. Can you please check into some options for doing this? Some possibilities:

Here are some features I'm interested in:

  1. Displaying multiple versions of the text as synchronized columns/views. E.g., Arabic edition, Arabic manuscript image (using IIIF), English translation.
  2. Highlighting line on manuscript corresponding to line in text edition. (I'm currently using Transkribus to align line transcriptions with images.)
  3. Displaying linked info (tagged people/places, factoids) as sidebar and/or linked text.
  4. Displaying multiple sets of page numbers (for different editions, etc.)

I'm sure I'll think of more things later :-) I believe both TEI Publisher and EVT do 1 & 2. And Corpus app does 3.

Here are some files to test (expect irregular encoding and errors!):

Arabic edition: https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU.xml

English translation (part): https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-kopf-3.xml

One page of transcription aligned line-by-line to image (starting at line 23): https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-sample-alignment.xml

Image it's aligned to: https://dbis-thure.uibk.ac.at/f/Get?id=ZYFQMLOKBOSYTVWOPFLRGZWY&fileType=view

Let me know what you think. Thanks!

wsalesky commented 5 years ago

@nathangibson Do you have page images available for me to link to? Never mind, I see you linked to an image!

nathangibson commented 5 years ago

OK, in case it's helpful, right now I'm using Transkribus to do the transcription. (Not sure whether I'll stay with it, though, so don't put too much effort into Transkribus-specific stuff.) Its REST API is described here. The upshot is you can get the links for all pages in the doc using this: https://transkribus.eu/TrpServer/rest/collections/30046/117719/fulldoc.xml. That URL is constructed from the Project ID 30046, where I expect all docs to go, and the doc ID 117719. You can also get links there to XML docs with page transcriptions (e.g., https://dbis-thure.uibk.ac.at/f/Get?id=SKXAWHCMHOBXFIFUWPVRBQKV), but I'm not sure whether there's a way to get those in TEI other than exporting them with the desktop app. Let me know if you want an export of the transcription of the entire doc (even though only 1 page is transcribed).
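For anyone scripting against this, here is a minimal sketch of how that URL is put together. It only reproduces the pattern visible in the one URL above; the function name and the `collection_id` parameter name are my own (the comment calls 30046 the Project ID, while the URL path calls it a collection), and nothing here is checked against the Transkribus API docs.

```python
# Sketch only: rebuilds the fulldoc URL pattern quoted in the comment above.
# BASE and the path layout are copied from that single example URL.
BASE = "https://transkribus.eu/TrpServer/rest"


def fulldoc_url(collection_id: int, doc_id: int) -> str:
    """Return the URL that lists all pages of one Transkribus document."""
    return f"{BASE}/collections/{collection_id}/{doc_id}/fulldoc.xml"


# IDs taken from the comment: Project ID 30046, doc ID 117719.
url = fulldoc_url(30046, 117719)
print(url)
```

No request is made here; fetching the document would presumably also need authentication.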

wsalesky commented 5 years ago

@nathangibson How are you going to be linking the pieces together? I don't see it in the sample records, but maybe I'm missing something obvious. Right now I'm leaning toward using TEI-Publisher since it will not require installing any additional software on the server and it has most of the functionality you are looking for. It may need some tweaking though.

nathangibson commented 5 years ago

I think TEI Publisher is also good from the perspective of being well supported by the community.

Good question on the linking. I'm still figuring that out and am open to suggestions. There are several things to link, so it depends on what you mean.

1. Links between images and transcribed text: This is using the transcription file. Looking at it again, the IU-sample-alignment.xml was in PcGts format whereas I intended to give you a TEI doc. I've updated the doc now. (It only has 2 pages, though.) I've put the image files corresponding to /TEI/facsimile/surface/graphic/@url here https://github.com/usaybia/usaybia-data/tree/master/data/texts/img

2. Links between versions of the text: I think I will do this with pb/@edRef. I've inserted example code at https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-kopf-3.xml#L10189 and https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-kopf-3.xml#L10237 and lines 22599 and 22627 of https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU.xml. Let me know if you need smaller files!

3. Linking entities within text: I plan to use the same format we use for Syriaca (persName/@ref, etc.).

4. Linking factoids to text: Not sure yet!
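To make option 2 concrete, here is a rough sketch of reading the `pb/@edRef` links out of a TEI file. The sample document, the `#mueller-p10` style targets, and the `page_links` helper are all made up for illustration; the real files may use different values.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature of a translation file with page breaks that point
# at pages of another edition via @edRef.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <p><pb n="1" edRef="#mueller-p10"/>First page of the translation.
     <pb n="2" edRef="#mueller-p11"/>Second page.</p>
</body></text></TEI>"""


def page_links(root):
    """Collect (page number, edRef target) for every pb that has an edRef."""
    return [
        (pb.get("n"), pb.get("edRef"))
        for pb in root.iter(f"{TEI}pb")
        if pb.get("edRef")
    ]


links = page_links(ET.fromstring(SAMPLE))
print(links)
```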

wsalesky commented 5 years ago

Okay, thanks. Let me play around with the examples and see what would work best.

wsalesky commented 5 years ago

@nathangibson can you make me a test xml with just the relevant 2 pages of IU.xml? It will be a little easier for me to experiment with.

nathangibson commented 5 years ago

Here you go: https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-sample.xml. I also adjusted the edRefs so they point to this sample, and added the links to the transcription doc too: https://github.com/usaybia/usaybia-data/commit/0dd622c9080113e9f600ff22f3d8f997bbf3a163

wsalesky commented 5 years ago

@nathangibson Do you want to keep these (English and Arabic versions) in separate TEI records? Both TEI Publisher and EVT expect them in the same document. I'm sure we could do some hacking to change the default behavior if we need to.

wsalesky commented 5 years ago

@nathangibson question about the https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-sample-alignment.xml

When I try to map the tei:zone elements to an image I get rather wonky polygons. I'm now thinking that EVT will be a better option because it already has some text/image alignment built in. However, the software expects coordinates for a rectangle. Here is a TEI example:

<surface corresp="#fol_214v" xml:id="surf_214v">
    <graphic height="1793px" url="images\single\214v.jpg" width="1200px"/>
    <zone lrx="564" lry="464" rend="visible" rendition="signum" ulx="326" uly="154" xml:id="st_214v_001"/>
</surface>

I tried just using an HTML image map to test where all of your points line up and am getting some odd results. Does Transkribus have different output options?

[Image: Screen Shot 2019-04-16 at 2.35.15 PM]

Sorry this is taking so long, it is a complicated issue.

wsalesky commented 5 years ago

@nathangibson We need to standardize some XML practices before I go too much farther, to make sure I do not waste time. I don't have any best-practice recommendations, as long as it is consistent. Do you want me to do some TEI research? Or should I save my time for development?

  1. Linking transcriptions/translations/original texts
  2. the issue above about highlighting parts of the image

Are there other issues I can work on in the meantime?

nathangibson commented 5 years ago

> @nathangibson Do you want to keep these (English and Arabic versions) in separate TEI records? Both TEI Publisher and EVT expect them in the same document. I'm sure we could do some hacking to change the default behavior if we need to.

Sorry for the long silence! No need to hack--we can put them in the same doc. Only thing is, the text is very long, so we'll probably want to split it into parts, right? (500 pages in the original Arabic)

nathangibson commented 5 years ago

> @nathangibson question about the https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/IU-sample-alignment.xml
>
> When I try to map the tei:zone elements to an image I get rather wonky polygons. I'm now thinking that EVT will be a better option because it already has some text/image alignment built in, however the expectation for the software is coordinates for a rectangle. Here is a TEI example: ... I tried just using an HTML image map to test where all of your points line up and am getting some odd results. Does Transkribus have different output options? ...

Let me make sure I understand: We need rectangles with @ulx, @uly, @lrx, and @lry coordinates on the zone rather than points?

Is this a requirement of EVT or TEI publisher or both?

Transkribus doesn't give me options for how I export the coordinates. However, it does point me to the XSLT, which we could modify if necessary: https://github.com/dariok/page2tei.

Can we try the following to see if it works? Grab rectangle coordinates from the points by tokenizing the point values and then taking the min and max x and y coordinates?

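XQuery:

declare default element namespace "http://www.tei-c.org/ns/1.0";

(: For each zone, replace the @points polygon with its bounding rectangle. :)
for $zone in //zone[@points]
let $point-values := tokenize(normalize-space($zone/@points), ' ')

(: x is the first value of each "x,y" pair :)
let $x :=
    for $point in $point-values
    return number(tokenize($point, ',')[1])

(: y is the second value of each pair :)
let $y :=
    for $point in $point-values
    return number(tokenize($point, ',')[2])

return
    element zone {
        $zone/@*[not(name() = 'points')],
        attribute ulx { min($x) },
        attribute uly { min($y) },
        attribute lrx { max($x) },
        attribute lry { max($y) }
    }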

For example, the line you showed (I think) has this zone

<zone points='354,2502 420,2501 487,2499 554,2498 620,2497 687,2496 754,2496 820,2495 887,2495 954,2495 1021,2495 1087,2495 1154,2495 1221,2495 1287,2495 1354,2495 1421,2495 1487,2495 1554,2495 1621,2495 1688,2494 1688,2443 1621,2444 1554,2444 1487,2444 1421,2444 1354,2444 1287,2444 1221,2444 1154,2444 1087,2444 1021,2444 954,2444 887,2444 820,2444 754,2445 687,2445 620,2446 554,2447 487,2448 420,2450 354,2451' rendition='Line' xml:id='facs_103_r2l34'/>

It would become <zone rendition="Line" xml:id="facs_103_r2l34" rotate="0" ulx="354" uly="2443" lrx="1688" lry="2502"/>
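As a quick sanity check of those numbers outside eXist, here is the same bounding-box idea sketched in Python. The `bounding_box` helper is hypothetical, and the input is just the four corner points of the zone above (first, 21st, 22nd, and last), not the full polygon.

```python
def bounding_box(points: str) -> dict:
    """Reduce a TEI @points polygon ("x,y x,y ...") to its bounding rectangle."""
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    # Image coordinates grow downward, so the upper-left corner has the
    # smallest x and y and the lower-right corner has the largest.
    return {"ulx": min(xs), "uly": min(ys), "lrx": max(xs), "lry": max(ys)}


# Corner points taken from zone facs_103_r2l34 above.
box = bounding_box("354,2502 1688,2494 1688,2443 354,2451")
print(box)
```

The rectangle is slightly taller than the line itself wherever the baseline slopes, which is the price of dropping the polygon.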

Does TEI Publisher not have image alignment built in? I am open to EVT but all else being equal I like the community support for TEI publisher.

wsalesky commented 5 years ago

@nathangibson Did you see an example of TEI publisher where they use the image coordinates? I found a few examples of text/image/transcription, but the image was not using the coordinates to align with the text. Which I think is what you want.

nathangibson commented 5 years ago

I did see an example of this in a demo by https://github.com/mittagessen/kraken but I don't know what kind of tweaking they might have done to TEI-Pub.

I'm beginning to wonder if we should just do the page-by-page image-transcription alignment and not worry about superimposing rectangles for the lines on top of them. In order of importance, the features our users need for the text display are

  1. Factoids linked to source text and vice-versa
  2. English and Arabic texts shown in parallel on a page-by-page basis
  3. Being able to select different Arabic texts to display
  4. Images shown in parallel with text (page-by-page)
  5. Individual lines or words highlighted on image

This last feature I thought we would get for free because Transkribus does automatic line recognition and we are doing a line-by-line transcription. However, it might be more trouble to display this line-by-line alignment than it's worth. Since we are entering line breaks in the transcription, we could simply display the transcribed text line-by-line and it would be easy enough for the user to see how the transcribed lines correspond to the lines in the image, without drawing rectangles on top of the image. I don't want us to waste a lot of time on a feature that is just icing on the cake.

What do you think, shall we go for page-by-page alignment instead?

nathangibson commented 5 years ago

Would you like me to provide another sample file that has the facsimile and all text versions in a single file, with page-by-page alignment?

wsalesky commented 5 years ago

@nathangibson Thanks for the priority list. I think the best thing to do would be to use the Corpus app, because page-by-page is not hard to add, and shoehorning TEI-Publisher into Syriaca.org just for the page views seems like overkill. We can leave 5 as a future feature to develop if we have time. But let's get the other bits out of the way first.

Since we are not using TEI Publisher you can structure your data any way you like as long as there are clear links between all the versions. If you want to go with a file with facsimile and all text versions together that is fine, just send me a sample. What do you think would work best?

nathangibson commented 5 years ago

@wsalesky That sounds good.

I think the most convenient way of arranging the files is to have one file per version, per chapter. There are 15 chapters, and at this point I anticipate 3 aligned versions: Arabic Edition (AE), Arabic Manuscript (AM), and English Translation (ET). We can use those abbreviations for simplicity. So we would have ae01, ae02, ..., ae15, et01, etc. The image-text alignment for each version could be done within each file, i.e., a facsimile node parallel to the text node. By my calculations, each file would be about 300 KB, which would make it a lot easier to work with than the beastly multi-MB files I have now. How does that sound?
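Just to spell out the naming scheme, a small sketch that enumerates the file stems for 3 versions and 15 chapters. The `file_stems` helper is hypothetical, and the zero-padded two-digit chapter number is inferred from the "ae01" example.

```python
# Version abbreviations from the comment above: Arabic Edition, Arabic
# Manuscript, English Translation.
VERSIONS = ("ae", "am", "et")
CHAPTERS = range(1, 16)  # chapters 1 through 15


def file_stems():
    """Return every version/chapter stem, e.g. 'ae01' ... 'et15'."""
    return [f"{version}{chapter:02d}" for version in VERSIONS for chapter in CHAPTERS]


stems = file_stems()
print(len(stems), stems[0], stems[-1])
```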

Inserting pb elements with an edRef to a "canonical" version is a somewhat self-explanatory way of aligning the texts but perhaps too idiosyncratic. Perhaps it would be better to use the anchor + corresp system explained here: https://tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACS. Any thoughts? I'll check whether the corpus has any alignment mechanism in view.

wsalesky commented 5 years ago

@nathangibson that sounds good to me. How are they going to reference each other? Somewhere in the teiHeader? I have the skeleton of the code built, but will need real data to test it on.

If you can get one text marked up this way (even just one chapter, with matching chapter transcription and translation) I can finish up the code for the display and we can move on to other features.

nathangibson commented 5 years ago

Do you mean how will the different versions reference each other? Perhaps the @corresp could include the entire URI of the referenced text, e.g., <anchor xml:id='p1' corresp='https://usaybia.net/text/ae01#p1'/>

Or do you mean how will the parts link to one another? Perhaps using @next and @prev on the body tags in each doc? E.g., <body xml:id='body' prev='https://usaybia.net/text/et01#body' next='https://usaybia.net/text/et03#body'>
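On the consuming side, a full-URI pointer like this is easy to take apart. A minimal sketch, assuming the pointer is always a document URI plus a `#fragment` naming an xml:id in that document; the `split_pointer` helper is hypothetical.

```python
from urllib.parse import urldefrag


def split_pointer(pointer: str):
    """Split a TEI pointer into (document URI, xml:id of the target)."""
    uri, fragment = urldefrag(pointer)
    return uri, fragment


# Example URI from the comment above.
doc_uri, anchor_id = split_pointer("https://usaybia.net/text/ae01#p1")
print(doc_uri, anchor_id)
```

The app would then load the document at `doc_uri` and look up the element whose xml:id equals `anchor_id`.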

nathangibson commented 5 years ago

Or does the Corpus app already have a mechanism for multi-part text files?

wsalesky commented 5 years ago

The Corpus does not, but maybe it is something Jamey has considered? Also how will the translation, and different Arabic versions reference each other?

nathangibson commented 5 years ago

I've run out of time today but will try to work up a sample tomorrow.

nathangibson commented 5 years ago

@jedwardwalters Thanks for the reminder about TAN. I see that there is now a TEI customization that includes it, but doesn't seem to be too limiting in other respects. Giving it a try.

@wsalesky What do you think of the following for aligning ET with AE? Will be adding facsimile image info to the Arabic AE file. Was realizing that the language alignment is really on a semantic level (entries or paragraphs), whereas the image-text alignment has to do with random page breaks. Perhaps it would be better to keep these separate? The language alignment is more important than displaying page images.

For docs on the TAN alignment system, see http://textalign.net/release/TAN-2018/guidelines/xhtml/index.xhtml. I'm not committed to this yet--it's just an interesting possibility that @jedwardwalters mentioned and I'd like to get your feedback on it.

The divs align with each other by type and number. The numbers iterate for each leaf rather than being cumulative. So you would find the two paragraphs to align at the same place in the document tree for both docs, e.g., /TEI/text/body/div[@type='pt' and @n='3']/div[@type='ch' and @n='15']/div[@type='bio' and @n='47']/div[@type='par' and @n='12']
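A rough sketch of how that lookup could work in code: build the same path once and run it against both documents. The two sample documents and the helper names are invented for illustration; only the type/number pairs come from the path above. ElementTree's limited XPath has no `and`, so the sketch chains two predicates instead.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# (type, n) for each nesting level, as in the path above.
STEPS = [("pt", "3"), ("ch", "15"), ("bio", "47"), ("par", "12")]


def div_path(steps):
    """Build a relative path of nested divs selected by @type and @n."""
    return "/".join(f"tei:div[@type='{t}'][@n='{n}']" for t, n in steps)


def find_aligned(root, steps):
    """Return the div at the given position below /TEI/text/body, or None."""
    body = root.find("tei:text/tei:body", NS)
    return body.find(div_path(steps), NS)


def sample(text):
    """Hypothetical one-paragraph document with the nesting used above."""
    return ET.fromstring(
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        '<div type="pt" n="3"><div type="ch" n="15">'
        '<div type="bio" n="47"><div type="par" n="12">'
        f"{text}</div></div></div></div></body></text></TEI>"
    )


english = find_aligned(sample("English paragraph"), STEPS)
arabic = find_aligned(sample("Arabic paragraph"), STEPS)
print(english.text, "|", arabic.text)
```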

https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/iu-sample-kopf-en-tan.xml https://github.com/usaybia/usaybia-data/blob/master/data/texts/tei/iu-sample-mueller-ar-tan.xml

nathangibson commented 5 years ago

OK, just updated the Arabic with facsimile links. /TEI/facsimile/surface/ points to the image URL using graphic/@url, and @start links to the page transcription using //pb/@xml:id.
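For the page-by-page view, the lookup this enables might go roughly as follows. The sample document, the image paths, and the `page_images` helper are hypothetical; the sketch assumes each surface/@start is a local `#id` pointer to a pb in the same file.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature: two surfaces, each pointing at one page break.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface start="#pb1"><graphic url="img/page-001.jpg"/></surface>
    <surface start="#pb2"><graphic url="img/page-002.jpg"/></surface>
  </facsimile>
  <text><body><p><pb xml:id="pb1"/>Page one.<pb xml:id="pb2"/>Page two.</p></body></text>
</TEI>"""


def page_images(root):
    """Map each pb's xml:id to the image URL of the surface that starts there."""
    by_target = {}
    for surface in root.iter(f"{TEI}surface"):
        graphic = surface.find(f"{TEI}graphic")
        start = surface.get("start", "")
        if graphic is not None and start.startswith("#"):
            by_target[start[1:]] = graphic.get("url")
    # Pages without a matching surface map to None.
    return {pb.get(XML_ID): by_target.get(pb.get(XML_ID)) for pb in root.iter(f"{TEI}pb")}


images = page_images(ET.fromstring(SAMPLE))
print(images)
```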

wsalesky commented 5 years ago

@nathangibson This looks okay to me. I like the info in the header (essentially making it easy for me to group all versions by the work IRI). However I have some reservations about adding all of those divs, it seems overly verbose. I would probably lean toward using the ab/anchor/seg described here: https://tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASE

However I am happy to go with whichever version you are most comfortable with. Either way will give me enough hooks to build the alignment functionality. I can have something up for you by Monday.

nathangibson commented 5 years ago

Perhaps you're right--might be overkill. I've now added anchors to those docs (corresp in only one direction, from English to Arabic). Is this what you had in mind?

If we're not using TAN for aligning the text display, I'm not sure whether I'll use TAN at all. But for now I doubt it will get in the way of you testing the alignment using the anchors, right?

And the way I've done the page images works OK for you?

wsalesky commented 5 years ago

Yes the images work for me. I will do some coding on this over the weekend so you can have something to review Monday.

nathangibson commented 5 years ago

Thanks!

nathangibson commented 4 years ago

@wsalesky Closing this too, as what you've done is sufficient for the prototype interface. Will make a new issue for the new text files from the new edition/translation.