Create Arabic-English combined TEI text for testing

nathangibson commented 3 years ago

~~See #10 for format.~~ See below for structure/format.

@wsalesky Here's where I'll be doing this.

nathangibson commented 3 years ago

@wsalesky I finally have something for you to test here: https://github.com/usaybia/usaybia-data/blob/development/data/texts/tei/lhom-12.xml

This is an aligned Arabic-English text, one of 16 chapters total. The text structure (only slightly modified from the original at https://dh.brill.com/scholarlyeditions/reader/urn:cts:arabicLit:0668IbnAbiUsaibia.Tabaqatalatibba.lhom-tr-eng1:12.1/?right=lhom-ed-ara1) is annotated below.

I think I would like to display this as 1 aligned paragraph per page. (The alternative would be 1 biography per page, but some of the biographies are 10 or so printed pages.) So the page would contain

- Source attribution (/teiHeader/fileDesc/sourceDesc)
- Arabic (left column)
  - Chapter title (//head[@xml:lang='ar-Arab'])
  - Biography title (//div[@subtype='biography' and @n=$current-bio]/p[@n=1 and @xml:lang='ar-Arab']/hi)
  - Current paragraph number ($current-p)
  - Paragraph (//div[@subtype='biography' and @n=$current-bio]/(p|cit|list)[@n=$current-p and @xml:lang='ar-Arab'])
- English (right column)
  - Chapter title (//head[@xml:lang='en-Latn'])
  - Biography title (//div[@subtype='biography' and @n=$current-bio]/p[@n=1 and @xml:lang='en-Latn']/hi)
  - Current paragraph number ($current-p)
  - Paragraph (//div[@subtype='biography' and @n=$current-bio]/(p|cit|list)[@n=$current-p and @xml:lang='en-Latn'])
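
The per-page selection above can be sketched in Python with the standard library. This is only an illustration, not the app's XQuery: it assumes the elements are in the TEI namespace, and `page_units` and the sample markup are hypothetical names and placeholder content.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
UNIT_TAGS = {f"{{{TEI_NS}}}{name}" for name in ("p", "cit", "list")}

# Placeholder content; only the element and attribute layout follows the thread.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="textpart" subtype="chapter" n="12">
  <div type="textpart" subtype="biography" n="1">
    <p n="1" xml:lang="ar-Arab">[Arabic paragraph 1]</p>
    <p n="1" xml:lang="en-Latn">English paragraph 1</p>
    <cit n="1" xml:lang="en-Latn">English block quote 1</cit>
    <list n="2" xml:lang="ar-Arab">[Arabic list 2]</list>
    <list n="2" xml:lang="en-Latn">English list 2</list>
  </div>
</div></body></text></TEI>"""


def page_units(root, bio_n, p_n, lang):
    """Return the p, cit and list children of one biography matching n and xml:lang."""
    units = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if div.get("subtype") == "biography" and div.get("n") == str(bio_n):
            for child in div:
                if (child.tag in UNIT_TAGS
                        and child.get("n") == str(p_n)
                        and child.get(XML_LANG) == lang):
                    units.append(child)
    return units


root = ET.fromstring(SAMPLE)
arabic = page_units(root, 1, 1, "ar-Arab")   # left column
english = page_units(root, 1, 1, "en-Latn")  # right column
print([u.text for u in arabic])
print([u.text for u in english])
```

Filtering in Python rather than in the path expression sidesteps ElementTree's limited XPath subset; in eXist the XPath expressions listed above could be used directly.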

Note that head | p | cit | list can all contain note. It would be nice to display these as footnotes.
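
One possible way to turn inline notes into footnotes, sketched in Python: each note is pulled out of the running text and replaced by a numbered marker. The function name `render_with_footnotes`, the `[k]` marker style, and the sample paragraph are all hypothetical, and the TEI namespace is assumed.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
NOTE = f"{{{TEI_NS}}}note"

# Placeholder paragraph with two inline notes.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0" n="1" xml:lang="en-Latn">'
    "First sentence.<note>A note on the first sentence.</note>"
    " Second <hi>emphasised</hi> sentence.<note>Another note.</note></p>"
)


def render_with_footnotes(elem):
    """Return (body_text, footnotes), replacing each note with a [k] marker."""
    footnotes = []

    def walk(node):
        parts = [node.text or ""]
        for child in node:
            if child.tag == NOTE:
                # Move the note text to the footnote list and leave a marker.
                footnotes.append("".join(child.itertext()).strip())
                parts.append(f"[{len(footnotes)}]")
            else:
                parts.append(walk(child))
            # Text following a child element belongs to the parent's flow.
            parts.append(child.tail or "")
        return "".join(parts)

    return walk(elem), footnotes


body, footnotes = render_with_footnotes(ET.fromstring(SAMPLE))
print(body)
for number, text in enumerate(footnotes, start=1):
    print(f"{number}. {text}")
```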

On small displays, the Arabic and English columns could be stacked.

I like the side column you had earlier for factoids. We could keep this as a side column (I'll provide some factoids for this chapter when I can) or even make it a middle column, between the two languages.

I'd be interested in your thoughts on:

1. What unit of text our files should contain. Chapter-level is the largest that's really feasible to work with, but I'm wondering if it'd be easier to break them down to the biography level. There are about 450 bios total in the 16 chapters. Chapters range from 5 to 50 bios. Bios range from a paragraph to several pages. Besides being able to page through them in the reader interface, we also want to easily cite them on the bio level (in person entries) or even the paragraph level (for factoids).
2. What do you think we should use as URIs for the text? Would something like https://usaybia.net/text/12 be ok? I could stick it as an idno in publicationStmt like Syriac Corpus does.

Annotated structure:

<!-- Container for all chapters, but duplicated in each file since there is currently 1 file per chapter. 
Extraneous here except that it contains the original text's CTS URN --> 
<div type="edition" n="urn:cts:arabicLit:0668IbnAbiUsaibia.Tabaqatalatibba.lhom-ed-ara1" xml:lang="ar-Arab">

  <!-- Container for the entire chapter. n is the first digit in the CTS passage  -->
  <div type="textpart" subtype="chapter" n="12">

    <!-- Container for each biography within the chapter. This is the basic unit of text (except in the preface).
    n is the second and final CTS passage digit resolvable by dh.brill.com -->
    <div type="textpart" subtype="biography" n="1">

      <!-- Arabic header for the entire chapter followed by English headers (usually multiple) 
      I guess it's weird they have this here instead of 1 level higher. Maybe I should move it 
      but in any case I'm pretty sure these are the only headers so you could do //head to grab them --> 
      <head xml:lang="ar-Arab" style="direction:rtl; unicode-bidi:embed">...</head>
      <head xml:lang="en-Latn">Chapter 12 Physicians of India</head>
      <head xml:lang="en-Latn">Bruce Inksetter and Emilie Savage-Smith</head>

      <!-- Arabic paragraph or list inside biography followed by corresponding English paragraph or list. 
      I have added numbers so we can cite them specifically (e.g. in factoids) but these numbers are not resolvable 
      as dh.brill.com CTS URNs. NB: The English sometimes has block quotes that are not broken into separate 
      paragraphs in the Arabic. So I have NOT incremented the n on these. A single Arabic p might correspond to 
      an English p plus cit with the corresponding number. --> 
      <p n="1" xml:lang="ar-Arab" style="direction:rtl; unicode-bidi:embed">...</p>
      <p n="1" xml:lang="en-Latn">...</p>
      <cit n="1" xml:lang="en-Latn">...</cit>
      <list n="2" xml:lang="ar-Arab">...</list>
      <list n="2" xml:lang="en-Latn">...</list>

    </div>
  </div>
</div>
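
The alignment rule in the last comment (one Arabic p may correspond to an English p plus a cit with the same n) can be illustrated by grouping a biography's children by n and language. This is a rough Python sketch under the assumption that the elements are in the TEI namespace; `align_units` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
UNIT_NAMES = ("p", "cit", "list")

# Placeholder biography div following the annotated structure.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="textpart" subtype="biography" n="1">
  <head xml:lang="en-Latn">Chapter 12 Physicians of India</head>
  <p n="1" xml:lang="ar-Arab">[Arabic paragraph 1]</p>
  <p n="1" xml:lang="en-Latn">English paragraph 1</p>
  <cit n="1" xml:lang="en-Latn">English block quote 1</cit>
  <list n="2" xml:lang="ar-Arab">[Arabic list 2]</list>
  <list n="2" xml:lang="en-Latn">English list 2</list>
</div>"""


def align_units(bio_div):
    """Map each n to {language: [element names]} for the p, cit and list children."""
    aligned = {}
    for child in bio_div:
        name = child.tag.split("}", 1)[-1]  # strip the namespace part
        n, lang = child.get("n"), child.get(XML_LANG)
        if name not in UNIT_NAMES or n is None or lang is None:
            continue  # e.g. head elements carry no n
        aligned.setdefault(n, {}).setdefault(lang, []).append(name)
    return aligned


alignment = align_units(ET.fromstring(SAMPLE))
for n, by_lang in alignment.items():
    print(n, by_lang)
```

Grouping on n rather than on position keeps the English cit attached to the same unit as the Arabic paragraph it belongs to.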

wsalesky commented 3 years ago

Great, I will get this running this week.

wsalesky commented 3 years ago

@nathangibson This text does not have an idno in the publication statement, which is normally how I access the TEI. If you do not want to add one, we can also use the doc path, but the URI to the record will not be as tidy.
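
For illustration, reading a record URI from the publication statement might look like the following Python sketch. It assumes the TEI namespace and a single idno; the function name `record_uri`, the `type="URI"` attribute, and the sample header are hypothetical, and the URI is the example suggested earlier in this thread.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumption: files use the TEI namespace

# Placeholder header; the idno value is the URI proposed earlier in the thread.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <publicationStmt>
      <idno type="URI">https://usaybia.net/text/12</idno>
    </publicationStmt>
  </fileDesc></teiHeader>
</TEI>"""


def record_uri(root):
    """Return the text of publicationStmt/idno, or None when it is missing."""
    idno = root.find("tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:idno", NS)
    if idno is None or not (idno.text or "").strip():
        return None  # caller would have to fall back to the document path
    return idno.text.strip()


root = ET.fromstring(SAMPLE)
print(record_uri(root))
```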

nathangibson commented 3 years ago

> @nathangibson This text does not have an idno in the publication statement, which is normally how I access the TEI. If you do not want to add one, we can also use the doc path, but the URI to the record will not be as tidy.

@wsalesky I've added the idno now.

Before you get too far into designing the logic for paging through the text, I'd be interested in what you think about whether it would be easier or harder to have the text split into smaller chunks -- the biography level instead of chapter level.

wsalesky commented 3 years ago

@nathangibson Okay, after looking over the data, here are my thoughts:

1. I think chapter is fine, but if you want to break it down to the biography level, that is fine also. The only difference will be how quickly the page renders (the less TEI to convert, the faster the page renders).
2. I think what you have is fine.

nathangibson commented 3 years ago

@wsalesky I decided to try breaking these up into individual biographies, which is probably easier for us to manage. Please use the files 12-1 through 12-6 at https://github.com/usaybia/usaybia-data/tree/nathangibson/tagged-names-issue116/data/texts/tei as a test. If it'd be easier for me to merge some stuff into the development branch, just let me know.

It's mostly the same as the above data format with the following differences:

- URIs (in publicationStmt/idno) are in the format https://usaybia.net/text/12-1. Not sure if this is best practice but it should work for now.
- There are 3 levels of titles x 2 languages in the teiHeader: biography, chapter, and work. You could use these as headers (e.g. h1, h2, h3) for the HTML for the Arabic and English columns respectively (filter by xml:lang).
- I've removed the outermost div (@type='edition').
- The next div is for the chapter, but it is only part of the chapter. The @prev and @next attributes should show you how to navigate forward and backwards for biographies in the chapter. (Semantically, these also indicate that the chapter div spans multiple files, I think.)
- Then you have the biography-level div.
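
A rough Python sketch of how the header titles and the @prev/@next links could be read from one biography file. Several things here are assumptions rather than facts about the data: the TEI namespace, that @prev and @next hold URIs in the same format as the idno, that titles sit in titleStmt in biography, chapter, work order, and the names `titles_by_lang` and `neighbours`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: files use the TEI namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Placeholder file for biography 12-2; attribute values are assumed, not taken from the data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc><titleStmt>
  <title xml:lang="en-Latn">[English biography title]</title>
  <title xml:lang="ar-Arab">[Arabic biography title]</title>
  <title xml:lang="en-Latn">[English chapter title]</title>
  <title xml:lang="ar-Arab">[Arabic chapter title]</title>
  <title xml:lang="en-Latn">[English work title]</title>
  <title xml:lang="ar-Arab">[Arabic work title]</title>
</titleStmt></fileDesc></teiHeader>
<text><body>
<div type="textpart" subtype="chapter" n="12"
     prev="https://usaybia.net/text/12-1" next="https://usaybia.net/text/12-3">
  <div type="textpart" subtype="biography" n="2"/>
</div></body></text></TEI>"""


def titles_by_lang(root):
    """Group header titles by xml:lang, keeping document order within each language."""
    grouped = {}
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is not None:
        for title in header.iter(f"{{{TEI_NS}}}title"):
            text = "".join(title.itertext()).strip()
            grouped.setdefault(title.get(XML_LANG), []).append(text)
    return grouped


def neighbours(root):
    """Return the (prev, next) attribute values of the chapter div, or (None, None)."""
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if div.get("subtype") == "chapter":
            return div.get("prev"), div.get("next")
    return None, None


root = ET.fromstring(SAMPLE)
titles = titles_by_lang(root)
prev_uri, next_uri = neighbours(root)
print(titles["en-Latn"])   # candidates for h1, h2, h3 in the English column
print(prev_uri, next_uri)  # targets for the previous/next page links
```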

For the rest, please apply what I mentioned above.

wsalesky commented 3 years ago

See pull request: https://github.com/usaybia/srophe-eXist-app/pull/86

wsalesky commented 3 years ago

@nathangibson These are ready for review

usaybia / srophe-eXist-app

Create Arabic-English combined TEI text for testing #57