srophe / syriaca-data

Repository for Syriaca.org TEI data, used by srophe-eXist-app.

Updating persName texts nodes from new Headwords #123

Open davidamichelson opened 8 years ago

davidamichelson commented 8 years ago

Once the new English headwords have been proofed, we would like a script to update the persName text nodes in the records (see #issue72) based on the @ref URIs.

For example: http://wwwb.library.vanderbilt.edu/exist/apps/srophe/person/1539

<persName ref="http://syriaca.org/person/1539">ʿAbdishōʿ Dasnāyā</persName> was a monk from

Should become: Abdisho‘ of Dasen

This should wait until we are ready.

davidamichelson commented 8 years ago

This also needs to happen to the beginning strings of the Attestations, which do not have persName tags.

davidamichelson commented 8 years ago

Okay, I am going to link together all the related issues in this one: #536 #356

davidamichelson commented 8 years ago

Here is what needs to be done.

  1. Update all Person Records Titles with New Headwords
    • This should be done in two ways. If the record has person/trait/label/text(anonymous), then the title should be constructed from these two parts: /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:persName[@xml:lang="en"][@syriaca-tags= (or contains) "#syriaca-headword"] - /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:persName[@xml:lang="en"][@syriaca-tags= (or contains) "#anonymous-description"]

If the record is not anonymous, then use the existing pattern: /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:persName[@xml:lang="en"][@syriaca-tags= (or contains) "#syriaca-headword"] - /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:persName[@xml:lang="syr"][@syriaca-tags= (or contains) "#syriaca-headword"]
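
The two title patterns above can be sketched as follows. This is a minimal illustration in Python, not the project's XQuery. It assumes the anonymous/non-anonymous decision is made by the caller (the `anonymous` flag is hypothetical), that "contains" means the tag appears as a whitespace-separated token in @syriaca-tags, and that the two parts are joined with an em dash (`SEPARATOR` is an assumption based on the titles quoted later in this thread).

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
SEPARATOR = " \u2014 "  # assumed separator: space, em dash, space


def names_tagged(person, lang, tag):
    """persName children of `person` in `lang` whose @syriaca-tags contains `tag`."""
    return [
        n for n in person.findall(TEI + "persName")
        if n.get(XML_LANG) == lang and tag in n.get("syriaca-tags", "").split()
    ]


def text_of(elem):
    """Full text of an element, with runs of whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def build_title(person, anonymous):
    """English headword, then the anonymous description or the Syriac headword."""
    english = names_tagged(person, "en", "#syriaca-headword")
    if not english:
        raise ValueError("record has no English headword")
    if anonymous:
        second = names_tagged(person, "en", "#anonymous-description")
    else:
        second = names_tagged(person, "syr", "#syriaca-headword")
    first = text_of(english[0])
    # If the second part is missing, this sketch falls back to the headword alone.
    return first + SEPARATOR + text_of(second[0]) if second else first
```

Passing the `tei:person` element of one record returns the title string; writing it back into the record header is left out of the sketch.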

  1. Next we need to update the uses of the headword within its own record (this script should run only on the headword's own record, not on all records) in at least two other places:

- In notes of @type="abstract" or "description" (or both). In either case, look in the person/note to see if there is a persName/@ref=$URI of the person record itself; if so, replace the text node of the persName with the text node of the headword {persName /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:persName[@xml:lang="en"][@syriaca-tags= (or contains) "#syriaca-headword"]}

NOTE: do not make this replacement on any notes whose parent node is tei:quote
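
A minimal sketch of this note update, in Python rather than the project's XQuery. It assumes the notes are direct children of `tei:person`, and it reads the NOTE as "skip any persName whose parent is `tei:quote`" (the reading confirmed further down the thread). The function name and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def update_self_references(person, record_uri, headword):
    """Set the text of note//persName[@ref=record_uri] to `headword`.

    persName elements whose parent is tei:quote are left untouched.
    Returns the number of persName elements changed.
    """
    changed = 0
    for note in person.findall(TEI + "note"):
        if note.get("type") not in ("abstract", "description"):
            continue
        # ElementTree has no parent pointers, so build a child -> parent map.
        parent_of = {child: parent for parent in note.iter() for child in parent}
        for name in list(note.iter(TEI + "persName")):
            if name.get("ref") != record_uri:
                continue
            if parent_of[name].tag == TEI + "quote":
                continue  # do not alter quoted material
            for child in list(name):
                name.remove(child)  # drop any nested name-part elements
            name.text = headword  # attributes and tail text are preserved
            changed += 1
    return changed
```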

  1. Then we need to replace a text string (non-tagged) form of the name in any /tei:TEI/tei:text/tei:body/tei:listPerson/tei:person/tei:event/@type="attestation".
    • In the attestations, the persName is just the very first string of words and can be captured because it is always followed by "is commemorated in" or "are commemorated in"

Please replace this with a tagged version of the headword persName.

Thus:

<event type="attestation" xml:id="attestation1745-1" source="#bib1745-1">
    <p xml:lang="en">Tren aḥe d-bdayra d-beth Porsoye is commemorated in <title ref="http://syriaca.org/work/500">Two Persian Brothers (text)</title>.</p>
</event>

becomes

<event type="attestation" xml:id="attestation1745-1" source="#bib1745-1">
    <p xml:lang="en"><persName xml:id="name1745-5" xml:lang="en" resp="http://syriaca.org" syriaca-tags="#syriaca-headword">Anonymi 1745</persName> is commemorated in <title ref="http://syriaca.org/work/500">Two Persian Brothers (text)</title>.</p>
</event>
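
The attestation rewrite shown above might be sketched like this, in Python rather than the project's XQuery. It assumes the untagged name is the leading text of the `p` element and ends right before "is commemorated in" or "are commemorated in", as described. The function name and the `headword_attrs` parameter (the attributes to copy onto the new persName) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
# Group 1: the untagged name. Group 2: everything from the verb phrase onward.
LEAD = re.compile(r"^\s*(.+?)(\s+(?:is|are) commemorated in\b.*)$", re.S)


def tag_attestation_names(person, headword_attrs, headword):
    """Replace the leading name string of each attestation with a persName."""
    changed = 0
    for event in person.findall(TEI + "event"):
        if event.get("type") != "attestation":
            continue
        for p in event.findall(TEI + "p"):
            match = LEAD.match(p.text or "")
            if not match:
                continue  # leading text does not follow the expected pattern
            name = ET.Element(TEI + "persName", dict(headword_attrs))
            name.text = headword
            name.tail = match.group(2)  # " is commemorated in " stays as text
            p.text = None
            p.insert(0, name)  # new persName becomes the first child of p
            changed += 1
    return changed
```

Paragraphs whose leading text does not match are skipped rather than guessed at, so a count lower than the number of attestations points to records that need a manual look.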

wsalesky commented 8 years ago

@davidamichelson Should I run this on just the items with the new headwords (contains(@xml:id, '-h')), or on all persons?

davidamichelson commented 8 years ago

I'll try to answer this in about half an hour.

On Aug 16, 2016, at 8:39 PM, Winona Salesky wrote:

@davidamichelson Should I run this on just the items with the new headwords (contains(@xml:id, '-h')), or on all persons?

wsalesky commented 8 years ago

Here is one that has a new English headword, no Syriac headword, but Syriac in the title when I look at the tei: http://wwwb.library.vanderbilt.edu/exist/apps/srophe/person/492/tei

What should I do with it? I think I will update just the English and leave the Syriac.

davidamichelson commented 8 years ago

Thanks for these and sorry I wasn't quite so clear.

First on the Syriac, I guess we actually need this:

If there is a persName[@xml:lang="syr"][@syriaca-tags= (or contains) "#syriaca-headword"], use that. If there is not, then use [@xml:lang="syr"][1]. Those should probably be the same anyway, so if you wanted you could just use [@xml:lang="syr"][1].
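
As a minimal sketch (Python, not the project's XQuery), the fallback rule could look like this. It assumes the persName elements are direct children of `tei:person` and that "contains" means a whitespace-separated token in @syriaca-tags. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def syriac_headword(person):
    """Tagged Syriac headword if present, else the first Syriac persName."""
    syriac = [
        n for n in person.findall(TEI + "persName")
        if n.get(XML_LANG) == "syr"
    ]
    tagged = [
        n for n in syriac
        if "#syriaca-headword" in n.get("syriaca-tags", "").split()
    ]
    candidates = tagged or syriac  # fall back to document order
    return candidates[0] if candidates else None
```

Returning `None` when a record has no Syriac name at all leaves it to the caller to decide whether the title then carries only the English headword.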

wsalesky commented 8 years ago

Okay, running some tests. Hope to push changes in less than an hour.

davidamichelson commented 8 years ago

On "Is there a reason I cannot do a replace on all tei:persName[@ref=$THE RECORD ID]?":

This will probably work. We were only careful to make sure that the first time it occurs would work grammatically, but it would actually be very rare for the persName[@ref=$THE RECORD ID] to occur more than once. The important thing is that the script should not be run on note[@type=abstract]/quote/persName because that would change the quote.

wsalesky commented 8 years ago

Okay. Sounds good.

wsalesky commented 8 years ago

Thanks!

davidamichelson commented 8 years ago

No, thank you. I am probably calling it a day, but I will check in and answer questions in the morning before my flight.

wsalesky commented 8 years ago

Okay, branch here: https://github.com/srophe/srophe-app-data/tree/issue123

wsalesky commented 8 years ago

XQuery: https://github.com/srophe/srophe-xQueries/blob/master/issue123-headwords.xql

davidamichelson commented 8 years ago

Okay, a few corrections.

- Sometimes there is a comma that is inside the persName but not in the or , such as in person/1179 or in person/422. That comma is getting dropped from the title but should be retained in the title, please.

- On at least one record the revisionDesc/change was repeated for every single update of an attestation. That is not a big deal though; I would say just leave it: https://github.com/srophe/srophe-app-data/compare/dev...issue123?expand=1#diff-d44dbdde0320df962a1df3ceb8d356e3R163
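
One way a comma like this can get lost is when a title is assembled from the name-part child elements only, because the comma sits in the text between them. The Python sketch below contrasts that with reading the full text content of the persName. Whether the linked XQuery actually works this way has not been checked, and the child element names `forename` and `addName` are placeholders chosen for the illustration.

```python
import xml.etree.ElementTree as ET


def name_parts_only(pers_name):
    """Join only the child elements' text; punctuation between parts is lost."""
    return " ".join((child.text or "").strip() for child in pers_name)


def full_name_text(pers_name):
    """Join all text, including text between children, collapsing whitespace."""
    return " ".join("".join(pers_name.itertext()).split())


# Hypothetical record: the comma lives between the two child elements.
sample = ET.fromstring(
    "<persName><forename>Theodosius</forename>, "
    "<addName>metropolitan of Edessa</addName></persName>"
)
print(name_parts_only(sample))  # Theodosius metropolitan of Edessa
print(full_name_text(sample))   # Theodosius, metropolitan of Edessa
```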

davidamichelson commented 8 years ago

A couple more thoughts.

- Did this headword script run on all the abstracts and on all the attestations? It says only 507 files changed, which is a little fewer than I would have expected. I would expect that there might be some abstracts and attestations which change even though the titles remained effectively the same string (after the change).

- It looks like the attestations did not change in all records, such as person/1212

davidamichelson commented 8 years ago

Just to clarify, these headword update scripts should run on all person documents

wsalesky commented 8 years ago

New data: https://github.com/srophe/srophe-app-data/tree/issue123-August-17

davidamichelson commented 8 years ago

Hmm, I can't get Google to show a diff file? https://github.com/srophe/srophe-app-data/compare/dev...issue123-August-17

nathangibson commented 8 years ago

We thought you were already in the air, so we did merge in the changes after I looked over them. So if you see anything you want to change you’d need to do it from the current dev branch.

davidamichelson commented 8 years ago

Thanks Gang! As an fyi I will still have e-mail access until about 5 PM Eastern

It mostly looks good but there is one issue with the titles.

It looks like the commas got fixed but now there is extra whitespace in the titles: see https://github.com/srophe/srophe-app-data/commit/c93d8ea7ba12e24ac37aacd9d7c73dfc41677efb#diff-52f66bb3cd450fd5210c131a7f55e97bR7

So if the name has parts there are now two spaces between each part: "Theodosius, metropolitan of Edessa —" It also looks like every name (even if it had only one part element) has two white spaces before the em dash: "Thomas the Stylite -"

Should we just use find and replace to clean this up? Can I please delegate that to the two of you to decide?

Thanks, D

davidamichelson commented 8 years ago

I suppose simple find and replace for double white space with single whitespace in all person titles should do it? @wsalesky am I missing anything?
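
A minimal sketch of that cleanup in Python, operating on the title string only. Locating the title element in each person file and writing the result back is left out, since the thread does not give that path. The function name is hypothetical.

```python
import re


def collapse_spaces(title):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", title)
```

The pattern matches only the ordinary space character, so line breaks and tabs in a title would be left alone. Matching `\s{2,}` instead would also fold those, which may or may not be wanted.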

nathangibson commented 8 years ago

I can do this.