Retagging Joyce’s dialogue

yellwork commented 7 years ago

This is a to-do issue to pick out the various tasks discussed in #9:

[x] Convert all double-hyphen dialogue dashes to the quotation dash or horizontal bar.
[x] Shift the </said> tags in <said>―</said> structures to the end of character speech. Add all intermedial <said> tagging.
[x] Proof the </said> tagging for every episode. How? We will visualize all of the episodes in a browser and colour just the </said> tagged dialogue. Episodes remaining:
    1. “Telemachus”
    2. “Nestor”
    3. “Proteus”
    4. “Calypso”
    5. “Lotus Eaters”
    6. “Hades”
    7. “Aeolus”
    8. “Lestrygonians”
    9. “Scylla and Charybdis”
    10. “Wandering Rocks”
    11. “Sirens”
    12. “Cyclops”
    13. “Nausicaa”
    14. “Oxen of the Sun”
    15. “Circe”
    16. “Eumaeus”
    17. “Ithaca”
    18. “Penelope”
[x] Disambiguate the appropriate <emph> to <said> tagging. [there might be a few other stragglers]
[x] Add @who attribution for every instance of <said> (or in “Circe” <sp>). Use character names for the values.
[ ] Switch @who values to @xml:id.
[ ] Compile a <listPerson> dossier of speakers.

c-forster commented 7 years ago

Having a to-do list for this seems wise. FYI: The following ack one-liner will extract names from the who attribute of said tags.

ack -o "(?<=<said who=\")[\w\'\. ]*" *.xml

This will compile a sorted list of all the names across the corpus:

ack -ho "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq

I was using it as a sanity check to catch misspellings when I marked up "Telemachus."
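For anyone without ack installed, the same sanity check can be sketched in Python. This is only a rough equivalent, not part of the repository: it assumes, as the ack pattern does, that who is the first attribute on <said>, and the name count_speakers is made up for illustration. Counting occurrences (rather than just listing unique values) makes a one-off misspelling stand out.

```python
import re
from collections import Counter

# Same character class as the ack pattern: word characters,
# apostrophes, full stops and spaces, directly after <said who=".
WHO_PATTERN = re.compile(r"<said who=\"([\w'. ]*)")

def count_speakers(texts):
    """Count every @who value found in an iterable of XML strings."""
    counts = Counter()
    for text in texts:
        counts.update(WHO_PATTERN.findall(text))
    return counts
```

To run it over the corpus, pass in the contents of each episode file, for example with pathlib.Path(".").glob("*.xml") and read_text(encoding="utf-8"), then print the counts sorted by name.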

Could we also assign, or let people claim, episodes to mark up with dialogue on this, or another issue? I am going to tackle another episode as soon as I can, and want to avoid duplicating labor.

yellwork commented 7 years ago

Good idea. Can we formally assign them or do we just call dibs here?

After you started <said> tagging, Chris, I snagged a lot of the low-hanging fruit (the less chatty, shorter episodes). Claiming the longer ones now makes sense because they’re likely to take a considerable bit of time to mark up.

Those ack commands will come in very handy once we start figuring out the speaking parts.

yellwork commented 7 years ago

Going to do the @who attribution on “Proteus” now.

yellwork commented 7 years ago

Going to tackle @who on “Aeolus” now.

JonathanReeve commented 7 years ago

@c-forster, that ack hack is great. I use ag, "The Silver Searcher," myself, and was able to get it to work the same way using ag --nofilename -o "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq. I'll put this into a makefile so that we can run these sorts of things easily.

yellwork commented 7 years ago

I’m simplifying this. A <listPerson> for the entire novel would be incredible, but … too much work for now. So I’m going to switch all @who values to character initials and put the key in the separate plaintext file persons.txt.

JonathanReeve commented 7 years ago

Sounds good. I'm not seeing the key in persons.txt, though? Anyway, when it's there, if it's in some kind of regular format, like comma- or tab-separated, it'll be easy to make a list of these keys to add to the header.

yellwork commented 7 years ago

I’m doing it all offline while I go through all eighteen episodes. I’ll merge them all into the repository once done.

My local persons.txt looks like this:

db [tab]Davy Byrne
dbc [tab]Davy Byrne's curate
dbm [tab]D.B. Murphy
dd [tab]Dan Dawson
did [tab]Dilly Dedalus

That could be the basis for a <listPerson> – information I’d love to see added but too much for us right now (I feel).
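If the key file stays tab-separated, turning it into a bare-bones <listPerson> later could be fairly mechanical. The following is a minimal sketch under assumptions, not an agreed convention: it assumes one "initials, tab, name" pair per line, uses the initials as xml:id values (so they must be valid XML names), and the function name persons_to_listperson is hypothetical. It also leaves out the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def persons_to_listperson(lines):
    """Build a <listPerson> element from 'initials<TAB>name' lines."""
    list_person = ET.Element("listPerson")
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, name = line.rstrip("\n").split("\t", 1)
        person = ET.SubElement(list_person, "person", {XML_ID: key.strip()})
        ET.SubElement(person, "persName").text = name.strip()
    return list_person
```

Calling ET.tostring(persons_to_listperson(open("persons.txt", encoding="utf-8")), encoding="unicode") would give a string that could be pasted into the header and fleshed out by hand.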

JonathanReeve commented 7 years ago

Awesome, sounds great.

yellwork commented 7 years ago

Some content that was marooned in the closed #9 was your suggestion, Jonathan, for unclear @who values. Something like:

<lb n="060004"/><said xml:id="060004-a" who="Cunningham">―Come on, Simon.
<certainty target="#060004-a" match="@who" locus="value" assertedValue="Power" degree="0.5">
    <desc>It's unclear here whether it's Cunningham or Power speaking.</desc>
</certainty> 
</said>

I’m going to go ahead and use this encoding whenever an unclear speaker is limited to a handful of candidates. Unless you’ve another idea?
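One upside of this encoding is that the alternatives stay machine-readable. As a rough illustration (an assumption-laden sketch, not project code), the candidate speakers for a given <said> could be pulled out with Python's standard library. The helper name candidate_speakers is made up, and the "{*}" wildcard, which matches the element with or without the TEI namespace, needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

def candidate_speakers(said):
    """Return (name, degree) pairs: the tagged @who first, with degree None,
    then each alternative recorded in a <certainty> child."""
    candidates = [(said.get("who"), None)]
    for cert in said.findall("{*}certainty"):
        if cert.get("match") == "@who":
            candidates.append(
                (cert.get("assertedValue"), float(cert.get("degree")))
            )
    return candidates
```

For the example above this would list Cunningham as the tagged speaker and Power as an alternative with degree 0.5.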

JonathanReeve commented 7 years ago

Sounds great. Let's do it. I'll make a note of this in our conventions list, too.

yellwork commented 7 years ago

How do we attribute dialogue in an exchange between several people? There’s a spot like this in “Hades” where no speakers are given for several lines of dialogue:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?</said>
<lb n="060117"/><said who="unclear">―We're stopped.</said>
<lb n="060118"/><said who="unclear">―Where are we?</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

The unclears can only be Cunningham, Power or Simon Dedalus (with Bloom, perhaps, chiming in at U 6.117). How best would that be encoded?

JonathanReeve commented 7 years ago

I read the TEI docs on <certainty> again but this is the best I could think of:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060117"/><said who="unclear">―We're stopped.
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060118"/><said who="unclear">―Where are we?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

...which is super kludgey and not very DRY. Ideally we could do target="#060116 #060117 #060118" on a single <certainty> set, and avoid all this repetition, but it doesn't look like XML can handle multiple attribute values.
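Until there is a cleaner encoding, the repetition could at least be generated rather than typed. A minimal sketch, assuming the attribute set used above and equal odds for every candidate; the helper name add_equal_certainties is hypothetical:

```python
import xml.etree.ElementTree as ET

def add_equal_certainties(said, candidates):
    """Append one <certainty> per candidate speaker, each with an equal share."""
    degree = f"{1 / len(candidates):.2f}"  # three candidates give "0.33"
    for name in candidates:
        ET.SubElement(said, "certainty", {
            "match": "@who",
            "locus": "value",
            "assertedValue": name,
            "degree": degree,
        })
    return said
```

Running it over each who="unclear" element with ["Power", "Cunningham", "Simon Dedalus"] would reproduce the three blocks above.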

@tcatapano, any ideas?

open-editions / corpus-joyce-ulysses-tei

Retagging Joyce’s dialogue #19