open-editions / corpus-joyce-ulysses-tei

James Joyce's novel Ulysses in TEI XML. Work-in-progress.

Retagging Joyce’s dialogue #19

Open yellwork opened 7 years ago

yellwork commented 7 years ago

This is a to-do issue to pick out the various tasks discussed in #9:

c-forster commented 7 years ago

Having a to-do list for this seems wise. FYI: The following ack one-liner will extract names from the who attribute of said tags.

ack -o "(?<=<said who=\")[\w\'\. ]*" *.xml

This will compile a sorted list of all the names across the corpus:

ack -ho "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq 

I was using it as a sanity check to catch misspellings when I marked up "Telemachus."
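For anyone without ack installed, the same sanity check can be sketched in Python. This is a minimal sketch, not part of the repository's tooling: the function name `who_counts` and the sample markup (including the deliberately misspelled `Stehpen`) are hypothetical, and the pattern is the one from the ack one-liner above.

```python
import re
from collections import Counter

# Same lookbehind as the ack one-liner: match whatever follows <said who="
WHO_PATTERN = re.compile(r"""(?<=<said who=")[\w'. ]*""")


def who_counts(xml_text):
    """Count each distinct who value found in a string of TEI markup."""
    return Counter(WHO_PATTERN.findall(xml_text))


# Hypothetical sample markup; "Stehpen" is a planted misspelling.
sample = (
    '<said who="Stephen">...</said>'
    '<said who="Mulligan">...</said>'
    '<said who="Stehpen">...</said>'
)

# Print the sorted list of names with counts, like `sort | uniq -c`.
for name, count in sorted(who_counts(sample).items()):
    print(name, count)
```

A value that turns up only once or twice next to a near-identical, much more frequent one is a likely misspelling.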

Could we also assign, or let people claim, episodes to mark up with dialogue, either on this issue or another one? I am going to tackle another episode as soon as I can, and want to avoid duplicating labor.

yellwork commented 7 years ago

Good idea. Can we formally assign them or do we just call dibs here?

After you started <said> tagging, Chris, I snagged a lot of the low-hanging fruit (the less chatty, shorter episodes). Claiming the longer ones now makes sense because they’re likely to take a considerable bit of time to mark up.

Those ack commands will come in very handy once we start figuring out the speaking parts.

yellwork commented 7 years ago

Going to do the @who attribution on “Proteus” now.

yellwork commented 7 years ago

Going to tackle @who on “Aeolus” now.

JonathanReeve commented 7 years ago

@c-forster, that ack hack is great. I use ag, "The Silver Searcher," myself, and was able to get it to work the same way using ag --nofilename -o "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq. I'll put this into a makefile so that we can run these sorts of things easily.

yellwork commented 7 years ago

I’m simplifying this. A <listPerson> for the entire novel would be incredible, but … too much work for now. So I’m going to switch all @who values to character initials and put the key in the separate plaintext file persons.txt.

JonathanReeve commented 7 years ago

Sounds good. I'm not seeing the key in persons.txt, though? Anyway when it's there, if it's in some kind of regular format, like comma- or tab-separated, then it'll be easy to make a list of these keys to add to the header.

yellwork commented 7 years ago

I’m doing it all offline while I go through all eighteen episodes. I’ll merge them all into the repository once done.

My local persons.txt looks like this:

db [tab]Davy Byrne
dbc [tab]Davy Byrne's curate
dbm [tab]D.B. Murphy
dd [tab]Dan Dawson
did [tab]Dilly Dedalus

That could be the basis for a <listPerson> – information I’d love to see added but too much for us right now (I feel).
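As a rough illustration of how a tab-separated key could later be turned into a skeleton <listPerson>, here is a minimal sketch. It assumes one `key<TAB>name` pair per line. The function name `persons_to_list_person` and the choice of a bare `<persName>` inside each `<person>` are assumptions for illustration, not an agreed convention for this project.

```python
from xml.sax.saxutils import escape, quoteattr


def persons_to_list_person(tsv_text):
    """Build a skeleton <listPerson> from lines of the form key<TAB>name."""
    lines = ["<listPerson>"]
    for row in tsv_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        key, name = row.split("\t", 1)
        lines.append(
            f"  <person xml:id={quoteattr(key.strip())}>"
            f"<persName>{escape(name.strip())}</persName></person>"
        )
    lines.append("</listPerson>")
    return "\n".join(lines)


# Sample rows taken from the persons.txt excerpt above.
sample = "db\tDavy Byrne\ndbc\tDavy Byrne's curate\ndbm\tD.B. Murphy\n"
print(persons_to_list_person(sample))
```

If the key were ever promoted to xml:id values like this, the @who attributes would presumably need to point at them with a leading `#` (for example `who="#db"`). That is a separate decision.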

JonathanReeve commented 7 years ago

Awesome, sounds great.

yellwork commented 7 years ago

Some content that was marooned in the closed #9 was your suggestion, Jonathan, for unclear @who values. Something like:

<lb n="060004"/><said xml:id="060004-a" who="Cunningham">―Come on, Simon.
<certainty target="#060004-a" match="@who" locus="value" assertedValue="Power" degree="0.5">
    <desc>It's unclear here whether it's Cunningham or Power speaking.</desc>
</certainty> 
</said>

I’m going to go ahead and use this encoding whenever an unclear speaker is limited to a handful of candidates. Unless you’ve another idea?

JonathanReeve commented 7 years ago

Sounds great. Let's do it. I'll make a note of this in our conventions list, too.

yellwork commented 7 years ago

How do we attribute dialogue in an exchange between several people? There’s a spot like this in “Hades” where no speakers are given for several lines of dialogue:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?</said>
<lb n="060117"/><said who="unclear">―We're stopped.</said>
<lb n="060118"/><said who="unclear">―Where are we?</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

The unclears can only be Cunningham, Power or Simon Dedalus (with Bloom, perhaps, chiming in at U 6.117). How best would that be encoded?

JonathanReeve commented 7 years ago

I read the TEI docs on <certainty> again but this is the best I could think of:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060117"/><said who="unclear">―We're stopped.
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060118"/><said who="unclear">―Where are we?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

...which is super kludgy and not very DRY. Ideally we could do target="#060116 #060117 #060118" on a single <certainty> set, and avoid all this repetition, but it doesn't look like XML can handle multiple attribute values.
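One way to keep the repetition out of hand-editing, at least, is to generate the <certainty> elements from a list of candidate speakers. The following is a minimal sketch: the helper name `certainty_elements` is hypothetical, and splitting the degree equally among the candidates is an assumption that mirrors the 0.33 values above.

```python
import xml.etree.ElementTree as ET


def certainty_elements(candidates):
    """Return one <certainty> element per candidate, with an equal degree each."""
    if not candidates:
        raise ValueError("at least one candidate speaker is required")
    # Equal split, rounded to two decimals (three candidates -> "0.33").
    degree = f"{1 / len(candidates):.2f}"
    elements = []
    for name in candidates:
        elements.append(
            ET.Element(
                "certainty",
                {
                    "match": "@who",
                    "locus": "value",
                    "assertedValue": name,
                    "degree": degree,
                },
            )
        )
    return elements


# Candidate speakers for the unattributed lines in the carriage.
for element in certainty_elements(["Power", "Cunningham", "Simon Dedalus"]):
    print(ET.tostring(element, encoding="unicode"))
```

It may also be worth checking the TEI Guidelines on the multi-target idea: as far as I can tell, target is defined there as a whitespace-separated list of pointers, so a single <certainty> set might be able to point at several <said> elements, provided each one has an xml:id.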

@tcatapano, any ideas?