Retagging Joyce’s dialogue dash

open-editions / corpus-joyce-ulysses-tei

James Joyce's novel Ulysses in TEI XML. Work-in-progress.


Retagging Joyce’s dialogue dash #9

Closed yellwork closed 7 years ago

yellwork commented 7 years ago

We’ve inherited the following tagging convention for Joyce’s dialogue markers throughout the corpus (episodes 15, 17, and 18 excepted):

<lb n="010008"/><said>--</said>Come up, Kinch! Come up, you fearful jesuit!</p>

<said> is just tag abuse here. Eventually it will be used to tag the direct speech, but that’s likely a task for the crowd.

I propose the non-controversial (?) global changes of:

all <said> nesting to be replaced with <q> nesting
all -- to be replaced with ―.

So:

<lb n="010008"/><q>―</q>Come up, Kinch! Come up, you fearful jesuit!</p>

The double hyphen currently in use for the dialogue dash is probably a legacy from a time when the character palette was considerably smaller. But it has no place in the corpus now, I don’t think. Instead we should make a global replace with the quotation dash or horizontal bar (Unicode U+2015 or HTML &#8213;).

So neither the hyphen (-), the en dash (–), nor the em dash/tiret (—) but the quotation dash (―).
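The two global changes proposed above could be sketched in Python roughly as follows. This is only an illustration: the function name `retag_dash` is hypothetical, and the pattern assumes the marker always appears exactly as `<said>--</said>`, as in the sample line quoted in this comment.

```python
import re

QUOTATION_DASH = "\u2015"  # U+2015, horizontal bar / quotation dash

# Assumption: the dialogue marker is always the literal string <said>--</said>.
OLD_MARKER = re.compile(r"<said>--</said>")


def retag_dash(xml_text):
    """Replace each <said>--</said> marker with <q>U+2015</q>.

    Double hyphens elsewhere in the text are deliberately left untouched.
    """
    return OLD_MARKER.sub("<q>" + QUOTATION_DASH + "</q>", xml_text)


if __name__ == "__main__":
    line = '<lb n="010008"/><said>--</said>Come up, Kinch! Come up, you fearful jesuit!</p>'
    print(retag_dash(line))
```

Matching the whole marker, rather than every `--` in the file, keeps the replacement from touching hyphen pairs that are not dialogue dashes.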

Q. Will all platforms support the quotation bar? Markdown XML in Chrome has them looking like en dashes. (◔_◔)

Questions and refinements (controversy?) that occur to me: Do we want to encode the dialogue dash at all?

<lb n="010008"/><q/>Come up, Kinch! Come up, you fearful jesuit!</p>

(I don’t know if <q> can be an empty element.) Or, when and if we have the direct speech marked up, we might want to omit both the hard-coded dialogue dash and the <q> tagging:

<lb n="010008"/><said rend="dash">Come up, Kinch! Come up, you fearful jesuit!</said></p>

c-forster commented 7 years ago

So, I'm not sure if I see a big difference between said and q; I think replacing the double hyphen with a quotation bar makes sense.

My question is the following; why is speech kept within the preceding paragraph? For instance,

<p><lb n="100026"/>Father Conmee was very glad to see the wife of Mr David Sheehy
<lb n="100027"/>M. P. looking so well and he begged to be remembered to Mr David Sheehy
<lb n="100028"/>M. P. Yes, he would certainly call.
<lb n="100029"/><said>--</said>Good afternoon, Mrs Sheehy.</p>

I know that in the Gabler edition, there is no indentation for quoted speech (though there is in the Random House text), but I think of that as a sort of print convention of spacing, not an indication of a paragraph. New speakers means new paragraphs, am I wrong? I would want the above example to look like this:

<p><lb n="100026"/>Father Conmee was very glad to see the wife of Mr David Sheehy
<lb n="100027"/>M. P. looking so well and he begged to be remembered to Mr David Sheehy
<lb n="100028"/>M. P. Yes, he would certainly call.</p>

<lb n="100029"/><p><said>―Good afternoon, Mrs Sheehy.</said></p>

In part I ask, because, given the existing markup, it seems like you could programmatically replace a said within a p with a closing p, and then wrap everything from the said to the end of the following p with <p><said>. That is, doesn't the existing markup give enough of a clue to let regular expressions (or programmatic XML parsing) find the "real" structure underneath?

So, two questions I guess:

Am I right that quoted speech should begin its own paragraph?
If so, can't we fix the said (or replace them with q) to correctly tag the spoken text?
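A minimal sketch of the regular-expression approach described above, assuming each paragraph is available as a single `<p>...</p>` string and every utterance starts with an `<lb/>` followed by the `<said>` dash marker. The name `reparagraph` is hypothetical. Note that it wraps everything up to the next marker in `<said>`, so any narration inside an utterance line would still need separating out by hand.

```python
import re

DASH = "\u2015"  # quotation dash, U+2015

# An <lb/> followed by the dash marker, in either the old (--) or new form.
MARK = re.compile(r'(<lb n="\d+"/>)<said>(?:--|' + DASH + r")</said>")


def reparagraph(p_xml):
    """Split one <p> into a narrative <p> plus one <p><said> per utterance."""
    inner = re.fullmatch(r"<p>(.*)</p>", p_xml, flags=re.S).group(1)
    # With one capturing group, split() yields:
    # [narrative, lb, speech, lb, speech, ...]
    parts = MARK.split(inner)
    out = []
    if parts[0].strip():
        out.append("<p>" + parts[0].rstrip() + "</p>")
    for lb, speech in zip(parts[1::2], parts[2::2]):
        out.append("<p>" + lb + "<said>" + DASH + speech.rstrip() + "</said></p>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        '<p><lb n="100028"/>M. P. Yes, he would certainly call.\n'
        '<lb n="100029"/><said>--</said>Good afternoon, Mrs Sheehy.</p>'
    )
    print(reparagraph(sample))
```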

JonathanReeve commented 7 years ago

I'm in favor of converting the double hyphens to quotation dashes. I'll go ahead and do that, since that should be an easy sed operation. The French examples in the TEI docs have this syntax:

<said>— Il fait beau,</said> dit Robert.

I like this syntax best. Even better when it has the @who attribute. There's the variant that moves the dash to a @rend attribute, but I agree that that might be going a little too far.

That's an interesting question as to whether quoted lines are the beginning of their own paragraphs. I'm not sure I know the answer. For typographical purposes, at least, I think we're fine leaving it as-is, since we can always have an XSLT rule that makes every line beginning with a quotation dash flush left. More philosophically, there are some paragraphs, around line 200 in Wandering Rocks, for instance, that end with colons, and are followed by quoted lines, which suggest some kind of paragraph-like continuity between the indented block and the quoted line. But that's about as much as I can come up with.

yellwork commented 7 years ago

Direct speech would typically be assigned its own <p>, nesting utterance for utterance. The ‘de-paragraphing of speech’, as he termed it then, was done by Hans Gabler last summer when he was originally preparing these project files for submission to the Oxford Text Archive. Though a 2016 innovation, it represents his editorial vision for Ulysses more finely than was possible in TUSTEP back in the seventies and early eighties. I’d be happy for ‘de-paragraphing’ to remain in the corpus, if we want to link our work to the spirit of that early digital edition, however revitalised and freshly bottled it is now.

Hans’s sense (he writes in an email to me) is that, somewhere between A Portrait and Ulysses, Joyce realised that any marking used to bracket speech – such as the opening, intermediary and final dashes of the Dubliners manuscript – created the illusion of spoken words as existing outside the narrative and, for Joyce, this impression did not square with his increasingly sharpened sense of narrative. Instead, he shifted to the opening dash only, placed moreover in the left margin, to signal speech as integral to the narrative. Hans writes:

If this is the conceptual and structural core of the matter, it highlights our dilemma in structuring the digital data for Ulysses. The pure and totally un-hierarchical string processing as it was only possible forty years ago in TUSTEP (and all other text data processing, no doubt, before SGML and XML) forced us to treat speech as separate paragraphs (since we needed the beginning with a new line). What we now can ‘de-paragraph’ is our data organisation of yore: that is, we can organise speech flush left in new lines, but within narrative paragraphs.

yellwork commented 7 years ago

Thanks for making that global change, Jonathan. I think the syntax of the French example (†) looks very neat with, preferably, a @who attribute and attribution eventually making it into the markup.

† Minus the space between the quotation dash and the first word of direct speech.

A colleague who’s into text mining suggested that nesting the quotation dash would simplify operations for his analysis purposes (so that a narrator’s “someone” and a spoken “―Someone” are not artificially distinguished), but maybe that’s something we just mention in the documentation (“Snip off quotation dashes”) rather than mark up explicitly throughout the corpus?
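For the documentation route, the "snip off quotation dashes" step might look something like this in Python. It is a sketch for text that has already been extracted from the XML; the function name is hypothetical, and it assumes the dash is always U+2015 with no following space.

```python
QUOTATION_DASH = "\u2015"  # U+2015


def snip_quotation_dashes(text):
    """Remove quotation dashes so spoken and narrated tokens compare equal."""
    return text.replace(QUOTATION_DASH, "")


if __name__ == "__main__":
    text = "Someone knocked.\n" + QUOTATION_DASH + "Someone is there."
    # A whitespace tokenizer sees "Someone" once before snipping, twice after.
    print(text.split().count("Someone"))
    print(snip_quotation_dashes(text).split().count("Someone"))
```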

That said, I’m in favour of retaining the said nesting around our quotation dashes – for now – at least until we can figure out a strategy for tackling the direct-speech encoding.

<lb n="010008"/><said>―</said>Come up, Kinch! Come up, you fearful jesuit!</p>

This example is easy but the task, more generally, might have to be crowdsourced. Although I wonder if we were to compile a dictionary of utterance markers (“said”; “cried”; “murmured” on p. 1 alone) would that help us to automatically detect the position of a closing <said> tag? How strong a general rule is it that:

A punctuation mark in close proximity to an utterance marker means a return to third-person narration. Any material following a full-stop in the third-person narration indicates resumed direct speech.
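One way to try out this rule is a small Python sketch like the one below. Everything here is an assumption for illustration: the marker list holds only the three verbs mentioned above, the input is a single utterance as plain text (no `<lb/>` tags) beginning with the quotation dash, and the function name `tag_utterance` is hypothetical. It is a heuristic and will misfire, for example when a character uses "said" inside their own speech, or when the narrator uses a verb not in the list.

```python
import re

DASH = "\u2015"
UTTERANCE_MARKERS = ("said", "cried", "murmured")  # tiny illustrative list

PATTERN = re.compile(
    "^" + DASH
    # opening speech, ending at a punctuation mark
    + r"(?P<speech>.*?[,!?.])"
    # narration: comma-free run up to an utterance marker, then to a full stop
    + r"(?P<narration> [^.!?,]*?\b(?:" + "|".join(UTTERANCE_MARKERS) + r")\b[^.!?]*(?:\.|$))"
    # anything after that full stop is treated as resumed speech
    + r"(?P<resumed>.*)$",
    re.S,
)


def tag_utterance(line):
    """Wrap the spoken parts of one dash-initial utterance in <said> tags."""
    m = PATTERN.match(line)
    if m is None:
        # No marker found: treat the whole line as speech.
        return "<said>" + line + "</said>"
    out = "<said>" + DASH + m["speech"] + "</said>" + m["narration"]
    if m["resumed"].strip():
        out += " <said>" + m["resumed"].strip() + "</said>"
    return out


if __name__ == "__main__":
    print(tag_utterance(DASH + "In the pink, Mr Bloom said gaily. Milly has a position."))
```

Any output from a rule like this would still need checking by a human reader.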

yellwork commented 7 years ago

Or, when there’s a cluster of <said>―</said> back and forth between speaking characters, would it be worthwhile just shifting all the </said> tags from after the quotation dash to the end of the line preceding the next <said>? That would give us the nesting for stretches of dialogue like the following:

<lb n="080202"/><said>―</said>O, Mr Bloom, how do you do?
<lb n="080203"/><said>―</said>O, how do you do, Mrs Breen?
<lb n="080204"/><said>―</said>No use complaining. How is Molly those times? Haven't seen her for
<lb n="080205"/>ages.
<lb n="080206"/><said>―</said>In the pink, Mr Bloom said gaily. Milly has a position down in Mullingar,
<lb n="080207"/>you know.
<lb n="080208"/><said>―</said>Go away! Isn't that grand for her?
<lb n="080209"/><said>―</said>Yes. In a photographer's there. Getting on like a house on fire. How are
<lb n="080210"/>all your charges?
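The shift described above could be sketched in Python along these lines. The name `extend_said` is hypothetical, and the pattern assumes each utterance runs until the next `<lb/>` that is immediately followed by `<said>`, or until the end of the paragraph or input. Embedded narration such as "Mr Bloom said gaily" ends up inside the `<said>` and would need a second pass.

```python
import re

DASH = "\u2015"

# The bare marker, then (lazily) everything up to the next utterance line,
# a closing </p>, or the end of the input.
BARE_MARKER = re.compile(
    "<said>" + DASH + r'</said>(.*?)(?=\n<lb n="\d+"/><said>|\s*</p>|\s*\Z)',
    re.S,
)


def extend_said(xml_text):
    """Move each </said> from after the dash to the end of its utterance."""
    return BARE_MARKER.sub(
        lambda m: "<said>" + DASH + m.group(1) + "</said>", xml_text
    )


if __name__ == "__main__":
    sample = "\n".join([
        '<lb n="080202"/><said>' + DASH + "</said>O, Mr Bloom, how do you do?",
        '<lb n="080203"/><said>' + DASH + "</said>O, how do you do, Mrs Breen?",
        '<lb n="080204"/><said>' + DASH + "</said>No use complaining. Haven't seen her for",
        '<lb n="080205"/>ages.',
    ])
    print(extend_said(sample))
```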

c-forster commented 7 years ago

The idea of compiling some list of utterance markers to automatically detect where to put said (or q) tags seems attractive to me, because marking speech seems very valuable, and the hassle of doing it manually serious.

It has made me sensitive to a nesting problem related to the question of de-paragraphing. Consider this example from "Scylla and Charybdis."

<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>--</said>The schoolmen were schoolboys first, Stephen said superpolitely.
<lb n="090057"/>Aristotle was once Plato's schoolboy.
<lb n="090058"/><said>--</said>And has remained so, one should hope, John Eglinton sedately said. One
<lb n="090059"/>can see him, a model schoolboy with his diploma under his arm.</p>

How to best encode this? "De-paragraphed" (if I am understanding it) would look like this.

<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>--The schoolmen were schoolboys first,</said> Stephen said superpolitely.
<said><lb n="090057"/>Aristotle was once Plato's schoolboy.</said>
<lb n="090058"/><said>--And has remained so, one should hope,</said> John Eglinton sedately said. One
<lb n="090059"/>can see him, a model schoolboy with his diploma under his arm.</said></p>

And that works--the narrative voice gets placed outside the saids, and all of it is a single p, which I think is entirely kosher with TEI rules.

If you tried to put ps inside the saids, what happens to the narrative voice? Does it get its own paragraph (that can't be right)? Am I missing something, or is "de-paragraphing" (which I otherwise think is a non-intuitive, even if justifiable choice) the best choice from a markup perspective?

yellwork commented 7 years ago

Hi Chris, Can you explain in a bit more detail what you mean? I get the example and it’s an interesting one – and, yes, that’s how I’d tag it (but for the extra opening <said> before Eglinton’s second utterance). Are you asking what are we doing with interior monologue?

The larger unit would look like this, if it’s any help:

<lb n="090046"/><said>―All these questions are purely academic,</said> Russell oracled out of his
<lb n="090047"/>shadow. <said>I mean, whether Hamlet is Shakespeare or James I or Essex.
[...]
<lb n="090053"/>ideas. All the rest is the speculation of schoolboys for schoolboys.</said></p>
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>―The schoolmen were schoolboys first,</said> Stephen said superpolitely.

c-forster commented 7 years ago

Apologies if I'm unclear. This isn't supposed to be an interesting example; I'm trying to go for utterly typical.

Above in this thread I had wondered about re-paragraphing the deparagraphed speech. But were one to do that how would you mark it up? I.e. if, as is typical, direct speech is given its own paragraph, how would that be marked up?

Same passage as before, now trying to provide each instance of direct speech its own p (with extra whitespace for readability); the result is invalid because I really don't know how the "Stephen said superpolitely" would be, or could be, marked up here.

<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!</p>

<lb n="090056"/><said><p>--The schoolmen were schoolboys first,</p></said> Stephen said superpolitely.

So how would one mark up the narrative voice Stephen said superpolitely under such a scheme?

There is no problem with the "de-paragraphed" speech, where multiple instances of speech can be wrapped within a single paragraph.

Perhaps the moral here is simply leave speech "de-paragraphed." But am I missing an obvious alternative?

yellwork commented 7 years ago

But what would be gained by giving each instance of direct speech its own <p> tagging? (When ideally it'd have a direct-speech-specific <said> nesting.) The markup in place before the corpus-wide de-paragraphing (with a more nuanced <said>) would have looked like this:

<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!</p>

<p><lb n="090056"/><said>--The schoolmen were schoolboys first,</said> Stephen said superpolitely.</p>

c-forster commented 7 years ago

OK; I'm not sure why I was getting confused... apologies for the derailment.

So I guess my question boils down to: preserve de-paragraphing or no? I think I prefer the original (un-deparagraphed) markup for speech, though my reasons are simply that it seems more consistent with other novels' (typographical) representation of speech. (And, I would argue, it is closer to what someone from outside this project coming to this markup would expect: principle of least surprise.)

If there is a decision on that, I'll try to tackle marking up speech in an episode or two. Given where people's efforts are right now, I think that would be a way to contribute without operating at cross purposes to you and Jonathan.

(I played around last night, and I think with some carefully designed regex, it will be possible to fix the markup of speech manually, but quickly.)

yellwork commented 7 years ago

Is this something that could be split into two separate tasks, Chris? Because while I’d love for the direct speech to be marked up – a huge task – I’d really rather the <p> paragraphing on every utterance wasn’t re-introduced right now. My sense is there’s plenty to be done with the XML that would benefit all projects before some of the more … debatable? changes were inflicted. At that point, all <lb><said>― triads could simply be replaced with:

</p>
<p><lb n=xxx/><said>―

I get your point about contributor / user expectations and about novelistic conventions, but Joyce is clearly doing something unusual with the convention for representing direct speech. Or at least that’s how Gabler saw it. His edition moved to flush left all instances of direct speech, a decision that departs from all (I think) previous editions of Ulysses and from the vast majority of novels, but which is consistent with / closer to the fair copies of the episodes. This isn’t the place for a discussion of early-twentieth-century norms and expectations for seeing a fair copy through the publication process because, at least for the moment, my sense is that we are working towards a TEI XML version of the Gabler Ulysses. So I’d like our encoding to be done in the spirit of that early digital edition, precisely because it means we now get to tackle aspects of the edition that couldn’t be realised in the seventies and eighties.

c-forster commented 7 years ago

Yes. Absolutely. I have no strong opinion here on how to nest paragraph and saids.

If I get a moment, I may try to tackle an episode--perhaps I'll try "Telemachus"--and share what I come up with via PR.

yellwork commented 7 years ago

Terrific! I’ll take a shot at “Nestor” then over the weekend. I’ll also tackle some of the @who attribution in the episode.

Where would be the best place to store a list of values/speaker names? A comment on this thread that we would just continually edit/augment as the list grew?

yellwork commented 7 years ago

I just did the <said> tagging for “Hades”. The episode has several instances of characters quoting the direct speech of others. For example, here’s Martin Cunningham and Mr Power quoting Tom Kernan:

—Immense, Martin Cunningham said pompously. His singing of that simple ballad, Martin, is the most trenchant rendering I ever heard in the whole course of my experience.
—Trenchant, Mr Power said laughing. He’s dead nuts on that. And the retrospective arrangement.
(U 6.146–150)

We had already marked up the quoted direct speech using a <said who="Tom Kernan"> tagging:

<lb n="060146"/><said>―</said>Immense, Martin Cunningham said pompously. <said who="Tom Kernan">His singing of that simple
<lb n="060147"/>ballad, Martin, is the most trenchant rendering I ever heard in the whole
<lb n="060148"/>course of my experience.</said>
<lb n="060149"/><said>―</said><said who="Tom Kernan">Trenchant</said>, Mr Power said laughing. He's dead nuts on that. And the
<lb n="060150"/><said who="Tom Kernan">retrospective arrangement</said>.

I’ve just redone the tagging for each speaker (Martin C. and Mr Power) and now we have this:

<lb n="060146"/><said>―Immense,</said> Martin Cunningham said pompously. <said><said who="Tom Kernan">His singing of that simple
<lb n="060147"/>ballad, Martin, is the most trenchant rendering I ever heard in the whole
<lb n="060148"/>course of my experience.</said></said>
<lb n="060149"/><said>―<said who="Tom Kernan">Trenchant</said>,</said> Mr Power said laughing. <said>He's dead nuts on that. And the
<lb n="060150"/><said who="Tom Kernan">retrospective arrangement</said>.</said>

In other words, direct speech quoted by another speaker (and italicized in the text) now appears inside a double <said> nesting. Does that sound reasonable? When we do the @who attribution of speech/<said> for the episode, that means we’ll have explicated when a named speaker reproduces the direct speech of a second named speaker. Is a <said who="X"> within a <said who="Y"> enough to capture this?

c-forster commented 7 years ago

Regarding direct speech quoted by another speaker; are you only marking it up when it is italicized? I think that might be defensible. Other cases, without a typographic marker, seem potentially open to differing interpretations of whether or not the character is being quoted.

yellwork commented 7 years ago

are you only marking it up when it is italicized?

Yes, I think so. We started tackling this phenomenon as part of the <emph> disambiguation. To go further – at least for now – would mean we could only be very selective.

yellwork commented 7 years ago

Following Chris’s lead, I’ve been shifting the closing </said> encoding in some of the shorter episodes. And adding intermedial tagging whereby

<lb n="050090"/><said>―</said>Is there any ... no trouble I hope? I see you're ...
<lb n="050091"/><said>―</said>O, no, Mr Bloom said. Poor Dignam, you know. The funeral is today.

becomes

<lb n="050090"/><said>―Is there any ... no trouble I hope? I see you're ...</said>
<lb n="050091"/><said>―O, no,</said> Mr Bloom said. <said>Poor Dignam, you know. The funeral is today.</said>

So far it’s been pretty straightforward if time consuming. I wonder is there any way to impose quality control on the results save by way of line-by-line or spot checking? Is it possible in GitHub for another user to revert to the earlier <said>―</said> encoding, mark up the dialogue, and then compare the results?
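Short of a second encoder repeating the work, one automated sanity check would be to confirm that the retagging changed only the `<said>` tags and nothing else. A rough Python sketch, with hypothetical function names, assuming the old and new versions of a passage are available as strings:

```python
import re

# Matches opening and closing <said> tags, with or without attributes.
SAID_TAG = re.compile(r"</?said\b[^>]*>")


def strip_said(xml_text):
    """Remove every <said ...> and </said> tag, leaving all else intact."""
    return SAID_TAG.sub("", xml_text)


def only_said_tags_changed(before, after):
    """True if the two versions are identical once <said> tags are removed."""
    return strip_said(before) == strip_said(after)


if __name__ == "__main__":
    dash = "\u2015"
    before = '<lb n="050091"/><said>' + dash + "</said>O, no, Mr Bloom said. Poor Dignam, you know."
    after = ('<lb n="050091"/><said>' + dash + "O, no,</said> Mr Bloom said. "
             "<said>Poor Dignam, you know.</said>")
    print(only_said_tags_changed(before, after))
```

This catches accidental edits to the text or to other tags, but it cannot tell whether a `</said>` was placed in the right spot; that still needs spot checking or a second pair of eyes.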

yellwork commented 7 years ago

I didn’t do any of the @who attribution in the few episodes I covered. (I just sped through each file, tweaking the <said> encoding.) I wonder, though, when the time is right if we can put a list of Ulysses’s (speaking?) characters in the header, assign them XML IDs, and then refer to these IDs in the text? Something like:

<lb n="050090"/><said who="#jb">―Is there any ... no trouble I hope? I see you're ...</said>
<lb n="050091"/><said who="#lb">―O, no,</said> Mr Bloom said. <said who="#lb">Poor Dignam, you know. The funeral is today.</said>

and in the header we would have something like:

<listPerson>
 <person xml:id="lb">
   <persName>Leopold Bloom</persName>
 </person>
 <person xml:id="jb">
   <persName>Josie Breen</persName>
 </person>
</listPerson>

And so on. Or is <listPerson> / <person> the wrong structure? <person> has some useful attributes – @role @sex @age – that would really enrich our data and give us interesting results about the frequency and extent of speaking parts.
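To give a feel for what the header-plus-ID arrangement would allow, here is a small Python sketch that resolves `@who` pointers against a `<listPerson>` and counts utterances per speaker. The sample document reuses the snippets above, simplified to a bare `<TEI>` wrapper without the TEI namespace, which a real corpus file would declare; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the built-in XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<TEI>
<listPerson>
 <person xml:id="lb"><persName>Leopold Bloom</persName></person>
 <person xml:id="jb"><persName>Josie Breen</persName></person>
</listPerson>
<p><lb n="050090"/><said who="#jb">\u2015Is there any ... no trouble I hope?</said>
<lb n="050091"/><said who="#lb">\u2015O, no,</said> Mr Bloom said. <said who="#lb">Poor Dignam, you know.</said></p>
</TEI>"""


def utterance_counts(tei_xml):
    """Count <said> elements per speaker, resolving #id pointers to names."""
    root = ET.fromstring(tei_xml)
    names = {
        "#" + person.get(XML_ID): person.findtext("persName")
        for person in root.iter("person")
    }
    counts = Counter()
    for said in root.iter("said"):
        who = said.get("who")
        # Fall back to the raw @who value if it has no header entry.
        counts[names.get(who, who)] += 1
    return counts


if __name__ == "__main__":
    for name, n in utterance_counts(SAMPLE).items():
        print(name, n)
```

The same lookup table could carry @sex, @age or @role once those are recorded on `<person>`.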

Q. What about ambiguity? It’s not always clear exactly who is speaking in scenes like the carriage-ride of “Hades.”
Q. Would this same @who attribution also work on the <speaker> tagging in “Circe”? (With a <castList>?) At least for “Circe” we should be able to automate some of the attribution, right?

JonathanReeve commented 7 years ago

Declaring characters in the header is a great idea. I love the idea of writing out all we can about them, too. That'll make it really easy to extract all the dialogue from female characters and from male characters, and to run analyses that look at patterns in their respective speech.

There's something to be said for waiting on the XML IDs, though, for the moment, and just using the full names, since it's more human readable. If we invite more contributions from elsewhere (especially from people without much XML experience), it could be useful to make these dialogue attributions clear.

As for ambiguity, the TEI docs have a great page about encoding certainty that could be helpful. I imagine that the most pertinent style is something like this:

I have a <emph xml:id="CE-P3">bun</emph>.

<certainty target="#CE-P3" locus="value" assertedValue="gun" degree="0.8">
 <desc>a gun makes more sense in a holdup</desc>
</certainty>

But in our case it'd be a @who attribute, so it'd be something like:

<lb n="060004"/><said xml:id="060004-a" who="Cunningham">―Come on, Simon.
<certainty target="#060004-a" match="@who" locus="value" assertedValue="Power" degree="0.5">
    <desc>It's unclear here whether it's Cunningham or Power speaking.</desc>
</certainty> 
</said>

This is nice, since it can keep track of marginal uncertainties (degree="0.9"), which could then be used to make "good enough" judgements in any resulting analyses. Of course, it's a lot of typing, so we could always keep doing what you're already doing, i.e. <said who="unclear: Cunningham or Power">, and later programmatically convert that to the more verbose syntax above. If we were to add in our certainty levels like <said who="unclear: Cunningham or Power; 0.8">, meaning 80% certainty for Cunningham, 20% for Power, then we'd be able to fully reproduce the verbose syntax.
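The later programmatic conversion might be sketched in Python as below. The shorthand format is the one suggested in this comment; the function name, the 0.5 default when no number is given, and the self-closing `<certainty/>` without a `<desc>` are assumptions made for the example.

```python
import re

# "unclear: A or B" with an optional "; 0.8" giving the certainty for A.
SHORTHAND = re.compile(
    r"^unclear: (?P<first>.+?) or (?P<second>.+?)(?:; (?P<degree>[01](?:\.\d+)?))?$"
)


def expand_unclear(said_id, who):
    """Turn a shorthand @who value into (who, <certainty/> string).

    Returns None when the value is not in the shorthand format.
    """
    m = SHORTHAND.match(who)
    if m is None:
        return None
    degree_first = float(m["degree"]) if m["degree"] else 0.5
    # The alternative speaker gets the remaining probability.
    degree_second = round(1 - degree_first, 2)
    certainty = (
        '<certainty target="#' + said_id + '" match="@who" locus="value" '
        'assertedValue="' + m["second"] + '" degree="' + str(degree_second) + '"/>'
    )
    return m["first"], certainty


if __name__ == "__main__":
    print(expand_unclear("060004-a", "unclear: Cunningham or Power; 0.8"))
```

This handles exactly two candidates; a value listing three or more speakers would need a different shorthand.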

However, if this is too complicated, we could also just use GitHub issues to track uncertainties, or put them all in a separate file, called edge-cases.md, for instance.

yellwork commented 7 years ago

Having character profiles sitting somewhere will make for very interesting analysis. Bloom’s direct speech in “Eumaeus,” for example, is so different from anything else he says in the novel. I’d love to see that tackled properly. I’d also be really interested just to see the raw balance of dialogue between Bloom and Stephen.

For now, though, let’s continue with character names as @who values then, for the reasons Jonathan suggests. They’re easier to keep track of for the encoder and will save us having to compile the header in parallel (at least as far as validation is concerned). Any slip-ups or mistakes will come out once we start the conversion to xml:ids.

I like that solution for speaker ambiguity, Jonathan. What happens when there are more than two potential values, can I ask? For example, I’ve @whoed the following unattributed exchange in “Hades” with a placeholder “unclear” value:

<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?</said>
<lb n="060117"/><said who="unclear">―We're stopped.</said>
<lb n="060118"/><said who="unclear">―Where are we?</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="Leopold Bloom">―The grand canal,</said> he said.</p>

If we presume Bloom says none of these unattributed utterances, how best would we capture the ambiguity of the whole exchange? For example, if Cunningham says “What's wrong?” he hardly answers himself with “We're stopped”. Could a tagging encompass the three lines U 6.116–118 and say this is a conversation between two, perhaps three, men from the group Cunningham, Power, and Dedalus? (I presume Bloom is not speaking here, but maybe that’s my subjective reading?)

For “Nestor”, I created a character called @who="unidentified student". I’m sure we’ll have to come up with other workarounds but we’ll catch them when we formalize the @who tagging down the line.

yellwork commented 7 years ago

I was just looking at that incredible moment in “Proteus” when Stephen contemplates visiting his Aunt Sara and dreams up an entire imaginary conversation between his uncle Richie Goulding and himself (with guest vocals from cousin Walter). Here’s some of it:

—It’s Stephen, sir.
—Let him in. Let Stephen in.
[...]
—Yes, sir?
—Malt for Richie and Stephen, tell mother. Where is she?
—Bathing Crissie, sir.
(U 3.72–87 etc.)

Despite the quotation dash and the multiple speaking parts, none of this conversation actually happens outside of Stephen’s mind. There’s something similar in “Lestrygonians” when Bloom imagines an exchange between a “[h]otblooded young student” and a maid-turned-informer:

—Are those yours, Mary?
—I don’t wear such things ..... Stop or I’ll tell the missus on you. Out half the night.
—There are great times coming, Mary. Wait till you see.
—Ah, gelong with your great times coming.
(U 8.451–455)

The <said> tagging is easy, and the @who attribution not much more demanding. I’m tempted to leave the encoding at that, but if others encounter similar imaginary conversations in the corpus that are represented with Joyce’s dialogue conventions, we might want to flag them in some way. A @type value? Or is there something more immediate to hand in the Guidelines?

Why it might be worth tackling/tagging this phenomenon is if we ever try to mark up the intrusion of other voices into interior monologue. For example, right before the imagined conversation in “Proteus,” Stephen thinks of “[m]y consubstantial father’s voice” and his interior monologue shifts into Simon Dedalus-ese, complete with Simon’s impersonations of his brother-in-law Richie and nephew Walter:

Did you see anything of your artist brother Stephen lately? No? Sure he’s not down in Strasburg terrace with his aunt Sally? Couldn’t he fly a bit higher than that, eh? And and and and tell us, Stephen, how is uncle Si? O, weeping God, the things I married into! De boys up in de hayloft. The drunken little costdrawer and his brother, the cornet player. Highly respectable gondoliers! And skeweyed Walter sirring his father, no less! Sir. Yes, sir. No, sir. Jesus wept: and no wonder, by Christ! (U 3.61–69; emphasis added)

If we ever do get round to marking up interior monologue, we’d want some way of distinguishing when other voices and other characters appear or are quoted/recalled.