So, I'm not sure if I see a big difference between <said> and <q>; I think replacing the double hyphen with a quotation bar makes sense.
My question is the following: why is speech kept within the preceding paragraph? For instance,
<p><lb n="100026"/>Father Conmee was very glad to see the wife of Mr David Sheehy
<lb n="100027"/>M. P. looking so well and he begged to be remembered to Mr David Sheehy
<lb n="100028"/>M. P. Yes, he would certainly call.
<lb n="100029"/><said>--</said>Good afternoon, Mrs Sheehy.</p>
I know that in the Gabler edition there is no indentation for quoted speech (though there is in the Random House text), but I think of that as a sort of print convention of spacing, not an indication of a paragraph. New speakers mean new paragraphs, or am I wrong? I would want the above example to look like this:
<p><lb n="100026"/>Father Conmee was very glad to see the wife of Mr David Sheehy
<lb n="100027"/>M. P. looking so well and he begged to be remembered to Mr David Sheehy
<lb n="100028"/>M. P. Yes, he would certainly call.</p>
<lb n="100029"/><p><said>―Good afternoon, Mrs Sheehy.</said></p>
In part I ask because, given the existing markup, it seems like you could programmatically replace a <said> within a <p> with a closing </p>, and then wrap everything from the <said> to the end of the following <p> with <p><said>. That is, doesn't the existing markup give enough of a clue to let regular expressions (or programmatic XML parsing) find the "real" structure underneath?
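A minimal sketch of the paragraph-splitting half of that idea, in Python. It assumes that every speech line inside a <p> starts on its own line with <lb n="..."/><said>, as in the sample above; the function name `reparagraph` is made up for illustration. It only closes and reopens the paragraph, it does not move the closing </said>, and any narration that resumes after a speech line without its own marker would end up in the speech paragraph.

```python
import re

# Assumption: a speech line looks like  \n<lb n="100029"/><said>...
# A <p> that *begins* with speech has no preceding newline and is left alone.
SPEECH_START = re.compile(r'\n(<lb n="\d+"/><said\b)')


def reparagraph(xml: str) -> str:
    """Close the current <p> before each speech line and open a new one."""
    return SPEECH_START.sub(r'</p>\n<p>\1', xml)


sample = (
    '<p><lb n="100028"/>M. P. Yes, he would certainly call.\n'
    '<lb n="100029"/><said>--</said>Good afternoon, Mrs Sheehy.</p>'
)
print(reparagraph(sample))
```

A regex is enough here only because the corpus keeps one <lb/> per physical line; a parser-based version would be safer if that ever stops being true.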
So, two questions I guess:
<said> (or replace them with <q>) to correctly tag the spoken text?
I'm in favor of converting the double hyphens to quotation dashes. I'll go ahead and do that, since that should be an easy sed operation. The French examples in the TEI docs have this syntax:
<said>— Il fait beau,</said> dit Robert.
I like this syntax best. Even better when it has the @who attribute. There's the variant that moves the dash to a @rend attribute, but I agree that that might be going a little too far.
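For reference, the double-hyphen conversion can be sketched in a few lines of Python (a stand-in for the sed one-liner, not the command that was actually run). It assumes the dash always sits directly after an opening <said>, so double hyphens elsewhere in the text are left untouched; `replace_dialogue_dashes` is an illustrative name.

```python
import re

QUOTATION_DASH = "\u2015"  # HORIZONTAL BAR, the quotation dash

# Assumption: the dialogue dash is always the first thing inside <said>.
DOUBLE_HYPHEN = re.compile(r'(<said\b[^>]*>)--')


def replace_dialogue_dashes(xml: str) -> str:
    """Swap a leading double hyphen inside <said> for U+2015."""
    return DOUBLE_HYPHEN.sub(r'\1' + QUOTATION_DASH, xml)


line = '<lb n="100029"/><said>--</said>Good afternoon, Mrs Sheehy.</p>'
print(replace_dialogue_dashes(line))
```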
That's an interesting question as to whether quoted lines are the beginning of their own paragraphs. I'm not sure I know the answer. For typographical purposes, at least, I think we're fine leaving it as-is, since we can always have an XSLT rule that makes every line beginning with a quotation dash flush left. More philosophically, there are some paragraphs, around line 200 in Wandering Rocks, for instance, that end with colons and are followed by quoted lines, which suggests some kind of paragraph-like continuity between the indented block and the quoted line. But that's about as much as I can come up with.
Direct speech would typically be assigned its own <p>, nesting utterance for utterance. The ‘de-paragraphing of speech’, as he termed it then, was done by Hans Gabler last summer when he was originally preparing these project files for submission to the Oxford Text Archive. Though a 2016 innovation, it represents his editorial vision for Ulysses more finely than was possible in TUSTEP back in the seventies and early eighties. I’d be happy for ‘de-paragraphing’ to remain in the corpus, if we want to link our work to the spirit of that early digital edition, however revitalised and freshly bottled it is now.
Hans’s sense (he writes in an email to me) is that, somewhere between A Portrait and Ulysses, Joyce realised that any marking used to bracket speech – such as the opening, intermediary and final dashes of the Dubliners manuscript – created the illusion of spoken words as existing outside the narrative and, for Joyce, this impression did not square with his increasingly sharpened sense of narrative. Instead, he shifted to the opening dash only, placed moreover in the left margin, to signal speech as integral to the narrative. Hans writes:
If this is the conceptual and structural core of the matter, it highlights our dilemma in structuring the digital data for Ulysses. The pure and totally un-hierarchical string processing as it was only possible forty years ago in TUSTEP (and all other text data processing, no doubt, before SGML and XML) forced us to treat speech as separate paragraphs (since we needed the beginning with a new line). What we now can ‘de-paragraph’ is our data organisation of yore: that is, we can organise speech flush left in new lines, but within narrative paragraphs.
Thanks for making that global change, Jonathan. I think the syntax of the French example (†) looks very neat with, preferably, a @who attribute and attribution eventually making it into the markup.
† Minus the space between the quotation dash and the first word of direct speech.
A colleague who’s into text mining suggested that nesting the quotation dash would simplify operations for his analysis purposes (so that a narrator’s “someone” and a spoken “―Someone” are not artificially distinguished), but maybe that’s something we just mention in the documentation (“Snip off quotation dashes”) rather than mark up explicitly throughout the corpus?
That said, I’m in favour of retaining the <said> nesting around our quotation dashes – for now – at least until we can figure out a strategy for tackling the direct-speech encoding.
<lb n="010008"/><said>―</said>Come up, Kinch! Come up, you fearful jesuit!</p>
This example is easy but the task, more generally, might have to be crowdsourced. Although I wonder: if we were to compile a dictionary of utterance markers (“said”; “cried”; “murmured” on p. 1 alone), would that help us to automatically detect the position of a closing </said> tag? How strong a general rule is it that:
A punctuation mark in close proximity to an utterance marker means a return to third-person narration. Any material following a full-stop in the third-person narration indicates resumed direct speech.
Or, when there’s a cluster of <said>―</said> back and forth between speaking characters, would it be worthwhile just shifting all the </said> tags from after the quotation dash to the end of the line preceding the next <said>? That would give us the nesting for stretches of dialogue like the following:
<lb n="080202"/><said>―</said>O, Mr Bloom, how do you do?
<lb n="080203"/><said>―</said>O, how do you do, Mrs Breen?
<lb n="080204"/><said>―</said>No use complaining. How is Molly those times? Haven't seen her for
<lb n="080205"/>ages.
<lb n="080206"/><said>―</said>In the pink, Mr Bloom said gaily. Milly has a position down in Mullingar,
<lb n="080207"/>you know.
<lb n="080208"/><said>―</said>Go away! Isn't that grand for her?
<lb n="080209"/><said>―</said>Yes. In a photographer's there. Getting on like a house on fire. How are
<lb n="080210"/>all your charges?
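The tag-shifting proposal can be sketched as a single regex substitution in Python. This is only an illustration under two assumptions: the inherited element is exactly <said>―</said> (or the older <said>--</said>), and an utterance runs until the next line that opens with <lb n="..."/><said>, a closing </p>, or the end of the input. `extend_said` is a made-up name. Note that it wraps reporting clauses such as "Mr Bloom said gaily" inside the element too, so lines like 080206 would still need a second pass.

```python
import re

DASH_ONLY = re.compile(
    r'<said>(--|―)</said>'   # the inherited dash-only element
    r'(.*?)'                 # the utterance, possibly spanning several <lb/>s
    r'(?=\s*(?:<lb n="\d+"/><said>|</p>|\Z))',
    re.DOTALL,
)


def extend_said(xml: str) -> str:
    """Move each </said> from after the dash to the end of the utterance."""
    return DASH_ONLY.sub(r'<said>\1\2</said>', xml)


sample = (
    '<lb n="080204"/><said>―</said>No use complaining. How is Molly those '
    "times? Haven't seen her for\n"
    '<lb n="080205"/>ages.\n'
    '<lb n="080208"/><said>―</said>Go away! Isn\'t that grand for her?'
)
print(extend_said(sample))
```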
The idea of compiling some list of utterance markers to automatically detect where to put <said> (or <q>) tags seems attractive to me, because marking speech seems very valuable, and the hassle of doing it manually serious.
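Here is a rough Python sketch of the rule floated earlier in the thread (punctuation just before an utterance marker ends the speech; the next full stop ends the narration). Everything in it is an assumption for illustration: the marker list holds only the three verbs mentioned above, and the function name `split_utterance` is invented. It will misfire when a marker verb occurs inside the speech itself, or when the narration contains an abbreviation with a full stop ("M. P."), so the output would need checking by hand.

```python
import re

# Hypothetical starter list; the thread names only these three markers.
MARKERS = ("said", "cried", "murmured")
MARKER_RE = re.compile(r'\b(?:' + '|'.join(MARKERS) + r')\b')


def split_utterance(text: str) -> list[tuple[str, str]]:
    """Split the text following a quotation dash into (kind, span) pairs."""
    m = MARKER_RE.search(text)
    if m is None:
        return [("speech", text)]
    # Speech ends at the last comma, ? or ! before the utterance marker.
    cut = max(text.rfind(p, 0, m.start()) for p in ",?!")
    if cut == -1:
        return [("speech", text)]
    # Narration runs up to and including the first full stop after the marker.
    stop = text.find(".", m.end())
    end = len(text) if stop == -1 else stop + 1
    parts = [("speech", text[:cut + 1]),
             ("narration", text[cut + 1:end].strip())]
    if text[end:].strip():
        parts.append(("speech", text[end:].strip()))
    return parts


print(split_utterance(
    "O, no, Mr Bloom said. Poor Dignam, you know. The funeral is today."
))
```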
It has made me sensitive to a nesting problem related to the question of de-paragraphing. Consider this example from "Scylla and Charybdis."
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>--</said>The schoolmen were schoolboys first, Stephen said superpolitely.
<lb n="090057"/>Aristotle was once Plato's schoolboy.
<lb n="090058"/><said>--</said>And has remained so, one should hope, John Eglinton sedately said. One
<lb n="090059"/>can see him, a model schoolboy with his diploma under his arm.</p>
How to best encode this? "De-paragraphed" (if I am understanding it) would look like this.
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>--The schoolmen were schoolboys first,</said> Stephen said superpolitely.
<said><lb n="090057"/>Aristotle was once Plato's schoolboy.</said>
<lb n="090058"/><said>--And has remained so, one should hope,</said> John Eglinton sedately said. One
<lb n="090059"/>can see him, a model schoolboy with his diploma under his arm.</said></p>
And that works--the narrative voice gets placed outside the <said>s, and all of it is a single <p>, which I think is entirely kosher with TEI rules.
If you tried to put <p>s inside the <said>s, what happens to the narrative voice? Does it get its own paragraph (that can't be right)? Am I missing something, or is "de-paragraphing" (which I otherwise think is a non-intuitive, even if justifiable choice) the best choice from a markup perspective?
Hi Chris,
Can you explain in a bit more detail what you mean? I get the example and it’s an interesting one – and, yes, that’s how I’d tag it (but for the extra opening <said> before Eglinton’s second utterance). Are you asking what we are doing with interior monologue?
The larger unit would look like this, if it’s any help:
<lb n="090046"/><said>―All these questions are purely academic,</said> Russell oracled out of his
<lb n="090047"/>shadow. <said>I mean, whether Hamlet is Shakespeare or James I or Essex.
[...]
<lb n="090053"/>ideas. All the rest is the speculation of schoolboys for schoolboys.</said></p>
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!
<lb n="090056"/><said>―The schoolmen were schoolboys first,</said> Stephen said superpolitely.
Apologies if I'm unclear. This isn't supposed to be an interesting example; I'm trying to go for utterly typical.
Above in this thread I had wondered about re-paragraphing the deparagraphed speech. But were one to do that how would you mark it up? I.e. if, as is typical, direct speech is given its own paragraph, how would that be marked up?
Same passage as before, now trying to provide each instance of direct speech its own <p> (with extra whitespace for readability); the result is invalid because I really don't know how the "Stephen said superpolitely" would be, or could be, marked up here.
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!</p>
<lb n="090056"/><said><p>--The schoolmen were schoolboys first,</p></said> Stephen said superpolitely.
So how would one mark up the narrative voice "Stephen said superpolitely" under such a scheme?
There is no problem with the "de-paragraphed" speech, where multiple instances of speech can be wrapped within a single paragraph.
Perhaps the moral here is simply leave speech "de-paragraphed." But am I missing an obvious alternative?
But what would be gained by giving each instance of direct speech its own <p> tagging? (When ideally it'd have a direct-speech-specific <said> nesting.)
The markup in place before the corpus-wide de-paragraphing (with a more nuanced <said>) would have looked like this:
<p><lb n="090054"/>A. E. has been telling some yankee interviewer. Wall, tarnation strike
<lb n="090055"/>me!</p>
<p><lb n="090056"/><said>--The schoolmen were schoolboys first,</said> Stephen said superpolitely.</p>
OK; I'm not sure why I was getting confused... apologies for the derailment.
So I guess my question boils down to: preserve de-paragraphing or no? I think I prefer the original (un-deparagraphed) markup for speech, though my reason is simply that it seems more consistent with other novels' (typographical) representation of speech. (And, I would argue, it is closer to what someone from outside this project coming to this markup would expect: principle of least surprise.)
If there is a decision on that, I'll try to tackle marking up speech in an episode or two. Given where people's efforts are right now, I think that would be a way to contribute without operating at cross purposes to you and Jonathan.
(I played around last night, and I think with some carefully designed regex, it will be possible to fix the markup of speech manually, but quickly.)
Is this something that could be split into two separate tasks, Chris? Because while I’d love for the direct speech to be marked up – a huge task – I’d really rather the <p> paragraphing on every utterance wasn’t re-introduced right now. My sense is there’s plenty to be done with the XML that would benefit all projects before some of the more … debatable? changes were inflicted. At that point, all <lb><said>― triads could simply be replaced with:
</p>
<p><lb n=xxx/><said>―
I get your point about contributor / user expectations and about novelistic conventions, but Joyce is clearly doing something unusual with the convention for representing direct speech. Or at least that’s how Gabler saw it. His edition moved to flush left all instances of direct speech, a decision that departs from all (I think) previous editions of Ulysses and from the vast majority of novels, but which is consistent with / closer to the fair copies of the episodes. This isn’t the place for a discussion of early-twentieth-century norms and expectations for seeing a fair copy through the publication process because, at least for the moment, my sense is that we are working towards a TEI XML version of the Gabler Ulysses. So I’d like our encoding to be done in the spirit of that early digital edition, precisely because it means we now get to tackle aspects of the edition that couldn’t be realised in the seventies and eighties.
Yes. Absolutely. I have no strong opinion here on how to nest <p>s and <said>s.
If I get a moment, I may try to tackle an episode--perhaps I'll try "Telemachus"--and share what I come up with via PR.
Terrific! I’ll take a shot at “Nestor” then over the weekend. I’ll also tackle some of the @who attribution in the episode.
Where would be the best place to store a list of values/speaker names? A comment on this thread that we would just continually edit/augment as the list grew?
I just did the <said> tagging for “Hades”. The episode has several instances of characters quoting the direct speech of others. For example, here’s Martin Cunningham and Mr Power quoting Tom Kernan:
—Immense, Martin Cunningham said pompously. His singing of that simple ballad, Martin, is the most trenchant rendering I ever heard in the whole course of my experience. —Trenchant, Mr Power said laughing. He’s dead nuts on that. And the retrospective arrangement. (U 6.146–150)
We had already marked up the quoted direct speech using a <said who="Tom Kernan"> tagging:
<lb n="060146"/><said>―</said>Immense, Martin Cunningham said pompously. <said who="Tom Kernan">His singing of that simple
<lb n="060147"/>ballad, Martin, is the most trenchant rendering I ever heard in the whole
<lb n="060148"/>course of my experience.</said>
<lb n="060149"/><said>―</said><said who="Tom Kernan">Trenchant</said>, Mr Power said laughing. He's dead nuts on that. And the
<lb n="060150"/><said who="Tom Kernan">retrospective arrangement</said>.
I’ve just redone the tagging for each speaker (Martin C. and Mr Power) and now we have this:
<lb n="060146"/><said>―Immense,</said> Martin Cunningham said pompously. <said><said who="Tom Kernan">His singing of that simple
<lb n="060147"/>ballad, Martin, is the most trenchant rendering I ever heard in the whole
<lb n="060148"/>course of my experience.</said></said>
<lb n="060149"/><said>―<said who="Tom Kernan">Trenchant</said>,</said> Mr Power said laughing. <said>He's dead nuts on that. And the
<lb n="060150"/><said who="Tom Kernan">retrospective arrangement</said>.</said>
In other words, direct speech quoted by another speaker (and italicized in the text) now appears inside a double <said> nesting. Does that sound reasonable? When we do the @who attribution of speech/<said> for the episode, that means we’ll have explicated when a named speaker reproduces the direct speech of a second named speaker. Is a <said who="X"> within a <said who="Y"> enough to capture this?
Regarding direct speech quoted by another speaker; are you only marking it up when it is italicized? I think that might be defensible. Other cases, without a typographic marker, seem potentially open to differing interpretations of whether or not the character is being quoted.
are you only marking it up when it is italicized?
Yes, I think so. We started tackling this phenomenon as part of the <emph> disambiguation. To go further – at least for now – would mean we could only be very selective.
Following Chris’s lead, I’ve been shifting the closing </said> encoding in some of the shorter episodes. And adding intermedial tagging whereby
<lb n="050090"/><said>―</said>Is there any ... no trouble I hope? I see you're ...
<lb n="050091"/><said>―</said>O, no, Mr Bloom said. Poor Dignam, you know. The funeral is today.
becomes
<lb n="050090"/><said>―Is there any ... no trouble I hope? I see you're ...</said>
<lb n="050091"/><said>―O, no,</said> Mr Bloom said. <said>Poor Dignam, you know. The funeral is today.</said>
So far it’s been pretty straightforward, if time-consuming. I wonder, is there any way to impose quality control on the results other than line-by-line or spot checking? Is it possible in GitHub for another user to revert to the earlier <said>―</said> encoding, mark up the dialogue, and then compare the results?
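One cheap automated check, offered as a suggestion rather than anything the project already does: strip every <said> tag from the old and the new version of a file and confirm that what remains is identical. That only proves the re-tagging did not accidentally alter the text or the other markup; it says nothing about whether the tags are in the right place, so it complements spot checking rather than replacing it. The function names are illustrative.

```python
import re

# Matches opening and closing <said> tags, with or without attributes.
SAID_TAG = re.compile(r'</?said\b[^>]*>')


def strip_said(xml: str) -> str:
    """Remove <said ...> and </said> tags, keeping their content."""
    return SAID_TAG.sub('', xml)


def same_text(before: str, after: str) -> bool:
    """True if the two versions differ only in their <said> tagging."""
    return strip_said(before) == strip_said(after)


old = '<lb n="050091"/><said>―</said>O, no, Mr Bloom said. Poor Dignam, you know.'
new = ('<lb n="050091"/><said>―O, no,</said> Mr Bloom said. '
       '<said>Poor Dignam, you know.</said>')
print(same_text(old, new))
```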
I didn’t do any of the @who attribution in the few episodes I covered. (I just sped through each file, tweaking the <said> encoding.) I wonder, though, when the time is right, if we can put a list of Ulysses’s (speaking?) characters in the header, assign them XML IDs, and then refer to these IDs in the text? Something like:
<lb n="050090"/><said who="#jb">―Is there any ... no trouble I hope? I see you're ...</said>
<lb n="050091"/><said who="#lb">―O, no,</said> Mr Bloom said. <said who="#lb">Poor Dignam, you know. The funeral is today.</said>
and in the header we would have something like:
<listPerson>
<person xml:id="lb">
<persName>Leopold Bloom</persName>
</person>
<person xml:id="jb">
<persName>Josie Breen</persName>
</person>
</listPerson>
And so on. Or is <listPerson> / <person> the wrong structure? <person> has some useful attributes – @role, @sex, @age – that would really enrich our data and give us interesting results about the frequency and extent of speaking parts.
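To give a feel for what the ID scheme buys us, here is a small Python sketch that resolves @who pointers against a <listPerson> and counts utterances per speaker. It reuses the sample values from above and makes some simplifying assumptions: the snippet has no TEI namespace (real files declare http://www.tei-c.org/ns/1.0, so the element names would need qualifying), and the wrapper element and function name are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the built-in XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<TEI>
  <listPerson>
    <person xml:id="lb"><persName>Leopold Bloom</persName></person>
    <person xml:id="jb"><persName>Josie Breen</persName></person>
  </listPerson>
  <p><said who="#jb">―Is there any ... no trouble I hope?</said>
  <said who="#lb">―O, no,</said> Mr Bloom said.
  <said who="#lb">Poor Dignam, you know.</said></p>
</TEI>"""


def utterances_per_speaker(xml_text: str) -> dict[str, int]:
    """Count <said> elements per speaker, resolving #id pointers to names."""
    root = ET.fromstring(xml_text)
    names = {p.get(XML_ID): p.findtext("persName") for p in root.iter("person")}
    counts = Counter()
    for said in root.iter("said"):
        who = said.get("who", "")
        # Fall back to the raw @who value if it is not a known pointer.
        counts[names.get(who.lstrip("#"), who)] += 1
    return dict(counts)


print(utterances_per_speaker(SAMPLE))
```

Because unknown values fall through unchanged, the same function would also work on files that still carry full names in @who.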
Q. What about ambiguity? It’s not always clear exactly who is speaking in scenes like the carriage-ride of “Hades.”
Q. Would this same @who attribution also work on the <speaker> tagging in “Circe”? (With a <castList>?) At least for “Circe” we should be able to automate some of the attribution, right?
Declaring characters in the header is a great idea. I love the idea of writing out all we can about them, too. That'll make it really easy to extract all the dialogue from female characters and from male characters, and to run analyses that look at patterns in their respective speech.
There's something to be said for waiting on the XML IDs, though, for the moment, and just using the full names, since it's more human readable. If we invite more contributions from elsewhere (especially from people without much XML experience), it could be useful to make these dialogue attributions clear.
As for ambiguity, the TEI docs have a great page about encoding certainty that could be helpful. I imagine that the most pertinent style is something like this:
I have a <emph xml:id="CE-P3">bun</emph>.
<certainty target="#CE-P3" locus="value" assertedValue="gun" degree="0.8">
<desc>a gun makes more sense in a holdup</desc>
</certainty>
But in our case it'd be a @who attribute, so it'd be something like:
<lb n="060004"/><said xml:id="060004-a" who="Cunningham">―Come on, Simon.
<certainty target="#060004-a" match="@who" locus="value" assertedValue="Power" degree="0.5">
<desc>It's unclear here whether it's Cunningham or Power speaking.</desc>
</certainty>
</said>
This is nice, since it can keep track of marginal uncertainties (degree="0.9"), which could then be used to make "good enough" judgements in any resulting analyses. Of course, it's a lot of typing, so we could always keep doing what you're already doing, i.e. <said who="unclear: Cunningham or Power">, and later programmatically convert that to the more verbose syntax above. If we were to add in our certainty levels like <said who="unclear: Cunningham or Power; 0.8">, meaning 80% certainty for Cunningham, 20% for Power, then we'd be able to fully reproduce the verbose syntax.
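A sketch of what that later conversion might look like in Python, assuming the shorthand always has the shape "unclear: A or B" with an optional "; degree" for A. The function name and return shape are invented for illustration, and the <desc> text is left out because the shorthand does not carry one.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed shorthand: "unclear: Cunningham or Power" or "...; 0.8"
SHORTHAND = re.compile(
    r'^unclear: (?P<first>.+?) or (?P<second>.+?)(?:; (?P<degree>[0-9.]+))?$'
)


def expand_who(said_id: str, who: str) -> tuple[str, str]:
    """Return (new @who value, <certainty/> element or '' if not shorthand)."""
    m = SHORTHAND.match(who)
    if m is None:
        return who, ""
    # No explicit degree means an even split between the two candidates.
    primary = float(m["degree"]) if m["degree"] else 0.5
    alt_degree = round(1 - primary, 2)
    certainty = (
        f'<certainty target="#{said_id}" match="@who" locus="value" '
        f'assertedValue={quoteattr(m["second"])} degree="{alt_degree}"/>'
    )
    return m["first"], certainty


print(expand_who("060004-a", "unclear: Cunningham or Power; 0.8"))
```

The stated degree belongs to the first name, so the generated <certainty> records the remainder for the alternative, matching the 80/20 reading described above.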
However, if this is too complicated, we could also just use GitHub issues to track uncertainties, or put them all in a separate file, called edge-cases.md, for instance.
Having character profiles sitting somewhere will make for very interesting analysis. Bloom’s direct speech in “Eumaeus,” for example, is so different from anything else he says in the novel. I’d love to see that tackled properly. I’d also be really interested just to see the raw balance of dialogue between Bloom and Stephen.
For now, though, let’s continue with character names as @who values then, for the reasons Jonathan suggests. They’re easier to keep track of for the encoder and will save us having to compile the header in parallel (at least as far as validation is concerned). Any slip-ups or mistakes will come out once we start the conversion to xml:ids.
I like that solution for speaker ambiguity, Jonathan. What happens when there are more than two potential values, can I ask? For example, I’ve @who-ed the following unattributed exchange in “Hades” with a placeholder “unclear” value:
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?</said>
<lb n="060117"/><said who="unclear">―We're stopped.</said>
<lb n="060118"/><said who="unclear">―Where are we?</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="Leopold Bloom">―The grand canal,</said> he said.</p>
If we presume Bloom says none of these unattributed utterances, how best would we capture the ambiguity of the whole exchange? For example, if Cunningham says “What's wrong?” he hardly answers himself with “We're stopped”. Could a
For “Nestor”, I created a character called @who="unidentified student". I’m sure we’ll have to come up with other workarounds but we’ll catch them when we formalize the @who tagging down the line.
I was just looking at that incredible moment in “Proteus” when Stephen contemplates visiting his Aunt Sara and dreams up an entire imaginary conversation between his uncle Richie Goulding and himself (with guest vocals from cousin Walter). Here’s some of it:
—It’s Stephen, sir. —Let him in. Let Stephen in. [...] —Yes, sir? —Malt for Richie and Stephen, tell mother. Where is she? —Bathing Crissie, sir. (U 3.72–87 etc.)
Despite the quotation dash and the multiple speaking parts, none of this conversation actually happens outside of Stephen’s mind. There’s something similar in “Lestrygonians” when Bloom imagines an exchange between a “[h]otblooded young student” and a maid-turned-informer:
—Are those yours, Mary? —I don’t wear such things ..... Stop or I’ll tell the missus on you. Out half the night. —There are great times coming, Mary. Wait till you see. —Ah, gelong with your great times coming. (U 8.451–455)
The <said> tagging is easy, and the @who attribution not much more demanding. I’m tempted to leave the encoding at that, but if others encounter similar imaginary conversations in the corpus that are represented with Joyce’s dialogue conventions, we might want to flag them in some way. A @type value? Or is there something more immediate to hand in the Guidelines?
Why it might be worth tackling/tagging this phenomenon is if we ever try to mark up the intrusion of other voices into interior monologue. For example, right before the imagined conversation in “Proteus,” Stephen thinks of “[m]y consubstantial father’s voice” and his interior monologue shifts into Simon Dedalus-ese, complete with Simon’s impersonations of his brother-in-law Richie and nephew Walter:
Did you see anything of your artist brother Stephen lately? No? Sure he’s not down in Strasburg terrace with his aunt Sally? Couldn’t he fly a bit higher than that, eh? And and and and tell us, Stephen, how is uncle Si? O, weeping God, the things I married into! De boys up in de hayloft. The drunken little costdrawer and his brother, the cornet player. Highly respectable gondoliers! And skeweyed Walter sirring his father, no less! Sir. Yes, sir. No, sir. Jesus wept: and no wonder, by Christ! (U 3.61–69; emphasis added)
If we ever do get round to marking up interior monologue, we’d want some way of distinguishing when other voices and other characters appear or are quoted/recalled.
We’ve inherited the following tagging convention for Joyce’s dialogue markers throughout the corpus (episodes 15, 17, and 18 excepted):
<said> is just tag abuse here. Eventually it will be used to tag the direct speech, but that’s likely a task for the crowd.
I propose the non-controversial (?) global changes of:
<said> nesting to be replaced with <q> nesting
So:
The double hyphen currently in use for the dialogue dash is probably a legacy from a time when the character palette was considerably smaller. But it has no place in the corpus now, I don’t think. Instead we should make a global replace with the quotation dash or horizontal bar (Unicode U+2015 or HTML ―).
So neither the hyphen (-), the en dash (–), nor the em dash/tiret (—) but the quotation dash (―).
Q. Will all platforms support the quotation bar? Markdown XML in Chrome has them looking like en dashes. (◔_◔)
Questions and refinements (controversy?) that occur to me: Do we want to encode the dialogue dash at all?
(I don’t know if <q> can be an empty element.) Or, when and if we have the direct speech marked up, we might want to omit both the hard-coded dialogue dash and the <q> tagging: