GoogleCodeExporter opened this issue 9 years ago
The DH community mostly uses TEI elements to encode text, so maybe we should choose
types in accordance with what TEI offers, e.g.
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TS.html? Just an idea.
Original comment by daxenber...@gmail.com
on 11 Mar 2015 at 8:12
There are also some new specifications being worked on for TEI and text:
http://www.tei-c.org/Activities/Council/Working/tcw25.xml
For me the main question is how many of these ideas we currently need to
represent. TEI tends to be very complex, and we probably do not need all of it
right now.
Original comment by richard.eckart
on 11 Mar 2015 at 8:21
Using TEI naming and semantics is probably a good idea, but I agree with
Richard that we should not introduce everything.
Our policy has always been to only introduce types for which we already have a
use case :)
Original comment by torsten....@gmail.com
on 11 Mar 2015 at 8:23
We should not mix quoted speech (for which we have a new annotator in CoreNLP)
and transcribed speech (for which we do not yet have annotators).
These are also treated differently in TEI; see e.g. the example of quoted
speech given here:
http://www.tei-c.org/release/doc/tei-p5-doc/de/html/SA.html
Zui-Gan called out to himself every day, ‘Master.’
Then he answered himself, ‘Yes, sir.’
And then he added, ‘Become sober.’
Again he answered, ‘Yes, sir.’
‘And after that,’ he continued, ‘do not be deceived by others.’
‘Yes, sir; yes, sir,’ he replied.
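For illustration, this quoted speech could be encoded in TEI roughly as follows (a sketch only; the speaker identifier is invented, and TEI also offers the more specific "said" element for speech in narrative):

```xml
<p>Zui-Gan called out to himself every day,
  <q who="#zuigan">Master.</q>
  Then he answered himself,
  <q who="#zuigan">Yes, sir.</q>
  And then he added,
  <q who="#zuigan">Become sober.</q>
</p>
```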
see also: Core Tags for Drama
http://www.tei-c.org/release/doc/tei-p5-doc/de/html/CO.html#CODR
and Performance Texts
http://www.tei-c.org/release/doc/tei-p5-doc/de/html/DR.html#DRPAL
Transcribed speech requires different annotation types that are not relevant
for quoted speech; see Transcriptions of Speech
http://www.tei-c.org/release/doc/tei-p5-doc/de/html/TS.html which Johannes
mentioned earlier
See also:
http://www.tei-c.org/release/doc/tei-p5-doc/de/html/DR.html#DRPAL:
8 Transcriptions of Speech. These would be appropriate for encodings the focus of which is on the actual performance of a text rather than its structure or formal properties. The module described in that chapter includes a large number of other detailed proposals for the encoding of such features as voice quality, prosody, etc., which might be relevant to such a treatment of performance texts.
Judith
Original comment by eckle.kohler
on 11 Mar 2015 at 8:45
Fully agree. Besides the obvious (speaker etc.), TEI has:
- u (utterance): contains a stretch of speech usually preceded and followed by
silence or by a change of speaker.
- pause: marks a pause either between or within utterances.
- vocal: marks any vocalized but not necessarily lexical phenomenon, for
example voiced pauses, non-lexical backchannels, etc.
- kinesic: marks any communicative phenomenon, not necessarily vocalized, for
example a gesture, frown, etc.
- incident: marks any phenomenon or occurrence, not necessarily vocalized or
communicative, for example incidental noises or other events affecting
communication.
- writing: contains a passage of written text revealed to participants in the
course of a spoken text.
- shift: marks the point at which some paralinguistic feature of a series of
utterances by any one speaker changes.
Maybe we should condense this to 3-4 elements, e.g. the first 4?
Original comment by daxenber...@gmail.com
on 11 Mar 2015 at 8:48
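To illustrate how these elements interact, a transcription fragment might look roughly like this (a sketch with invented content and speaker identifiers):

```xml
<u who="#spk1">so <pause/> have you read the new proposal</u>
<vocal who="#spk2"><desc>laughs</desc></vocal>
<u who="#spk2">yes <pause dur="PT2S"/> most of it</u>
<kinesic who="#spk1"><desc>nods</desc></kinesic>
```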
Judith is right, we should carefully consider how to separate transcribed
speech (which I had in mind, and which is a relevant document type in DH) and
quoted speech (which CoreNLP offers, but apparently at a very basic level).
Maybe the TEI conventions go too far here, and we should rather keep it simple
for now, with transcribed speech in mind for a later point in time.
Original comment by daxenber...@gmail.com
on 11 Mar 2015 at 9:44
Ok, so I gather we should have at least two annotation types:
One for transcribed speech that roughly corresponds to the TEI "u" element.
One for quoted speech that roughly corresponds to the TEI "q" element.
Both should have a feature that indicates who the speaker is, roughly
corresponding to the "who" attribute on the "q" and "u" TEI elements.
My feeling is that this would cover all immediate needs.
Original comment by richard.eckart
on 11 Mar 2015 at 12:37
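As a sketch of what the two proposed types might look like as a UIMA type system descriptor (the type and feature names here are only illustrative assumptions, not necessarily the names eventually chosen for DKPro Core):

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <!-- Transcribed speech, roughly corresponding to TEI "u" -->
    <typeDescription>
      <name>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Utterance</name>
      <description>A stretch of transcribed speech.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>speaker</name>
          <description>Who produced the utterance (cf. TEI "who").</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
    <!-- Quoted speech, roughly corresponding to TEI "q"; same shape -->
    <typeDescription>
      <name>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Quote</name>
      <description>A span of quoted speech.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>speaker</name>
          <description>Who is quoted (cf. TEI "who").</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
```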
Sounds like a good starting point to me.
Original comment by daxenber...@gmail.com
on 11 Mar 2015 at 12:42
I think a good place to put such types would be the api.segmentation module.
Original comment by richard.eckart
on 11 Mar 2015 at 12:44
Hi DKPro Core people!
There is actually an ISO standard for transcriptions of speech now, based on
TEI, resulting from the work presented in the link Richard posted above
(http://www.tei-c.org/Activities/Council/Working/tcw25.xml) and interoperable
with most of the common transcription formats and even some widely used
transcription conventions. You can find more info at
http://www1.ids-mannheim.de/prag/muendlichekorpora/isodin.html and
http://www.exmaralda.org/en/tool/tei_drop/, and otherwise I'd try to answer any
questions you might have...
Best regards,
Hanna Hedeland, HZSK/CLARIN-D
Original comment by hanna.he...@gmail.com
on 11 Mar 2015 at 1:15
I think the annotations "quoted speech" and "speaker" would definitely be
handy.
Two things that would additionally suit my purposes:
1) storing the quoted speech utterances in some sort of structured way, for
example:
‘And after that,’ he continued, ‘do not be deceived by others.’
are two Utterances of one DirectSpeech of one Speaker.
2) storing the speaker as a probability vector - in modern literature it is
less common to see:
"There," said John
but rather things like:
"There!" John's finger pointed to Jack.
Then I would save in the speaker something like [John, 0.9; Jack, 0.1]
But perhaps the general case is annotated, known speakers, and speaker
prediction is just a special use case.
Original comment by l.flek...@gmail.com
on 11 Mar 2015 at 1:48
@l.flekova
@1) isn't it one utterance that is interrupted by a piece of text? I mean, I
suppose that "he" didn't actually make a significant pause while saying that.
There has been some contemplation about whether it might be useful/necessary
to be able to model discontinuous utterances (or, in this case, quoted speech spans).
@2) We do not have a concept of probabilities in any DKPro Core types yet.
Right now, we assume that there is one truth. I think it would fall to a
particular experiment setup to sub-class the DKPro Core type, adding
probabilities as needed for the individual setup.
Original comment by richard.eckart
on 11 Mar 2015 at 3:07
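For instance, an experiment-specific subtype could add probability features roughly like this (a sketch; the Quote supertype name and all feature names here are invented for illustration):

```xml
<typeDescription>
  <name>org.example.experiment.type.ProbabilisticQuote</name>
  <description>Quote with candidate speakers and their scores.</description>
  <supertypeName>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Quote</supertypeName>
  <features>
    <featureDescription>
      <name>candidateSpeakers</name>
      <description>Candidate speaker names.</description>
      <rangeTypeName>uima.cas.StringArray</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>candidateScores</name>
      <description>Probability for each candidate speaker, aligned by index.</description>
      <rangeTypeName>uima.cas.DoubleArray</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```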
Hi Richard,
@2) Makes sense, no objections to that.
@1) In principle you are right. I'll follow up if I can think of a
counter-example where one would need to model it.
Original comment by l.flek...@gmail.com
on 11 Mar 2015 at 4:22
Original issue reported on code.google.com by
richard.eckart
on 10 Mar 2015 at 3:34