Closed maherr13 closed 2 years ago
sent_id is the index of the sentence that each token belongs to. For instance, for "Hello . This is dog speaking ." you will get [0, 0, 1, 1, 1, 1, 1].
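For anyone who wants to generate this field automatically, here is a minimal sketch, assuming the text is already split into sentences and tokens. The helper name sent_ids is hypothetical and not part of the repository.

```python
def sent_ids(sentences):
    """Return one sentence index per token.

    `sentences` is a list of sentences, each a list of token strings.
    (Hypothetical helper for illustration, not part of the repository.)
    """
    return [i for i, sent in enumerate(sentences) for _ in sent]


sentences = [["Hello", "."], ["This", "is", "dog", "speaking", "."]]
print(sent_ids(sentences))  # [0, 0, 1, 1, 1, 1, 1]
```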
What exactly do you mean by conversation format?
I mean, if I have a conversation such as:
person_1: Hey, how are you? person_2: I'm fine, how was your day? . . . . and so on
Examples like that appear in the telephone conversations (tc) genre in OntoNotes.
If I want to use the model for coreference resolution on dialogue like that, what format should the input have (for example, unique tokens around the speakers and so on)?
If there is no particular format, could you provide an example for such a case?
{
"document_id": "tc_mydoc_001",
"cased_words": ["Hey", ",", "how", "are", "you", "?", "I", "'m", "fine", ",", "how", "was", "your", "day", "?"],
"sent_id": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"speaker": ["#1", "#1", "#1", "#1", "#1", "#1", "#2", "#2", "#2", "#2", "#2", "#2", "#2", "#2", "#2"]
}
Something like that?
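If you have many dialogues, a document in this shape can be assembled from speaker turns with a few lines of Python. This is only a sketch under some assumptions: the turns are already tokenized and split into sentences, the function name build_doc is hypothetical, and the speaker labels "#1" and "#2" are just the ones from the example in this thread.

```python
import json


def build_doc(document_id, turns):
    """Flatten speaker turns into parallel per-token lists.

    `turns` is a list of (speaker, sentences) pairs, where `sentences`
    is a list of token lists. (Hypothetical helper for illustration.)
    """
    doc = {"document_id": document_id, "cased_words": [], "sent_id": [], "speaker": []}
    sent_idx = 0
    for speaker, sentences in turns:
        for tokens in sentences:
            doc["cased_words"].extend(tokens)
            # Every token gets the index of its sentence and its speaker label.
            doc["sent_id"].extend([sent_idx] * len(tokens))
            doc["speaker"].extend([speaker] * len(tokens))
            sent_idx += 1
    return doc


turns = [
    ("#1", [["Hey", ",", "how", "are", "you", "?"]]),
    ("#2", [["I", "'m", "fine", ",", "how", "was", "your", "day", "?"]]),
]
doc = build_doc("tc_mydoc_001", turns)
print(json.dumps(doc))
```

The three lists must stay the same length, one entry per token, so building them in the same loop avoids alignment mistakes.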
That's exactly what I was asking, thanks.
You're welcome! You can also try with punctuation and see if it works better :)
Hello, great work.
I had two questions:
What is sent_id in the sample input file supposed to refer to?
If I want to run inference on dialogue like the tc genre, what should the conversation format be?