vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"
MIT License

Inference on conversation. #16

Closed maherr13 closed 2 years ago

maherr13 commented 2 years ago

Hello, great work.

I had two questions:

  1. What is sent_id in the sample input file supposed to refer to?

  2. If I want to run inference on dialogue, like the tc genre, what should the conversation format be?

vdobrovolskii commented 2 years ago

  1. sent_id is the index of the sentence each token belongs to. For instance, for "Hello . This is dog speaking ." you will get [0, 0, 1, 1, 1, 1, 1].

  2. What exactly do you mean by conversation format?
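For reference, the sent_id rule from point 1 can be sketched in Python. This is a minimal illustration, not code from the repository: `sentence_ids` is a hypothetical helper, and it assumes the text has already been split into sentences and tokens.

```python
def sentence_ids(sentences):
    """Return one sentence index per token.

    `sentences` is a list of sentences, each a list of tokens
    (sentence splitting and tokenization are assumed to be done already).
    """
    return [idx for idx, tokens in enumerate(sentences) for _ in tokens]


# The example from the answer above: two sentences, seven tokens in total.
sentences = [["Hello", "."], ["This", "is", "dog", "speaking", "."]]
print(sentence_ids(sentences))  # [0, 0, 1, 1, 1, 1, 1]
```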

maherr13 commented 2 years ago

I mean, if I have a conversation such as:

person_1: Hey, how are you? person_2: I'm fine, how was your day? . . . . and so on

Examples like that would be in the telephone conversations (tc) genre in OntoNotes.

If I want to use the model to do coreference resolution on dialogue like that, what should the format be (for example, unique tokens around the speakers and so on)?

If there is no particular format, could you provide an example for such a case?

vdobrovolskii commented 2 years ago

{
        "document_id": "tc_mydoc_001",
        "cased_words": ["Hey", ",", "how", "are", "you", "?", "I", "'m", "fine", ",", "how", "was", "your", "day", "?"],
        "sent_id": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "speaker": ["#1", "#1", "#1", "#1", "#1", "#1", "#2", "#2", "#2", "#2", "#2", "#2", "#2", "#2", "#2"]
}

Something like that?
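
A document in this shape can also be assembled programmatically from speaker turns. The sketch below is only an illustration: `build_document` is a hypothetical helper (not part of wl-coref), tokenization and sentence splitting are assumed to be done beforehand, and serializing one JSON object per line is an assumption about how the input file is laid out.

```python
import json


def build_document(doc_id, turns):
    """Build a document dict from dialogue turns.

    `turns` is a list of (speaker, sentences) pairs, where `sentences`
    is a list of already-tokenized sentences (lists of tokens).
    """
    doc = {"document_id": doc_id, "cased_words": [], "sent_id": [], "speaker": []}
    sent_idx = 0
    for speaker, sentences in turns:
        for tokens in sentences:
            doc["cased_words"].extend(tokens)
            # Every token carries its sentence index and its speaker label.
            doc["sent_id"].extend([sent_idx] * len(tokens))
            doc["speaker"].extend([speaker] * len(tokens))
            sent_idx += 1
    return doc


turns = [
    ("#1", [["Hey", ",", "how", "are", "you", "?"]]),
    ("#2", [["I", "'m", "fine", ",", "how", "was", "your", "day", "?"]]),
]
doc = build_document("tc_mydoc_001", turns)

# The three per-token lists must stay the same length.
assert len(doc["cased_words"]) == len(doc["sent_id"]) == len(doc["speaker"])

# One JSON object per line (assumed file layout).
line = json.dumps(doc)
print(line)
```

Keeping the speaker label as a per-token list, rather than inserting special speaker tokens into the text, matches the example above, where "speaker" runs parallel to "cased_words".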
maherr13 commented 2 years ago

That's exactly what I was asking, thanks.

vdobrovolskii commented 2 years ago

You're welcome! You can also try with punctuation and see if it works better :)