vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"
MIT License

shall I use convert_to_heads when using CoNLL-U? #6

Closed brgsk closed 2 years ago

brgsk commented 2 years ago

Hi, thanks so much for your work! I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in .conllu format. I'm having quite a hard time trying to preprocess the data and modify some of your code to make it work. Can you share some insights/thoughts on that? I would be very much obliged.

Cheers

vdobrovolskii commented 2 years ago

Hi! Is there any coreference data in your files? I think the least painful way would be to convert your data to the .conll format as described here in the *_conll File Format section. Not all the columns are necessary; for instance, you can omit lemmas, frameset id, word sense and named entities. Then run convert_to_jsonlines.py and convert_to_heads.py on the result.

The alternative is to drop my convert_to_jsonlines.py script completely and preprocess your data yourself. You need to output files with the following structure: one json per line, each with the following fields:

document_id:    str,                   # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int,                   # document part id
speaker:        List[str]              # word id to speaker mapping
pos:            List[str]              # word id to POS mapping
deprel:         List[str]              # word id to dependency relation mapping
head:           List[int]              # word id to head word id mapping, None for root
clusters:       List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)
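To make the format concrete, here is a hypothetical two-sentence document with a single coreference cluster, built and serialized as one jsonlines record (the words, tags and ids are invented for illustration, not taken from the repo's data):

```python
import json

# Tiny invented document: "Mary saw her ." / "She left ."
# with one cluster {Mary, her, She}.
doc = {
    "document_id": "example_doc",
    "cased_words": ["Mary", "saw", "her", ".", "She", "left", "."],
    "sent_id":     [0, 0, 0, 0, 1, 1, 1],       # word id -> sentence id
    "part_id":     0,
    "speaker":     ["spk1"] * 7,                # word id -> speaker
    "pos":         ["PROPN", "VERB", "PRON", "PUNCT",
                    "PRON", "VERB", "PUNCT"],
    "deprel":      ["nsubj", "root", "obj", "punct",
                    "nsubj", "root", "punct"],
    "head":        [1, None, 1, 1, 5, None, 5], # None marks sentence roots
    # one cluster of three single-word spans; upper bounds are exclusive
    "clusters":    [[[0, 1], [2, 3], [4, 5]]],
}

# jsonlines: each document is one json object on its own line
line = json.dumps(doc)
assert "\n" not in line
```

All the per-word lists must have the same length as cased_words.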

The resulting file should be passed to convert_to_heads.py.

You can go even further and drop the convert_to_heads.py script. Then you need to output the following jsonlines file:

document_id:    str,                   # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int,                   # document part id
speaker:        List[str]              # word id to speaker mapping

span_clusters:  List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

head_clusters:  List[List[int]]        # list of clusters,
                                       #     each cluster is a list of span heads

head2span:      List[List[int]]        # list of training examples
                                       #     each example is a list of three ints
                                       #     head, span start, span end
                                       # this is used to train the model to predict spans from span heads 
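Here is a hypothetical example of how span_clusters, head_clusters and head2span fit together (the document and head choices are invented for illustration): each head in head2span must lie inside its span, and head_clusters is just span_clusters with each span replaced by its head word id.

```python
# Invented document: "The boy went upstairs . He slept ."
# with one cluster {"The boy", "He"}.
doc = {
    "document_id": "example_doc",
    "cased_words": ["The", "boy", "went", "upstairs", ".",
                    "He", "slept", "."],
    "sent_id":     [0, 0, 0, 0, 0, 1, 1, 1],
    "part_id":     0,
    "speaker":     ["spk1"] * 8,
    # spans use exclusive upper bounds: [0, 2] is "The boy"
    "span_clusters": [[[0, 2], [5, 6]]],
    # "boy" (id 1) is the head of "The boy"; "He" (id 5) heads itself
    "head_clusters": [[1, 5]],
    # training examples: (head, span start, span end)
    "head2span":     [[1, 0, 2], [5, 5, 6]],
}

# sanity check: every head lies inside its span
for head, start, end in doc["head2span"]:
    assert start <= head < end
```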

Let me know if I can help with anything else.

brgsk commented 2 years ago

@vdobrovolskii thanks for your reply! Looking at convert_to_jsonlines.py right now, and if I understand it and your comment above correctly, "one json per line" means one json-formatted sentence?

vdobrovolskii commented 2 years ago

"One json per line" is rather "one document per line". You can read more about the format here.

brgsk commented 2 years ago

Yeah, I'm familiar with jsonlines; I was wondering about your data's structure. Thanks again, closing this one :+1:

brgsk commented 2 years ago

One more question! In CorefSpansHolder._add_one(), when appending a span to self.spans, why is the appended list [word_id, word_id + 1]? Shouldn't it be [word_id, word_id]?

vdobrovolskii commented 2 years ago

Span indices don't include the upper bound. For instance, in the following example the span [0, 2] means "words with indices starting with 0 and going up to, but not including, 2", i.e. indices 0 and 1 ("The", "boy").

The boy went upstairs.

By that logic, [word_id, word_id] will be a span of length 0.
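The exclusive upper bound matches Python slicing directly, which a short sketch makes obvious:

```python
words = ["The", "boy", "went", "upstairs", "."]

# span [0, 2]: start inclusive, end exclusive -> plain Python slicing
assert words[0:2] == ["The", "boy"]

# a single word at word_id is the span [word_id, word_id + 1]
word_id = 1
assert words[word_id:word_id + 1] == ["boy"]

# [word_id, word_id] selects nothing -> a span of length 0
assert words[word_id:word_id] == []
```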

brgsk commented 2 years ago

Sure, thanks a lot :smiley: