Closed brgsk closed 2 years ago
Hi!
Is there any coreference data in your files? I think the least painful way would be to convert your data to the .conll format as described here in the *_conll File Format section.
Not all the columns are necessary; for instance, you can omit lemmas, frameset ids, word senses and named entities.
Then run convert_to_jsonlines.py and convert_to_heads.py on the result.
Alternatively, you can drop my convert_to_jsonlines.py script completely and preprocess your data yourself. You need to output files with the following structure:
one json per line, each with the following fields:

```python
document_id: str                 # document name
cased_words: List[str]           # words
sent_id: List[int]               # word id to sent id mapping
part_id: int                     # document part id
speaker: List[str]               # word id to speaker mapping
pos: List[str]                   # word id to POS mapping
deprel: List[str]                # word id to dependency relation mapping
head: List[int]                  # word id to head word id mapping, None for root
clusters: List[List[List[int]]]  # list of clusters;
                                 # each cluster is a list of spans,
                                 # each span is a list of two ints
                                 # (start and end word ids)
```
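For illustration, a minimal document in this format might look like the following (all values are toy data, not taken from any real corpus):

```python
import json

# A toy one-sentence document in the expected format (all values invented).
doc = {
    "document_id": "toy_doc",
    "cased_words": ["The", "boy", "went", "upstairs", "."],
    "sent_id":     [0, 0, 0, 0, 0],
    "part_id":     0,
    "speaker":     ["narrator"] * 5,
    "pos":         ["DET", "NOUN", "VERB", "ADV", "PUNCT"],
    "deprel":      ["det", "nsubj", "root", "advmod", "punct"],
    "head":        [1, 2, None, 2, 2],   # None marks the root
    "clusters":    [[[0, 2]]],           # one cluster with the span "The boy"
}
line = json.dumps(doc)  # one document per line in the .jsonlines file
```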
The resulting file should then be passed to convert_to_heads.py.
You can go even further and drop the convert_to_heads.py script as well. Then you need to output a jsonlines file with the following fields:
```python
document_id: str                      # document name
cased_words: List[str]                # words
sent_id: List[int]                    # word id to sent id mapping
part_id: int                          # document part id
speaker: List[str]                    # word id to speaker mapping
span_clusters: List[List[List[int]]]  # list of clusters;
                                      # each cluster is a list of spans,
                                      # each span is a list of two ints
                                      # (start and end word ids)
head_clusters: List[List[int]]        # list of clusters;
                                      # each cluster is a list of span heads
head2span: List[List[int]]            # list of training examples;
                                      # each example is a list of three ints:
                                      # head, span start, span end;
                                      # used to train the model to predict
                                      # spans from span heads
```
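To make the relation between these fields concrete, here is a hedged sketch. The `heads_of` mapping below is a toy placeholder; the actual convert_to_heads.py script picks each span's head from the dependency tree.

```python
# Toy illustration of how span_clusters, head_clusters and head2span relate.
# heads_of maps (start, end) -> head word id; the values here are assumed,
# real head selection relies on the dependency structure.
span_clusters = [[[0, 2], [5, 6]]]   # one cluster: e.g. "The boy" and "he"
heads_of = {(0, 2): 1, (5, 6): 5}

# One head word id per span, grouped the same way as span_clusters:
head_clusters = [[heads_of[tuple(s)] for s in cluster]
                 for cluster in span_clusters]

# One [head, span start, span end] triple per span:
head2span = [[heads_of[tuple(s)], s[0], s[1]]
             for cluster in span_clusters for s in cluster]
```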
Let me know if I can help with anything else.
@vdobrovolskii thanks for your reply! I'm looking at convert_to_jsonlines.py right now, and if I understand it and your comment above correctly, "one json per line" means one json-formatted sentence?
"One json per line" is rather "one document per line". You can read more about the format here.
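As a quick illustration of the jsonlines convention (toy file name and contents assumed):

```python
import json

docs = [{"document_id": "doc_0", "cased_words": ["Hi", "!"]},
        {"document_id": "doc_1", "cased_words": ["Bye", "!"]}]

# Write: one complete document serialized as json per line.
with open("toy.jsonlines", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read: parse each line back into a document.
with open("toy.jsonlines") as f:
    loaded = [json.loads(line) for line in f]
```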
Yeah, I'm familiar with jsonlines; I was wondering about your data's structure.
Thanks again, closing this one :+1:
One more question!
In CorefSpansHolder._add_one(), when appending a span to self.spans, why is the appended list [word_id, word_id + 1]? Shouldn't it be [word_id, word_id]?
Span indices don't include the upper bound.
For instance, in the following example the span [0, 2] means "words with indices starting at 0 and going up to, but not including, 2", i.e. indices 0 and 1 ("The", "boy").
The boy went upstairs.
By that logic, [word_id, word_id] would be a span of length 0.
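This half-open convention matches Python slicing, which makes it easy to check on the example sentence above:

```python
# Spans use half-open intervals [start, end), just like Python slices.
words = ["The", "boy", "went", "upstairs", "."]

two_word_span = words[0:2]       # span [0, 2] covers "The" and "boy"
one_word_span = words[3:3 + 1]   # a single word needs [word_id, word_id + 1]
empty_span = words[3:3]          # [word_id, word_id] selects nothing
```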
Sure, thanks a lot :smiley:
Hi, thanks so much for your work! I have a question regarding the convert_to_heads.py script. I'm trying to make RoBERTa learn coreference resolution, but my data is in .conllu format. I'm having quite a hard time trying to preprocess the data / modify some of your code to make it work. Can you share some insights/thoughts on that? I would be very much obliged. Cheers