Open shaked571 opened 5 months ago
That's an interesting point and I agree that this is indeed counterintuitive.
To give some historical context: the sentence/token ids are based on the Universal Dependencies format, where sentences and tokens are indexed starting at 1. In Universal Dependencies, token ids can also take the form of a range such as "1-2" for multiword tokens, or a decimal such as "1.1" for elided words (such as dropped pronouns that are annotated for syntax but do not correspond to any surface form token).
Meanwhile, the coref_chains indices are based on the format used in conll_transform (https://github.com/boberle/corefconversion) and refer literally to the index of the sentence or token within its list, so they start at 0.
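To make the two numbering schemes concrete, here is a minimal sketch of resolving a coref_chains entry back to the 1-based ids. It assumes each mention is a [sentence_index, start_index, end_index] triple with an inclusive end (which is what the single-token mention [0, 0, 0] from this thread suggests). The sample document is truncated, its tokenization is illustrative, and the helper name resolve_mention is hypothetical, not part of any library.

```python
# Hypothetical, truncated document in the shape quoted in this thread.
sentences = [
    {
        "id": 1,  # 1-based sentence id (Universal Dependencies style)
        "tokens": [
            {"id": 1, "text": "They"},  # 1-based token ids
            {"id": 2, "text": "say"},
        ],
    }
]


def resolve_mention(sentences, mention):
    """Map a 0-based [sentence_index, start_index, end_index] triple to ids and text.

    Assumption: end_index is inclusive.
    """
    sent_index, start_index, end_index = mention
    sentence = sentences[sent_index]  # plain list position, not the "id" field
    tokens = sentence["tokens"][start_index : end_index + 1]
    return {
        "sentence_id": sentence["id"],
        "token_ids": [token["id"] for token in tokens],
        "text": " ".join(token["text"] for token in tokens),
    }


resolved = resolve_mention(sentences, [0, 0, 0])
print(resolved)
# {'sentence_id': 1, 'token_ids': [1], 'text': 'They'}
```

Looking tokens up by list position rather than computing id - 1 also keeps working if a sentence ever contains range or decimal ids.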
Probably we should make the "coref_chains" attribute more explicit, e.g. something like:
    "coref_chains": [
        {
            "cluster_id": int,
            "mentions": [
                {
                    "start_index": int,
                    "start_token_id": int,
                    "end_index": int,
                    "end_token_id": int,
                }, ...
            ],
        }, ...
    ]
Or better yet have a documented Python Dataclass that represents each object.
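A rough sketch of what such dataclasses could look like, mirroring the field names proposed above. Nothing here exists in the dataset tooling today: the class names Mention and CorefChain and the helper mention_from_indices are hypothetical, and token ids are assumed to be plain ints as in the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Mention:
    """One mention span, carrying both numbering schemes explicitly."""

    start_index: int      # 0-based position in the sentence's token list
    start_token_id: int   # 1-based token id of the first token
    end_index: int        # 0-based position of the last token (assumed inclusive)
    end_token_id: int     # 1-based token id of the last token


@dataclass
class CorefChain:
    """A cluster of mentions that refer to the same entity."""

    cluster_id: int
    mentions: List[Mention] = field(default_factory=list)


def mention_from_indices(tokens, start_index, end_index):
    """Build a Mention from 0-based indices, reading the ids off the tokens."""
    return Mention(
        start_index=start_index,
        start_token_id=tokens[start_index]["id"],
        end_index=end_index,
        end_token_id=tokens[end_index]["id"],
    )


# Illustrative token list; only the first token comes from this thread.
tokens = [{"id": 1, "text": "They"}, {"id": 2, "text": "say"}]
chain = CorefChain(cluster_id=0, mentions=[mention_from_indices(tokens, 0, 0)])
print(chain)
```

Storing the index and the id side by side means a reader never has to guess which convention a number follows.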
Yep, I am aware of the mismatch between the different formats (I have code that converts my own data to 4 different ones, where indiscrim is the fifth).
I think it is worth addressing this explicitly somewhere.
Anyhow, thanks for the quick response!
Hey,
I was just converting my own dataset to the indiscrim format and noticed that although the sentence and token ids start from 1, the coref_chains indices start from 0.
For example, see coref-data/preco_indiscrim, where the first word "They" is a singleton cluster and is represented as [0, 0, 0] but sits under sentence 1, token 1:
[ { "id": 1, "speaker": null, "text": "They say that sticks and stones may break your bones, but words will never hurt you.", "tokens": [ { "id": 1, "text": "They" },
Is this intentional or a "bug"?