ianporada / coref-data

A collection of coreference datasets in a standardized format
Apache License 2.0

Start token - sentence vs coref_chain #1

Open shaked571 opened 1 month ago

shaked571 commented 1 month ago

Hey,

I was just converting my own dataset to the indiscrim format and noticed that although the sentence and token ids start from 1, the coref_chains indices start from 0.

For example, see coref-data/preco_indiscrim, where the first word "They" is a singleton cluster represented as [0, 0, 0] but appears under sentence 1, token 1: [ { "id": 1, "speaker": null, "text": "They say that sticks and stones may break your bones, but words will never hurt you.", "tokens": [ { "id": 1, "text": "They" },

Is it intentional or a "bug"?

ianporada commented 1 month ago

That's an interesting point and I agree that this is indeed counterintuitive.

To give some historical context: the sentence/token ids follow the Universal Dependencies format, where sentences and tokens are indexed starting at 1. In Universal Dependencies, token ids can also take the form "1-2" for multiword tokens and "1.1" for elided words (such as dropped pronouns that are annotated for syntax but do not correspond to any surface-form token).

Meanwhile, the coref_chains indices are based on the format used in conll_transform (https://github.com/boberle/corefconversion) and refer literally to the position of the sentence or token within its list, which is why they start at 0.

We should probably make the "coref_chains" attribute more explicit, e.g. something like
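To make the mismatch concrete, here is a minimal sketch of mapping a coref_chains mention back to the 1-based ids. It assumes a mention is [sentence_index, start_index, end_index] with an inclusive end index, which is consistent with the [0, 0, 0] singleton above but is not confirmed by any documentation in this thread. The helper name resolve_mention and the toy document are hypothetical.

```python
# Toy document in the shape quoted above; only the fields used here are included.
sentences = [
    {
        "id": 1,
        "tokens": [{"id": 1, "text": "They"}, {"id": 2, "text": "say"}],
    }
]


def resolve_mention(sentences, mention):
    """Map a 0-based [sentence_index, start_index, end_index] mention to ids and text.

    Assumption: end_index is inclusive, as suggested by the [0, 0, 0] singleton.
    """
    sent_idx, start_idx, end_idx = mention
    sentence = sentences[sent_idx]  # list position, not the "id" field
    tokens = sentence["tokens"][start_idx : end_idx + 1]
    return {
        "sentence_id": sentence["id"],
        "start_token_id": tokens[0]["id"],
        "end_token_id": tokens[-1]["id"],
        "text": " ".join(t["text"] for t in tokens),
    }


resolved = resolve_mention(sentences, [0, 0, 0])
print(resolved)
```

Looking up the "id" fields through list positions, rather than adding 1 to each index, also keeps working if ids are ever not plain consecutive integers.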

"coref_chains": [
    {
        "cluster_id": int,
        "mentions": [
            {
                "start_index": int,
                "start_token_id": int,
                "end_index": int,
                "end_token_id": int
            }, ...
        ]
    }, ...
]

Or, better yet, have a documented Python dataclass that represents each object.
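As a rough sketch of what such dataclasses might look like, mirroring only the field names proposed above: the class names Mention and CorefChain are hypothetical and nothing like this exists in the repository yet.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass(frozen=True)
class Mention:
    """One mention span; field names follow the schema proposed in this comment."""

    start_index: int     # 0-based position of the first token in the token list
    start_token_id: int  # id of that token (1-based, as in the "tokens" entries)
    end_index: int       # 0-based position of the last token
    end_token_id: int    # id of the last token


@dataclass
class CorefChain:
    """A cluster of mentions sharing one cluster_id."""

    cluster_id: int
    mentions: List[Mention] = field(default_factory=list)


# The singleton "They" from the example in this issue: index 0, token id 1.
chain = CorefChain(cluster_id=0, mentions=[Mention(0, 1, 0, 1)])
print(asdict(chain))
```

Storing both the index and the id on each mention makes the 0-based versus 1-based distinction visible in the data itself instead of leaving it to convention.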

shaked571 commented 1 month ago

Yep, I am aware of the mismatch between the different formats (I have code that converts my own data to 4 different ones, and indiscrim is the fifth).

I think it is worth addressing this explicitly somewhere.

Anyhow, thanks for the quick response!