Questions_dataset-representation

vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

MIT License


Questions_dataset-representation #10

Closed fajarmuslim closed 2 years ago

fajarmuslim commented 2 years ago

Based on my reading of this code base, training uses the following features: cased_words, sent_id, speaker, pos, deprel, head, clusters.

These are then converted into: cased_words, sent_id, speaker, pos, deprel, head, head2span, word_clusters, span_clusters.

In the inference data example, however, the only features used are cased_words, sent_id, and optionally speaker.

My questions are:

1. How do we get the pos, deprel, head, and clusters data in inference mode? Are they derived from cased_words?
2. In training mode, are the speaker, pos, deprel, head, and clusters data used as well?

Thank you

vdobrovolskii commented 2 years ago

Hi!

Training and evaluation themselves require the following:

"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text
"head2span": triples of [span head, span start, span end] to train the span predictor
"word_clusters": lists of coreference clusters, where each cluster is a list of word indices
"span_clusters": lists of coreference clusters, where each cluster is a list of spans

"pos", "deprel", "head" are only used during data preparation to be able to convert a span-based dataset to a word-based one.

Inference only needs the following:

"cased_words": tokenized words of the text
"sent_id": sentence index for each word of the text
"speaker": speaker name for each word of the text (optional)

1. We don't get the pos, deprel and head data during inference, because we don't use them. Cluster data is the actual output of the model.
2. See above.
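
To make the two field lists above concrete, here is a minimal sketch of what an inference document and a training document might look like. The field names come from the answer above; the sentence, the speaker label, the exclusive span ends, and the helper `check_lengths` are made up for illustration and are not taken from the repository.

```python
# Toy documents: field names are from the thread, all values are invented.
inference_doc = {
    "cased_words": ["Alice", "said", "she", "was", "tired", "."],
    "sent_id": [0, 0, 0, 0, 0, 0],   # one sentence index per word
    "speaker": ["spk1"] * 6,         # optional at inference time
}

training_doc = {
    **inference_doc,
    # [span head, span start, span end]; end is ASSUMED to be exclusive
    "head2span": [[0, 0, 1], [2, 2, 3]],
    # one cluster linking the head words "Alice" (0) and "she" (2)
    "word_clusters": [[0, 2]],
    # the same cluster expressed as spans (ASSUMED [start, end) pairs)
    "span_clusters": [[[0, 1], [2, 3]]],
}


def check_lengths(doc):
    """Hypothetical helper: every per-word field has one entry per word."""
    n = len(doc["cased_words"])
    return all(len(doc[key]) == n for key in ("sent_id", "speaker") if key in doc)
```

The only difference between the two dictionaries is the three cluster-related fields, which matches the point that cluster data is what the model produces at inference time.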

fajarmuslim commented 2 years ago

Thanks for the explanation....

For head2span, how do we get the 'head' data?

I am still confused by this line of code. What does avg_spans actually do, and why is it required?

avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)

fajarmuslim commented 2 years ago

Are span start and end the indices where the span starts and ends in the whole text, or within the sentence?

vdobrovolskii commented 2 years ago

Heads are calculated in this function here.

avg_spans in the line of code you quoted calculates the average number of coreferent spans per document. It is used to weigh the loss function here.
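
As a small worked example of the quoted line, the sketch below computes avg_spans for two toy documents. The documents are invented; how the value then enters the loss is not shown in this thread, so it is left out.

```python
# Two toy documents with 2 and 4 entries in "head2span" (values are invented).
docs = [
    {"head2span": [[0, 0, 1], [2, 2, 3]]},
    {"head2span": [[1, 0, 2], [4, 4, 5], [7, 6, 8], [9, 9, 10]]},
]

# Same expression as in the quoted line: total number of spans / number of documents.
avg_spans = sum(len(doc["head2span"]) for doc in docs) / len(docs)
print(avg_spans)  # (2 + 4) / 2
```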

Span start and end are word indices within the whole text, not within the sentence.
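
A short sketch of what text-level indexing means in practice: the span in the second sentence below starts at word 3 of the text, even though it is the first word of its sentence. The example text and the helper `span_text` are hypothetical, and the span end is assumed to be exclusive (Python slice convention), which the thread does not state.

```python
# Two sentences; indices run over the whole text, not per sentence.
cased_words = ["Alice", "left", ".", "The", "tired", "woman", "slept", "."]
sent_id = [0, 0, 0, 1, 1, 1, 1, 1]

# [span head, span start, span end]; end ASSUMED exclusive.
head2span = [[0, 0, 1], [5, 3, 6]]


def span_text(words, start, end):
    """Hypothetical helper: join the words of a [start, end) span."""
    return " ".join(words[start:end])


for head, start, end in head2span:
    print(cased_words[head], "->", span_text(cased_words, start, end),
          "(sentence", str(sent_id[start]) + ")")
```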

fajarmuslim commented 2 years ago

How do we get doc['head']? Is it provided by the OntoNotes05 dataset?

vdobrovolskii commented 2 years ago

More or less. OntoNotes has constituency syntax data, which is converted to dependency syntax data. This is where the head/deprel/pos come from.

The reason for the conversion was that it was easier for me to deal with dependency graphs than with constituency structures. Both can be used, although one would need to rewrite the convert_to_heads bit to make it work with constituents.
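
For readers wondering how a dependency parse can yield a span head, here is a rough sketch of one common approach: pick the word in the span whose syntactic parent lies outside the span. This is not the repository's convert_to_heads code; the function name `find_span_head`, the use of None for the root, the exclusive span end, and the fallback to the last word are all assumptions made for illustration.

```python
from typing import List, Optional


def find_span_head(heads: List[Optional[int]], start: int, end: int) -> int:
    """Return the head word index of the span [start, end).

    heads[i] is the index of the syntactic parent of word i, or None for
    the root (ASSUMED representation).
    """
    # Words whose parent is outside the span (or that are the root).
    candidates = [
        i for i in range(start, end)
        if heads[i] is None or not (start <= heads[i] < end)
    ]
    if len(candidates) == 1:
        return candidates[0]
    # ASSUMED fallback when the span is not a single subtree: take the last word.
    return end - 1


# "The tired woman slept ." with "slept" as root.
words = ["The", "tired", "woman", "slept", "."]
heads = [2, 2, 3, None, 3]

print(words[find_span_head(heads, 0, 3)])  # head of "The tired woman"
```

A constituency-based variant would need a different rule (for example, head-finding tables over phrase labels), which is the rewrite mentioned above.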

fajarmuslim commented 2 years ago

Thank you for your excellent support.

I will close this issue