about the training data format

leileilin commented 2 years ago

Hello, I'd like to ask about the .jsonlines file produced by convert_to_jsonlines.py. Can training still succeed if some attributes in the jsonlines file, such as speaker and pos, are discarded?

vdobrovolskii commented 2 years ago

Do you mean, can you train without some of the attributes? You can totally replace every element in the "speaker" array by any string and the model will be able to learn. As for other keys, here are the ones that must be there for training (all the others are either legacy or needed only during preprocessing):

document_id:    str                    # document name
cased_words:    List[str]              # words
sent_id:        List[int]              # word id to sent id mapping
part_id:        int                    # document part id
speaker:        List[str]              # word id to speaker mapping

span_clusters:  List[List[List[int]]]  # list of clusters,
                                       #     each cluster is a list of spans
                                       #         each span is a list of two ints (start and end word ids)

head_clusters:  List[List[int]]        # list of clusters,
                                       #     each cluster is a list of span heads

head2span:      List[List[int]]        # list of training examples
                                       #     each example is a list of three ints
                                       #     head, span start, span end
                                       # this is used to train the model to predict spans from span heads

See this issue.
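
As a rough illustration of the key list above, here is a minimal sketch of a sanity check for one parsed .jsonlines record. The helper name `check_document` and the example document are hypothetical and not part of wl-coref. The span and head indices in the example are illustrative only, since the thread does not say whether span ends are inclusive.

```python
import json

# Keys named in the reply above, with the outermost Python type each should have.
REQUIRED_KEYS = {
    "document_id": str,
    "cased_words": list,
    "sent_id": list,
    "part_id": int,
    "speaker": list,
    "span_clusters": list,
    "head_clusters": list,
    "head2span": list,
}


def check_document(doc):
    """Return a list of problems found in one parsed jsonlines record.

    Hypothetical helper: it only checks that the required keys exist, that
    their outermost type matches, and that the per-word arrays are as long
    as cased_words.
    """
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in doc:
            problems.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"wrong type for {key}: {type(doc[key]).__name__}")

    words = doc.get("cased_words")
    if isinstance(words, list):
        # sent_id and speaker are word-level mappings, one entry per word.
        for key in ("sent_id", "speaker"):
            value = doc.get(key)
            if isinstance(value, list) and len(value) != len(words):
                problems.append(f"{key} has {len(value)} entries, expected {len(words)}")
    return problems


# Illustrative record; all values are made up for this sketch.
example = {
    "document_id": "example_doc",
    "cased_words": ["Alice", "said", "she", "left", "."],
    "sent_id": [0, 0, 0, 0, 0],
    "part_id": 0,
    "speaker": ["-", "-", "-", "-", "-"],
    "span_clusters": [[[0, 1], [2, 3]]],
    "head_clusters": [[0, 2]],
    "head2span": [[0, 0, 1], [2, 2, 3]],
}

# Round-trip through JSON, as one line of a .jsonlines file would be.
print(check_document(json.loads(json.dumps(example))))
```

Running a check like this over each line of a file before training can catch a dropped key early, but it does not verify that the cluster indices themselves are meaningful.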

leileilin commented 2 years ago

Thanks. So you mean that the "speaker" attribute cannot be discarded, right?

vdobrovolskii commented 2 years ago

It cannot be discarded, but it can be replaced with a placeholder value.
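
A minimal sketch of what such a replacement could look like, assuming each line of the file is one JSON document with "cased_words" and "speaker" keys. The function names and the "-" placeholder are hypothetical choices, not something wl-coref prescribes.

```python
import json


def with_placeholder_speaker(doc, placeholder="-"):
    """Return a copy of doc whose speaker array holds one placeholder per word."""
    out = dict(doc)
    out["speaker"] = [placeholder] * len(doc["cased_words"])
    return out


def rewrite_jsonlines(src_path, dst_path, placeholder="-"):
    """Copy a .jsonlines file, replacing the speaker array in every document."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                doc = with_placeholder_speaker(json.loads(line), placeholder)
                dst.write(json.dumps(doc) + "\n")
```

The key stays present and keeps one entry per word, so the word-to-speaker mapping still has the expected length even though it no longer carries information.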

leileilin commented 2 years ago

Thanks, I got it.

leileilin commented 2 years ago

I have another question: I don't understand what the split_jsonlines function in convert_to_jsonlines.py is for. Couldn't we just use the mv command to move the .jsonlines file from the temp dir to the data dir?

vdobrovolskii / wl-coref

about the training data format #23