studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Problems on TACRED dataset, is it char-based? #134

Closed lshowway closed 2 years ago

lshowway commented 2 years ago

Thanks for your work.

I am testing LUKE on the TACRED dataset using the provided Transformers libraries. But I found that the text is processed into characters, while other methods, e.g., K-Adapter, work at the word level. Details as follows: converted input text: Z a g a t S u r v e y , t h e g u i d e e m p i r e t h a t s t a r t e d a s a h o b b y f o r T i m a n d N i n a Z a g a t i n 1 9 7 9 ... .

tokenized input ids: ['<s>', 'ĠZ', '[HEAD]', 'Ġa', 'Ġg', 'Ġa', 'Ġt', '[HEAD]', 'Ġ', 'ĠS', 'Ġu', 'Ġr', 'Ġv', 'Ġe', 'Ġy', 'Ġ', 'Ġ', 'Ġ,', 'Ġ', 'Ġ', 'Ġt', 'Ġh', 'Ġe', 'Ġ', 'Ġ', 'Ġg', 'Ġu', 'Ġi', 'Ġd', 'Ġe', 'Ġ', 'Ġ', 'Ġe', 'Ġm', 'Ġp', 'Ġi', 'Ġr', 'Ġe', 'Ġ', 'Ġ', 'Ġt', 'Ġh', 'Ġa', 'Ġt', 'Ġ', 'Ġ', 'Ġs', 'Ġt', 'Ġa', 'Ġr', 'Ġt', 'Ġe', 'Ġd', 'Ġ', 'Ġ', 'Ġa', 'Ġs', 'Ġ', 'Ġ', 'Ġa', 'Ġ', 'Ġ', 'Ġh', 'Ġo', 'Ġb', 'Ġb', 'Ġy', 'Ġ', 'Ġ', 'Ġf', 'Ġo', 'Ġr', 'Ġ', 'Ġ', 'ĠT', 'Ġi', 'Ġm', 'Ġ', 'Ġ', 'Ġa', 'Ġn', 'Ġd', 'Ġ', 'Ġ', 'ĠN', 'Ġi', 'Ġn', 'Ġa', 'Ġ', 'Ġ', 'ĠZ', 'Ġa', 'Ġg', 'Ġa', 'Ġt', 'Ġ', 'Ġ', 'Ġi', 'Ġn', '[TAIL]'...

So, is TACRED processed at the char level, or at the word level?

ryokan0123 commented 2 years ago

Hmm, this is weird. The dataset should be processed at the (sub)word level.

Which version of the LUKE code are you using, legacy or allennlp?

TACRED dataset by provided Transformers libraries

Maybe the format of this dataset is incompatible with our code. Please make sure that the dataset file looks like this (taken from the dataset file we use).

[{'id': 'e7798fb926b9403cfcd2', 'docid': 'APW_ENG_20101103.0539', 'relation': 'per:title', 'token': ['At', 'the', 'same', 'time', ',', 'Chief', 'Financial', 'Officer', 'Douglas', 'Flint', 'will', 'become', 'chairman', ',', 'succeeding', 'Stephen', 'Green', 'who', 'is', 'leaving', 'to', 'take', 'a', 'government', 'job', '.'], 'subj_start': 8, 'subj_end': 9, 'obj_start': 12, 'obj_end': 12, 'subj_type': 'PERSON', 'obj_type': 'TITLE', 'stanford_pos': ['IN', 'DT', 'JJ', 'NN', ',', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'MD', 'VB', 'NN', ',', 'VBG', 'NNP', 'NNP', 'WP', 'VBZ', 'VBG', 'TO', 'VB', 'DT', 'NN', 'NN', '.'], 'stanford_ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'stanford_head': [4, 4, 4, 12, 12, 10, 10, 10, 10, 12, 12, 0, 12, 12, 12, 17, 15, 20, 20, 17, 22, 20, 25, 25, 22, 12], 'stanford_deprel': ['case', 'det', 'amod', 'nmod', 'punct', 'compound', 'compound', 'compound', 'compound', 'nsubj', 'aux', 'ROOT', 'xcomp', 'punct', 'xcomp', 'compound', 'dobj', 'nsubj', 'aux', 'acl:relcl', 'mark', 'xcomp', 'det', 'compound', 'dobj', 'punct']}]
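As a quick sanity check, you can test whether an example from your dataset file carries a pre-tokenized token list rather than a raw string (a hypothetical helper, not part of our code):

```python
def looks_like_luke_tacred(example: dict) -> bool:
    """True if the example uses the pre-tokenized format the reader expects."""
    return isinstance(example.get("token"), list)

# The expected format passes; a raw-string format (like ERNIE-THU's) fails.
print(looks_like_luke_tacred({"token": ["At", "the", "same", "time"]}))  # True
print(looks_like_luke_tacred({"text": "At the same time"}))              # False
```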
lshowway commented 2 years ago

@Ryou0634 Thanks for your detailed reply. That means the word the should be tokenized into the, not t h e, right? I am using the transformers version; the fault may be caused by the fact that I just use the dataset downloaded from ERNIE-THU.

ryokan0123 commented 2 years ago

Yeah, the word the should be tokenized into the, not t h e.

I checked the data from ERNIE-THU. The difference is the format of input text.

The input data format that we expect is a list of tokens, like:

# in dev.json
{'token': ['At', 'the', 'same', 'time', ',', 'Chief', 'Financial', 'Officer', 'Douglas', 'Flint', 'will', 'become', 'chairman', ',', 'succeeding', 'Stephen', 'Green', 'who', 'is', 'leaving', 'to', 'take', 'a', 'government', 'job', '.']}

However, in the data you mentioned, the input sentences are in string format.

# in dev.json
{'text': 'At the same time , Chief Financial Officer Douglas Flint will become chairman , succeeding Stephen Green who is leaving to take a government job .'}

You can modify here to adapt the code to that format. https://github.com/studio-ousia/luke/blob/eff5d0ae528c544aa1d6e7b51bfcd76992d266bf/examples/relation_classification/reader.py#L37
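A minimal sketch of such an adaptation (a hypothetical helper, not the actual reader code). The key point is that iterating over a string yields characters, which is exactly what produced the char-level tokenization you saw; since TACRED text is whitespace-separated, splitting on whitespace recovers the token list:

```python
def get_tokens(example: dict) -> list:
    """Return a list of word tokens for one TACRED example."""
    if "token" in example:
        # Original TACRED format: already tokenized.
        return example["token"]
    # ERNIE-THU format: a plain string; split on whitespace
    # (the TACRED text is space-separated, so this is lossless here).
    return example["text"].split()

# Demonstration of the bug: treating a string as a token sequence
# iterates over characters, not words.
sentence = "At the same time"
print(list(sentence)[:3])                 # ['A', 't', ' ']
print(get_tokens({"text": sentence}))     # ['At', 'the', 'same', 'time']
```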

Also, if you want to use the ERNIE-THU data, be careful about how the entity positions are specified. Our code specifies entity positions by their position in the token list ({"subj_start": 8, "subj_end": 9, "obj_start": 12, "obj_end": 12}), but the ERNIE-THU data uses character offsets in the string ({'ents': [['Access Industries', 43, 60, 0.5], ['Len Blavatnik', 107, 120, 0.5]]}).
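One way to bridge the two conventions is to map character offsets onto token indices. A rough sketch (my own helper, not repo code; it assumes whitespace-separated text and an exclusive character end offset, which appears to match the ERNIE-THU spans above):

```python
def char_span_to_token_span(text: str, start: int, end: int):
    """Map a [start, end) character span onto inclusive (first, last) token indices."""
    tokens = text.split()
    offsets = []
    pos = 0
    for tok in tokens:
        # Locate each token in the original string and record its char span.
        pos = text.index(tok, pos)
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    first = next(i for i, (s, e) in enumerate(offsets) if e > start)
    last = max(i for i, (s, e) in enumerate(offsets) if s < end)
    return first, last

text = "Len Blavatnik owns Access Industries ."
print(char_span_to_token_span(text, 0, 13))    # 'Len Blavatnik' -> (0, 1)
print(char_span_to_token_span(text, 19, 36))   # 'Access Industries' -> (3, 4)
```

The inclusive (first, last) output matches how subj_start/subj_end index tokens in the TACRED example above.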