lshowway closed this issue 2 years ago
Hmm, this is weird. The dataset should be processed into (sub)word-level.
Which version of the LUKE code are you using, legacy or allennlp?
The Transformers version, with the TACRED dataset.
Maybe the format of this dataset is incompatible with our code. Please make sure that the dataset file looks like this (taken from the dataset file we use).
[{'id': 'e7798fb926b9403cfcd2', 'docid': 'APW_ENG_20101103.0539', 'relation': 'per:title', 'token': ['At', 'the', 'same', 'time', ',', 'Chief', 'Financial', 'Officer', 'Douglas', 'Flint', 'will', 'become', 'chairman', ',', 'succeeding', 'Stephen', 'Green', 'who', 'is', 'leaving', 'to', 'take', 'a', 'government', 'job', '.'], 'subj_start': 8, 'subj_end': 9, 'obj_start': 12, 'obj_end': 12, 'subj_type': 'PERSON', 'obj_type': 'TITLE', 'stanford_pos': ['IN', 'DT', 'JJ', 'NN', ',', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'MD', 'VB', 'NN', ',', 'VBG', 'NNP', 'NNP', 'WP', 'VBZ', 'VBG', 'TO', 'VB', 'DT', 'NN', 'NN', '.'], 'stanford_ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'stanford_head': [4, 4, 4, 12, 12, 10, 10, 10, 10, 12, 12, 0, 12, 12, 12, 17, 15, 20, 20, 17, 22, 20, 25, 25, 22, 12], 'stanford_deprel': ['case', 'det', 'amod', 'nmod', 'punct', 'compound', 'compound', 'compound', 'compound', 'nsubj', 'aux', 'ROOT', 'xcomp', 'punct', 'xcomp', 'compound', 'dobj', 'nsubj', 'aux', 'acl:relcl', 'mark', 'xcomp', 'det', 'compound', 'dobj', 'punct']}]
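As a quick sanity check, you could verify that every example in your dataset file matches this schema before training. This is a hypothetical helper (not part of the repo); the field names are taken from the sample above:

```python
import json

# Minimal set of fields each example needs, per the sample above.
REQUIRED_FIELDS = {"token", "relation", "subj_start", "subj_end",
                   "obj_start", "obj_end"}

def check_tacred_file(path):
    """Return the number of examples, raising if any example lacks the
    required fields or stores the sentence as a string instead of a
    list of tokens."""
    with open(path) as f:
        examples = json.load(f)
    for i, ex in enumerate(examples):
        missing = REQUIRED_FIELDS - ex.keys()
        if missing:
            raise ValueError(f"example {i} is missing fields: {missing}")
        if not isinstance(ex["token"], list):
            raise ValueError(f"example {i}: 'token' must be a list of words")
    return len(examples)
```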
@Ryou0634
Thanks for your detailed reply. That means the word `the` should be tokenized into `the`, not `t h e`, right?
I am using the Transformers version; the problem may be caused by the fact that I used the dataset downloaded from ERNIE-THU.
Yeah, the word `the` should be tokenized into `the`, not `t h e`.
I checked the data from ERNIE-THU. The difference is the format of the input text.
The input data format that we expect is a list of tokens, like:
```
# in dev.json
{'token': ['At', 'the', 'same', 'time', ',', 'Chief', 'Financial', 'Officer', 'Douglas', 'Flint', 'will', 'become', 'chairman', ',', 'succeeding', 'Stephen', 'Green', 'who', 'is', 'leaving', 'to', 'take', 'a', 'government', 'job', '.']}
```
However, in the data you mentioned, the input sentences are plain strings:
```
# in dev.json
{'text': 'At the same time , Chief Financial Officer Douglas Flint will become chairman , succeeding Stephen Green who is leaving to take a government job .'}
```
You can modify the reader here to adapt the code to that format: https://github.com/studio-ousia/luke/blob/eff5d0ae528c544aa1d6e7b51bfcd76992d266bf/examples/relation_classification/reader.py#L37
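For example, a minimal preprocessing sketch (untested against the repo; `to_token_format` is a hypothetical helper, and it assumes the `text` field is already whitespace-separated, as in the ERNIE-THU dump shown above):

```python
def to_token_format(example):
    """Convert a string-format example ({'text': ...}) into the
    token-list format ({'token': [...]}) that the reader expects.

    Assumes the sentence is already whitespace-tokenized, which holds
    for the ERNIE-THU dump; otherwise a real tokenizer is needed.
    """
    example = dict(example)  # avoid mutating the caller's dict
    example["token"] = example.pop("text").split()
    return example
```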
Also, if you want to use the ERNIE-THU data, be careful about how the entity positions are specified. Our code specifies entity positions by token indices (`{"subj_start": 8, "subj_end": 9, "obj_start": 12, "obj_end": 12}`), whereas the ERNIE-THU data uses character offsets into the string (`{'ents': [['Access Industries', 43, 60, 0.5], ['Len Blavatnik', 107, 120, 0.5]]}`).
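If you still want to reuse the ERNIE-THU data, those character offsets have to be mapped to token indices first. A rough sketch (`char_span_to_token_span` is a hypothetical helper; it assumes whitespace tokenization and end-exclusive character offsets — please verify both against the actual data):

```python
def char_span_to_token_span(text, char_start, char_end):
    """Map an end-exclusive character span in `text` to inclusive
    (token_start, token_end) indices over whitespace tokens, matching
    the subj_start/subj_end convention."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate each token in the string
        pos = start + len(tok)
        spans.append((start, pos))
    token_start = next(i for i, (s, e) in enumerate(spans) if s <= char_start < e)
    token_end = next(i for i, (s, e) in enumerate(spans) if s < char_end <= e)
    return token_start, token_end
```

For instance, in the string `'Len Blavatnik owns Access Industries'`, the character span `(19, 36)` covering `'Access Industries'` maps to token indices `(3, 4)`.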
Thanks for your work.
I am testing LUKE on the TACRED dataset using the provided Transformers libraries. However, I found that the text is processed into characters, whereas other methods, e.g., K-Adapter, work at the word level. Details as follows.

Converted input text:
```
Z a g a t S u r v e y , t h e g u i d e e m p i r e t h a t s t a r t e d a s a h o b b y f o r T i m a n d N i n a Z a g a t i n 1 9 7 9 ... .
```
Tokenized input IDs:
```
['<s>', 'ĠZ', '[HEAD]', 'Ġa', 'Ġg', 'Ġa', 'Ġt', '[HEAD]', 'Ġ', 'ĠS', 'Ġu', 'Ġr', 'Ġv', 'Ġe', 'Ġy', 'Ġ', 'Ġ', 'Ġ,', 'Ġ', 'Ġ', 'Ġt', 'Ġh', 'Ġe', 'Ġ', 'Ġ', 'Ġg', 'Ġu', 'Ġi', 'Ġd', 'Ġe', 'Ġ', 'Ġ', 'Ġe', 'Ġm', 'Ġp', 'Ġi', 'Ġr', 'Ġe', 'Ġ', 'Ġ', 'Ġt', 'Ġh', 'Ġa', 'Ġt', 'Ġ', 'Ġ', 'Ġs', 'Ġt', 'Ġa', 'Ġr', 'Ġt', 'Ġe', 'Ġd', 'Ġ', 'Ġ', 'Ġa', 'Ġs', 'Ġ', 'Ġ', 'Ġa', 'Ġ', 'Ġ', 'Ġh', 'Ġo', 'Ġb', 'Ġb', 'Ġy', 'Ġ', 'Ġ', 'Ġf', 'Ġo', 'Ġr', 'Ġ', 'Ġ', 'ĠT', 'Ġi', 'Ġm', 'Ġ', 'Ġ', 'Ġa', 'Ġn', 'Ġd', 'Ġ', 'Ġ', 'ĠN', 'Ġi', 'Ġn', 'Ġa', 'Ġ', 'Ġ', 'ĠZ', 'Ġa', 'Ġg', 'Ġa', 'Ġt', 'Ġ', 'Ġ', 'Ġi', 'Ġn', '[TAIL]'...
```
So, is TACRED processed at the character level, or at the word level?