nlpcl-lab / ace2005-preprocessing

ACE 2005 corpus preprocessing for Event Extraction task
MIT License
291 stars 72 forks

Train Preprocess Error: start_idx != -1 #5

Closed kkkyan closed 5 years ago

kkkyan commented 5 years ago

Hello, with the version from your latest commit, preprocessing the data fails with the following error:

70%|████████████████████████████▌ | 368/529 [31:16<18:40, 6.96s/it]
[Warning] The entity in the other sentence is mentioned. This argument will be ignored.
  File "main.py", line 162, in preprocessing
    phrase=event_mention['trigger']['text'],
  File "main.py", line 37, in find_token_index
    assert start_idx != -1, "start_idx: {}, start_pos: {}, phrase: {}, tokens: {}".format(start_idx, start_pos, phrase, tokens)
AssertionError: start_idx: -1, start_pos: -5, phrase: die, tokens: [
  {'index': 1, 'characterOffsetEnd': 3, 'characterOffsetBegin': 0, 'pos': 'WRB', 'word': 'How', 'lemma': 'how', 'originalText': 'How', 'before': '', 'after': ' '},
  {'index': 2, 'characterOffsetEnd': 9, 'characterOffsetBegin': 4, 'pos': 'MD', 'word': 'would', 'lemma': 'would', 'originalText': 'would', 'before': ' ', 'after': ' '},
  {'index': 3, 'characterOffsetEnd': 13, 'characterOffsetBegin': 10, 'pos': 'PRP', 'word': 'you', 'lemma': 'you', 'originalText': 'you', 'before': ' ', 'after': ' '},
  {'index': 4, 'characterOffsetEnd': 19, 'characterOffsetBegin': 14, 'pos': 'VB', 'word': 'react', 'lemma': 'react', 'originalText': 'react', 'before': ' ', 'after': ' '},
  {'index': 5, 'characterOffsetEnd': 22, 'characterOffsetBegin': 20, 'pos': 'TO', 'word': 'to', 'lemma': 'to', 'originalText': 'to', 'before': ' ', 'after': ' '},
  {'index': 6, 'characterOffsetEnd': 27, 'characterOffsetBegin': 23, 'pos': 'PDT', 'word': 'such', 'lemma': 'such', 'originalText': 'such', 'before': ' ', 'after': ' '},
  {'index': 7, 'characterOffsetEnd': 29, 'characterOffsetBegin': 28, 'pos': 'DT', 'word': 'a', 'lemma': 'a', 'originalText': 'a', 'before': ' ', 'after': ' '},
  {'index': 8, 'characterOffsetEnd': 34, 'characterOffsetBegin': 30, 'pos': 'NN', 'word': 'call', 'lemma': 'call', 'originalText': 'call', 'before': ' ', 'after': ''},
  {'index': 9, 'characterOffsetEnd': 35, 'characterOffsetBegin': 34, 'pos': '.', 'word': '?', 'lemma': '?', 'originalText': '?', 'before': '', 'after': ''}]
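For readers trying to reproduce this: the reported values (start_pos: -5, phrase: die, and a sentence that reads "How would you react to such a call?") suggest the trigger was paired with a sentence that does not contain it. The sketch below is not the repository's find_token_index; it is a minimal, assumed reconstruction of the start-index lookup, and the helper names find_start_index and trigger_in_sentence are hypothetical. It shows why a negative start_pos can never match a token, and how a caller might skip such a mention instead of asserting.

```python
from typing import Dict, List, Optional


def find_start_index(tokens: List[Dict], start_pos: int) -> Optional[int]:
    """Return the index of the last token that begins at or before start_pos.

    Returns None when start_pos lies before every token, which is the
    situation reported in the assertion message (start_pos: -5).
    """
    start_idx = None
    for idx, token in enumerate(tokens):
        if token["characterOffsetBegin"] <= start_pos:
            start_idx = idx
    return start_idx


def trigger_in_sentence(tokens: List[Dict], start_pos: int, phrase: str) -> bool:
    """Cheap sanity check before looking up token indices (hypothetical helper)."""
    if start_pos < 0:
        # The trigger starts before the sentence, so it belongs elsewhere.
        return False
    sentence_text = "".join(t["originalText"] + t.get("after", "") for t in tokens)
    return phrase in sentence_text


# First four tokens from the reported error, reduced to the fields used here.
tokens = [
    {"characterOffsetBegin": 0, "characterOffsetEnd": 3, "originalText": "How", "after": " "},
    {"characterOffsetBegin": 4, "characterOffsetEnd": 9, "originalText": "would", "after": " "},
    {"characterOffsetBegin": 10, "characterOffsetEnd": 13, "originalText": "you", "after": " "},
    {"characterOffsetBegin": 14, "characterOffsetEnd": 19, "originalText": "react", "after": " "},
]

if not trigger_in_sentence(tokens, -5, "die"):
    print("skipping trigger 'die': not inside this sentence")
```

Skipping the mention hides the symptom only; if the sentence and trigger offsets are misaligned upstream, that alignment is where the real fix belongs.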

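The previous version did not have this problem, although in that version entity recognition seemed to be wrong when processing the telephone data.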

yaof20 commented 4 years ago

Hello, was there a solution to this issue?
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '',

orans3 commented 3 years ago

I ran into the same problem. Before this error, the following warning is printed: '[Warning] fail to find offset! (start_index: {}, text: {}, path: {})'.format(start_index, text, self.path). I changed the offset search loop to "for i in range(0, 120)", and that solved the problem.
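To illustrate what widening that range does, here is a small sketch of an offset search. It is an assumption about the general technique, not the repository's parser code: the function name find_offset, its parameters, and the window sizes 50 and 120 in the example are all chosen for illustration. The idea is to try shifts of increasing size on both sides of the annotated start index until the annotated text matches the source text.

```python
from typing import Optional


def find_offset(source: str, start_index: int, text: str, max_shift: int) -> Optional[int]:
    """Search shifts 0, -1, +1, -2, +2, ... below max_shift for a match.

    Returns the shift at which source[start_index + shift:] begins with
    text, or None if no shift inside the window matches.
    """
    for shift in range(max_shift):
        for sign in (-1, 1):
            begin = start_index + sign * shift
            if begin < 0:
                continue  # cannot start before the beginning of the source
            if source[begin:begin + len(text)] == text:
                return sign * shift
    return None


# The annotated text sits 100 characters after the annotated start index.
source = "x" * 100 + "die" + "y" * 10

print(find_offset(source, 0, "die", max_shift=50))   # window too small
print(find_offset(source, 0, "die", max_shift=120))  # window large enough
```

A larger window finds mentions whose annotated position is further off, but it also raises the chance of matching a different occurrence of the same text, so short phrases are worth checking by hand after such a change.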