nlpcl-lab / ace2005-preprocessing

ACE 2005 corpus preprocessing for Event Extraction task
MIT License

exact sentence which caused 'end_idx = -1' issue #12

Open yaof20 opened 4 years ago

yaof20 commented 4 years ago

Hi there! Sorry to bother you again. I am using the ace_2005_td_v7_LDC2006T06.tgz dataset and have downloaded the latest version of this GitHub repo.

While processing the training data, an assertion error occurred:

assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,

I simply commented out the assertion, and main.py finished running without an exception.

Here is what I found in the output file:

"sentence": "\"Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.",
"golden-entity-mentions": [

  {
    "text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
    "entity-type": "ORG:Non-Governmental",
    "start": 12,
    **"end": -1**
  },...]

How can I solve this "end": -1 problem? Without a valid end index, the entity mention may be incomplete.
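One detail in the output above may hint at the cause: the "sentence" field contains \u8305 and \u732b where the entity text contains \u00e9 (é) and \u00e8 (è). Those two code points are exactly what the UTF-8 bytes of é and è turn into when decoded as GBK. The sketch below reproduces that. Note that GBK as the encoding in effect is an assumption; nothing in this thread confirms how the files were actually read, and `misdecode` is a hypothetical helper, not part of the repo.

```python
def misdecode(text, wrong_encoding="gbk"):
    """Encode text as UTF-8, then decode the bytes with a different codec.

    This mimics reading a UTF-8 file with a mismatched default encoding.
    The choice of "gbk" is an assumption made for illustration.
    """
    return text.encode("utf-8").decode(wrong_encoding)


# The accented part of the phrase from the entity text, written with escapes
# so the example does not depend on the encoding of this source file.
garbled = misdecode("M\u00e9decins Sans Fronti\u00e8res")

# Show the result the same way the JSON output shows it.
print(garbled.encode("unicode_escape").decode("ascii"))
# M\u8305decins Sans Fronti\u732bres
```

If that is what happened, the sentence text and the annotated phrase no longer contain the same characters, which would explain why the phrase cannot be matched. Opening the corpus files with an explicit UTF-8 encoding would be worth trying, but that is a guess and not a confirmed fix.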

Hanlard commented 4 years ago

I ran into the same problem!

scarydemon2 commented 4 years ago

I hit the same problem with the same data.

scarydemon2 commented 4 years ago

You can edit the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm. In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.

daviddongkc commented 3 years ago

Hi,

I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not have an LDC licence for ACE2005. Could you by any chance share the dataset for research purposes?

Many thanks, Regards, kc

yaof20 commented 3 years ago

Hi there,

Sorry for the late response. Are you still in need of the dataset? If so, contact me by email (fengya0@outlook.com).

Regards, Feng Yao

zyz0000 commented 2 years ago

In addition to changing é to e, you also need to change è to e to solve the problem.
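For anyone who prefers not to edit the two files by hand, the workaround can be scripted. This is a minimal sketch, assuming the files are UTF-8 and that only the words "Médecins" and "Frontières" need changing; `fold_accents` and `patch_file` are hypothetical helper names, and the path in the usage comment is a placeholder for wherever your copy of the corpus lives. Each accented letter is replaced by a single plain letter, so the text keeps the same number of characters.

```python
# Word-level replacements for the phrase discussed in this issue.
# Escapes are used so the script does not depend on its own file encoding.
REPLACEMENTS = {
    "M\u00e9decins": "Medecins",      # é -> e
    "Fronti\u00e8res": "Frontieres",  # è -> e
}


def fold_accents(text):
    """Return text with the accented words replaced by plain-ASCII spellings."""
    for accented, plain in REPLACEMENTS.items():
        text = text.replace(accented, plain)
    return text


def patch_file(path, encoding="utf-8"):
    """Rewrite one file in place; return True if anything changed.

    newline="" keeps the original line endings untouched on read and write.
    The UTF-8 default is an assumption about the corpus files.
    """
    with open(path, encoding=encoding, newline="") as f:
        original = f.read()
    patched = fold_accents(original)
    if patched == original:
        return False
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(patched)
    return True


# Usage (placeholder directory, adjust to your own layout):
# for name in ("alt.vacation.las-vegas_20050109.0133.apf.xml",
#              "alt.vacation.las-vegas_20050109.0133.sgm"):
#     patch_file("path/to/timex2norm/" + name)
```

Back up the original files first, since the script overwrites them.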