hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

Preprocessing the ACE dataset. #18

Closed salokr closed 5 months ago

salokr commented 5 months ago

Hi, thank you for uploading your code and awesome work to GitHub.

I have downloaded the ACE'05 dataset and would like to generate the code representation for it. Following your suggestions, I ran the following:

python preprocess_ace.py -i <path_to_raw_ace_files> -o <output_dir> -s <path_to_ACE05-E>

However, there were some issues with this, so I made the following changes (line 915):

if language == "english":
    sgm_files = glob.glob(os.path.join(input_path, "*.sgm"))

(line 1171)

input_dir = args.input  # os.path.join(args.input, args.lang.title())

After that, I was able to run your code and got the following files in the output directory:

dev.sentence.json
english.json
english.sentence.json
test.sentence.json
train.sentence.json

My first question: are the steps followed above correct?

If they are, I then ran the following command (because I only want to process the ACE dataset):

python -m src.generate_data \
     --configs \
        ${CONFIG_DIR}/ace_config.json \
     --output ${OUTPUT_DIR} \
     --overwrite_output_dir \
     --include_examples

but I get the following errors:

Traceback (most recent call last):
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/scratch/ssrivas6/meta_events/GoLLIE-main/src/generate_data.py", line 35, in multicpu_generator
    dataloader = dataloader_cls(config["train_file"], **config)
  File "/scratch/ssrivas6/meta_events/GoLLIE-main/src/tasks/ace/data_loader.py", line 465, in __init__
    raise ValueError(f"Argument {event['event_type']}:{argument['role']} not found!")
ValueError: Argument Movement:Transport:Person not found!
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/scratch/ssrivas6/meta_events/GoLLIE-main/src/generate_data.py", line 246, in <module>
    main(args)
  File "/scratch/ssrivas6/meta_events/GoLLIE-main/src/generate_data.py", line 188, in main
    pool.starmap(generator_fn, enumerate(configs))
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/opt/sw/spack/apps/linux-centos8-x86_64/gcc-9.3.0/python-3.8.6-ff/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: Argument Movement:Transport:Person not found!

How can I resolve this error?

osainz59 commented 5 months ago

You can find how we preprocessed ACE here: bash_scripts/preprocess_ace.sh

In principle, that error arises because there is an instance of a Movement:Transport event that has an argument of type Person, which is not defined in the guidelines. We did not face such an error when preprocessing ACE.

We did in fact make some small adaptations to remove inconsistencies from the annotations:

# Inconsistency between data and annotation guideline argument names
arg_name_mapping = {
    "ATTACK": {"Victim": "Target", "Agent": "Attacker"},
    "APPEAL": {"Plaintiff": "Prosecutor"},
    "PHONE-WRITE": {"Place": None},
}
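
As a rough illustration of how a mapping like this could be applied, here is a minimal sketch. It is not the repository's actual code: the helper names `normalize_role` and `normalize_event` are hypothetical, and it assumes that event types look like "Conflict:Attack", that the mapping is keyed by the upper-cased subtype, and that a value of None means the argument is dropped.

```python
from typing import Optional

# Mapping as quoted in the comment above.
arg_name_mapping = {
    "ATTACK": {"Victim": "Target", "Agent": "Attacker"},
    "APPEAL": {"Plaintiff": "Prosecutor"},
    "PHONE-WRITE": {"Place": None},
}


def normalize_role(event_type: str, role: str) -> Optional[str]:
    """Return the guideline role name, or None if the argument should be dropped.

    Assumption: event types are "Type:Subtype" strings and the mapping is
    keyed by the upper-cased subtype.
    """
    subtype = event_type.split(":")[-1].upper()
    # Roles without an entry in the mapping are returned unchanged.
    return arg_name_mapping.get(subtype, {}).get(role, role)


def normalize_event(event: dict) -> dict:
    """Return a copy of the event with renamed roles and dropped arguments."""
    arguments = []
    for argument in event.get("arguments", []):
        role = normalize_role(event["event_type"], argument["role"])
        if role is not None:
            arguments.append({**argument, "role": role})
    return {**event, "arguments": arguments}
```

Under these assumptions, a Movement:Transport event with a Person argument passes through unchanged, so a mapping of this kind would not by itself hide the error reported above.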

Can you provide the id of the event that has that event-argument combination?
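
One way to look up the offending event ids is to scan the preprocessed file directly. The sketch below is only a guess at the data layout: it assumes a OneIE-style file with one JSON object per line, each holding an "event_mentions" list whose entries have "id", "event_type" and "arguments" (with "role" and "text"). Only the "event_type" and "role" keys are confirmed by the traceback; the others, along with the function name `find_event_arguments`, are assumptions.

```python
import json
from typing import Iterator, Tuple


def find_event_arguments(
    path: str, event_type: str, role: str
) -> Iterator[Tuple[str, str, str]]:
    """Yield (sentence id, event id, argument text) for matching arguments.

    Assumes one JSON object per line with an "event_mentions" list;
    missing keys fall back to placeholder values instead of raising.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sentence = json.loads(line)
            for event in sentence.get("event_mentions", []):
                if event.get("event_type") != event_type:
                    continue
                for argument in event.get("arguments", []):
                    if argument.get("role") == role:
                        yield (
                            sentence.get("sent_id", "?"),
                            event.get("id", "?"),
                            argument.get("text", ""),
                        )
```

For example, iterating over `find_event_arguments("train.sentence.json", "Movement:Transport", "Person")` and printing each tuple should list the events that trigger the ValueError, provided the file follows the assumed layout.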

salokr commented 5 months ago

Hi,

Thank you for the swift response. I tried running the bash_scripts/preprocess_ace.sh file, but I get the following error:

src/dataset/ace_2005/data/**/timex2norm/*.sgm
src/dataset/ace_2005/data
Converting the dataset to JSON format
#SGM files: 0
0it [00:00, ?it/s]
Converting the dataset to OneIE format
Splitting the dataset into train/dev/test sets
Traceback (most recent call last):
  File "src/tasks/ace/preprocess_ace.py", line 1183, in <module>
    split_data(sentence_path, args.output, args.split)
  File "src/tasks/ace/preprocess_ace.py", line 1136, in split_data
    with open(os.path.join(split_path, "train.doc.txt")) as r:
FileNotFoundError: [Errno 2] No such file or directory: 'data/ace05/splits/train.doc.txt'

Where can I find the split files?
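
The log above shows two separate symptoms: the glob pattern matched no files ("#SGM files: 0"), and the split file train.doc.txt was missing. A small pre-flight check along these lines could surface both before running the full script. This is a sketch, not part of GoLLIE: the function name `check_ace_inputs` is hypothetical, and the dev.doc.txt and test.doc.txt file names are assumed by analogy with train.doc.txt.

```python
import glob
import os
from typing import List, Sequence


def check_ace_inputs(
    sgm_pattern: str,
    split_dir: str,
    split_names: Sequence[str] = ("train", "dev", "test"),
) -> List[str]:
    """Return a list of human-readable problems; empty if nothing is missing."""
    problems = []
    # "**" only matches nested directories when recursive=True is passed.
    sgm_files = glob.glob(sgm_pattern, recursive=True)
    if not sgm_files:
        problems.append(f"no .sgm files match {sgm_pattern!r}")
    for name in split_names:
        # Assumption: every split follows the "<name>.doc.txt" naming scheme.
        split_file = os.path.join(split_dir, f"{name}.doc.txt")
        if not os.path.isfile(split_file):
            problems.append(f"missing split file {split_file!r}")
    return problems
```

Calling `check_ace_inputs("src/dataset/ace_2005/data/**/timex2norm/*.sgm", "data/ace05/splits")` with the paths from the log would report whichever of the two inputs is absent.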

osainz59 commented 5 months ago

You can download the splits from the repo for the OneIE paper, here. Our preprocessing script is the same as theirs with minor tweaks.

salokr commented 5 months ago

nvm, I solved the issue. Thanks for the help and quick suggestions :) I will close the issue now.