nlpcl-lab / ace2005-preprocessing

ACE 2005 corpus preprocessing for Event Extraction task
MIT License

Support for Arabic #9

Open spookyQubit opened 4 years ago

spookyQubit commented 4 years ago

Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend main.py to support Arabic.

In my initial trials, I tried the following: 1) Created a data_list_arabic.csv file containing the train/dev/test splits. The first few lines of the file look like this:

type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127

2) Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:

arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}

3) Created the nlp_res_raw object as:

nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)

4) Downloaded the Arabic models:

cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar

Now when I run the script, I keep getting the following error: Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz.

I must be making a mistake somewhere, either not downloading the correct package or not pointing an environment variable to the correct location. Any help adding support for Arabic is greatly appreciated.
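In case it helps with debugging, one sanity check I can think of (not verified) is to confirm that the Arabic models jar actually contains the segmenter model and that the jar sits in the same folder as the other CoreNLP jars, assuming the Python wrapper puts every jar in that folder on the classpath:

cd stanford-corenlp-full-2018-10-05
# list the jar contents and look for the segmenter model referenced in the properties above
unzip -l stanford-arabic-corenlp-2018-02-27-models.jar | grep arabic-segmenter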

bowbowbow commented 4 years ago

Sorry for the late reply. @spookyQubit

Thank you for giving me the details of your approach. I'll implement it and make it possible to pre-process the Arabic data.

I think the error you mentioned comes from the Python wrapper library (https://github.com/Lynten/stanford-corenlp). Why don't you try another interface library (https://github.com/stanfordnlp/python-stanford-corenlp) for Stanford CoreNLP?
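For reference, basic usage of that client looks roughly like this (a sketch following that repo's README, not tested here; it reads the CoreNLP install location from the CORENLP_HOME environment variable, and wiring in your Arabic properties file would still need to be worked out):

# Sketch following the python-stanford-corenlp README (untested here).
# It expects CORENLP_HOME to point at the CoreNLP install directory, e.g.
#   export CORENLP_HOME=./stanford-corenlp-full-2018-10-05
import corenlp

text = 'Donald John Trump is current president of the United States.'
with corenlp.CoreNLPClient(annotators='tokenize ssplit pos lemma parse'.split()) as client:
    ann = client.annotate(text)           # returns a protobuf Document
    print(ann.sentence[0].token[0].word)  # first token of the first sentence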

spookyQubit commented 4 years ago

Hi @bowbowbow, I was finally able to get rid of the Failed to load segmenter error mentioned above. Instead of passing the properties to nlp.annotate inline (as in step 3 above), I passed it the path to the StanfordCoreNLP-arabic.properties file directly, which did the trick. I had to make some changes to main.py to support Arabic. The diff is shown below:

diff --git a/main.py b/main.py
index f3ddd9e..f19c022 100644
--- a/main.py
+++ b/main.py
@@ -8,9 +8,20 @@ import argparse
 from tqdm import tqdm

-def get_data_paths(ace2005_path):
+def get_arabic_properties():
+
+    arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
+                         'tokenize.language': 'ar',
+                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
+                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
+                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
+                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
+    return arabic_properties
+
+
+def get_data_paths(ace2005_path, mode_split_list='./data_list.csv'):
     test_files, dev_files, train_files = [], [], []
-    with open('./data_list.csv', mode='r') as csv_file:
+    with open(mode_split_list, mode='r') as csv_file:
         rows = csv_file.readlines()
         for row in rows[1:]:
             items = row.replace('\n', '').split(',')
@@ -89,7 +100,7 @@ def verify_result(data):
     print('Complete verification')

-def preprocessing(data_type, files):
+def preprocessing(data_type, files, lang='en'):
     result = []
     event_count, entity_count, sent_count, argument_count = 0, 0, 0, 0

@@ -109,7 +120,15 @@ def preprocessing(data_type, files):
             data['golden-event-mentions'] = []

             try:
-                nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                if lang == 'en':
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                elif lang == 'ar':
+                    properties_ar = get_arabic_properties()
+                    print(item['sentence'])
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties='./stanford-corenlp-full-2018-10-05/StanfordCoreNLP-arabic.properties')
+                    print('done')
+                else:
+                    raise NotImplementedError(f'Only en/ar supported. Got lang={lang}')
                 nlp_res = json.loads(nlp_res_raw)
             except Exception as e:
                 print('[Warning] StanfordCore Exception: ', nlp_res_raw, 'This sentence will be ignored.')
@@ -131,7 +150,6 @@ def preprocessing(data_type, files):
             data['pos-tags'] = list(map(lambda x: x['pos'], tokens))
             data['lemma'] = list(map(lambda x: x['lemma'], tokens))
             data['parse'] = nlp_res['sentences'][0]['parse']
-
             sent_start_pos = item['position'][0]

             for entity_mention in item['golden-entity-mentions']:
@@ -195,19 +213,23 @@ def preprocessing(data_type, files):
     print('argument:', argument_count)

     verify_result(result)
-    with open('output/{}.json'.format(data_type), 'w') as f:
+    with open('output/{}.json'.format(data_type), 'w', encoding='utf-8') as f:
         json.dump(result, f, indent=2)

 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('--data', help="Path of ACE2005 English data", default='./data/ace_2005_td_v7/data/English')
+    parser.add_argument('--mode_split_list', help="csv containing train/dev/test splits", default='./data_list.csv')
+    parser.add_argument('--lang', help="language, en/ar", default='en')
     args = parser.parse_args()
-    test_files, dev_files, train_files = get_data_paths(args.data)
+    test_files, dev_files, train_files = get_data_paths(args.data, args.mode_split_list)
+
+    print(get_arabic_properties())

-    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp:
+    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=600000000) as nlp:
         # res = nlp.annotate('Donald John Trump is current president of the United States.', properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
         # print(res)
-        preprocessing('dev', dev_files)
-        preprocessing('test', test_files)
-        preprocessing('train', train_files)
+        preprocessing('train', train_files, args.lang)
+        preprocessing('dev', dev_files, args.lang)
+        preprocessing('test', test_files, args.lang)
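With these changes, the script can be invoked roughly like this (paths are illustrative; --mode_split_list points at the data_list_arabic.csv from step 1, and --data at the Arabic portion of the corpus):

python main.py \
    --data ./data/ace_2005_td_v7/data/Arabic \
    --mode_split_list ./data_list_arabic.csv \
    --lang ar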

The problem now is that for Arabic, I keep getting CoreNLP request timed out. This is after I increased the timeout to 600000000! So, most of the Arabic sentences get dropped.
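A possible cause (just a guess, not verified): because the Arabic properties file is passed with every annotate() call, the server may be reloading the Arabic tagger/parser models per request, which could easily blow past the timeout. One sketch of a workaround would be to start the CoreNLP server manually with the Arabic properties as server-wide defaults, so the models are loaded once:

cd stanford-corenlp-full-2018-10-05
# start the server with the Arabic properties loaded once as server defaults
java -mx8g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-arabic.properties -port 9000 -timeout 60000

The Python wrapper could then be pointed at this already-running server instead of launching its own (the Lynten wrapper supports connecting to an existing endpoint, if I remember its README correctly).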

On the other hand, the preprocessor works beautifully for English.