spookyQubit opened this issue 4 years ago
Sorry for the late reply. @spookyQubit
Thank you for giving me the details of your approach. I'll implement it and make it possible to pre-process the Arabic data.
I think the error you mentioned comes from the Python wrapper library (https://github.com/Lynten/stanford-corenlp). Why don't you try another interface library (https://github.com/stanfordnlp/python-stanford-corenlp) for the StanfordCoreNLP model?
Hi @bowbowbow, I was finally able to get rid of the Failed to load segmenter
error mentioned above. Instead of passing the properties to nlp.annotate
as a string, I passed it the StanfordCoreNLP-arabic.properties
file directly which did the trick. I had to make some changes to main.py to support Arabic. The diff is shown below:
diff --git a/main.py b/main.py
index f3ddd9e..f19c022 100644
--- a/main.py
+++ b/main.py
@@ -8,9 +8,20 @@ import argparse
from tqdm import tqdm
-def get_data_paths(ace2005_path):
+def get_arabic_properties():
+
+ arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
+ 'tokenize.language': 'ar',
+ 'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
+ 'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
+ 'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
+ 'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
+ return arabic_properties
+
+
+def get_data_paths(ace2005_path, mode_split_list='./data_list.csv'):
test_files, dev_files, train_files = [], [], []
- with open('./data_list.csv', mode='r') as csv_file:
+ with open(mode_split_list, mode='r') as csv_file:
rows = csv_file.readlines()
for row in rows[1:]:
items = row.replace('\n', '').split(',')
@@ -89,7 +100,7 @@ def verify_result(data):
print('Complete verification')
-def preprocessing(data_type, files):
+def preprocessing(data_type, files, lang='en'):
result = []
event_count, entity_count, sent_count, argument_count = 0, 0, 0, 0
@@ -109,7 +120,15 @@ def preprocessing(data_type, files):
data['golden-event-mentions'] = []
try:
- nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+ if lang == 'en':
+ nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+ elif lang == 'ar':
+ nlp_res_raw = nlp.annotate(item['sentence'], properties='./stanford-corenlp-full-2018-10-05/StanfordCoreNLP-arabic.properties')
+ else:
+ raise NotImplementedError(f'Only en/ar supported. Got lang={lang}')
nlp_res = json.loads(nlp_res_raw)
except Exception as e:
print('[Warning] StanfordCore Exception: ', nlp_res_raw, 'This sentence will be ignored.')
@@ -131,7 +150,6 @@ def preprocessing(data_type, files):
data['pos-tags'] = list(map(lambda x: x['pos'], tokens))
data['lemma'] = list(map(lambda x: x['lemma'], tokens))
data['parse'] = nlp_res['sentences'][0]['parse']
-
sent_start_pos = item['position'][0]
for entity_mention in item['golden-entity-mentions']:
@@ -195,19 +213,23 @@ def preprocessing(data_type, files):
print('argument:', argument_count)
verify_result(result)
- with open('output/{}.json'.format(data_type), 'w') as f:
+ with open('output/{}.json'.format(data_type), 'w', encoding='utf-8') as f:
json.dump(result, f, indent=2)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--data', help="Path of ACE2005 English data", default='./data/ace_2005_td_v7/data/English')
+ parser.add_argument('--mode_split_list', help="csv containing train/dev/test splits", default='./data_list.csv')
+ parser.add_argument('--lang', help="language, en/ar", default='en')
args = parser.parse_args()
- test_files, dev_files, train_files = get_data_paths(args.data)
+ test_files, dev_files, train_files = get_data_paths(args.data, args.mode_split_list)
- with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp:
+ with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=600000000) as nlp:
# res = nlp.annotate('Donald John Trump is current president of the United States.', properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
# print(res)
- preprocessing('dev', dev_files)
- preprocessing('test', test_files)
- preprocessing('train', train_files)
+ preprocessing('train', train_files, args.lang)
+ preprocessing('dev', dev_files, args.lang)
+ preprocessing('test', test_files, args.lang)
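If the file-vs-dict distinction ever matters (some wrapper versions serialize a dict differently from a file path), the properties file can also be loaded into a plain dict first. A minimal sketch, assuming simple `key = value` lines as in the Java properties format:

```python
def load_properties(path):
    """Parse a simple Java-style .properties file into a dict.

    Handles blank lines and '#'/'!' comments; line continuations and
    ':'-separated pairs are ignored for brevity.
    """
    props = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(('#', '!')):
                continue
            key, sep, value = line.partition('=')
            if sep:  # only keep lines that actually contain '='
                props[key.strip()] = value.strip()
    return props
```

With this, `nlp.annotate(sentence, properties=load_properties('./stanford-corenlp-full-2018-10-05/StanfordCoreNLP-arabic.properties'))` keeps the same call style as the English branch.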
The problem now is that for Arabic I keep getting CoreNLP request timed out, even after increasing the timeout to 600000000! As a result, most of the Arabic sentences get dropped.
On the other hand, the preprocessor works beautifully for English.
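One way to keep a single hung request from stalling the whole run is to bound each call individually and skip sentences that still time out. A minimal sketch; `annotate` here is a stand-in for whatever callable wraps nlp.annotate, and the 60-second default is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def annotate_with_timeout(annotate, sentence, timeout_s=60):
    """Run annotate(sentence) in a worker thread; return None on timeout.

    shutdown(wait=False) lets the caller move on immediately, though the
    hung worker thread lingers until its request finally returns.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(annotate, sentence)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None
    finally:
        pool.shutdown(wait=False)
```

The preprocessing loop could then log and skip sentences where this returns None, instead of relying on one huge global timeout.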
Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend main.py to support Arabic.
In my initial trials, I tried the following: 1) Created a data_list_arabic.csv file with the train/dev/test splits. The first few lines of the file look like the following:
2) Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
3) Created the nlp_res_raw object as:
4) Downloaded the Arabic models:
Now when I run the script, I keep getting the following error:
Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz
I must be making a mistake somewhere, perhaps not downloading the correct package or not pointing an environment variable to the correct location. Any help adding support for Arabic is greatly appreciated.
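Since CoreNLP model jars are ordinary zip archives, one quick sanity check for the Failed to load segmenter error is to verify that the segmenter resource actually exists inside the Arabic models jar on the classpath. A minimal sketch; the jar filename in the example is an assumption, so check what you actually downloaded:

```python
import zipfile

def jar_contains(jar_path, resource):
    """Return True if `resource` is an entry inside the jar (jars are zip files)."""
    with zipfile.ZipFile(jar_path) as jar:
        return resource in jar.namelist()

# Example (hypothetical jar name -- substitute the Arabic models jar you downloaded):
# jar_contains('./stanford-corenlp-full-2018-10-05/stanford-arabic-corenlp-2018-10-05-models.jar',
#              'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz')
```

If the resource is missing from every jar in the CoreNLP folder, the server has no way to load the segmenter regardless of the properties you pass.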