husseinmozannar / SOQAL

Arabic Open Domain Question Answering System using Neural Reading Comprehension
MIT License

SQuAD JSON format with combine_json_files function #15

Closed YousefGh closed 3 years ago

YousefGh commented 3 years ago

When the combine_json_files function in SOQAL.data_helpers.data_split is given files in a SQuAD-like JSON format (DrQA format B), it produces a JSON file whose 'data' value contains only the top-level keys of the input files instead of their articles. The cause is the commented-out ['data'] in the line below, followed by a loop that then iterates over the keys of the root dictionary (which is what iterating over a dictionary does):

data = json.load(f)#['data']
for article in data:
   combined_data.append(article)

The loop does not iterate over articles. It is equivalent to for key in data, where data is the JSON root, so combining two files produces:

{"data": ["data", "version", "data", "version"], "version": "1.1"}
husseinmozannar commented 3 years ago

Thanks for the catch. So for your data, just uncommenting ['data'] fixed the issue, right? I think I had to modify this for some special JSON files and forgot to change it back.

YousefGh commented 3 years ago

Yes, it fixes the issue for any SQuAD-like JSON, because it changes what the loop iterates over from a dictionary (its keys) to an array (the articles inside 'data').
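For illustration, a file-based combiner that always reads the 'data' array might look like the sketch below. The function name combine_squad_files, its parameters, and the explicit format check are assumptions made for this example; the real signature of combine_json_files is not shown in this thread.

```python
import json


def combine_squad_files(input_paths, output_path, version="1.1"):
    """Merge the 'data' arrays of several SQuAD-style files (hypothetical helper)."""
    combined_data = []
    for path in input_paths:
        with open(path, encoding="utf-8") as f:
            root = json.load(f)
        # Fail loudly on files that are not {"data": [...], ...} instead of
        # silently collecting dictionary keys.
        if not isinstance(root, dict) or not isinstance(root.get("data"), list):
            raise ValueError(f"{path}: expected a JSON object with a 'data' list")
        combined_data.extend(root["data"])

    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Arabic text readable in the output file.
        json.dump({"data": combined_data, "version": version}, f, ensure_ascii=False)
    return len(combined_data)
```

The format check is a design choice for this sketch: files that need special handling are rejected with an error rather than merged incorrectly.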

This is my validator:

a = QustionAnsweringJSON("a.json")
b = QustionAnsweringJSON("b.json")
ab = QustionAnsweringJSON("turk_combined_all.json")

a.show_info()
b.show_info()
ab.show_info()

Which outputs:

Number of Articles: 78
Number of Paragraphs: 234
Number of Questions: 702
Number of Answers: 702

Number of Articles: 77
Number of Paragraphs: 231
Number of Questions: 693
Number of Answers: 693

Number of Articles: 155
Number of Paragraphs: 465
Number of Questions: 1395
Number of Answers: 1395
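The totals for the combined file are the sums of the two inputs (78 + 77 = 155 articles, 234 + 231 = 465 paragraphs, 702 + 693 = 1395 questions and answers), which is what a correct merge should give. The QustionAnsweringJSON class itself is not shown in this thread, so the following is only a rough sketch of the same check, assuming the standard SQuAD nesting (data, paragraphs, qas, answers); squad_counts and toy_article are hypothetical names and the sample content is made up.

```python
def squad_counts(root):
    """Count articles, paragraphs, questions and answers in a SQuAD-style dict."""
    articles = root["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    answers = [x for q in questions for x in q["answers"]]
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        "answers": len(answers),
    }


def toy_article(title):
    # One paragraph with one question and one answer (made-up content).
    qa = {
        "id": title + "-q1",
        "question": "?",
        "answers": [{"text": "x", "answer_start": 0}],
    }
    return {"title": title, "paragraphs": [{"context": "x", "qas": [qa]}]}


a = {"data": [toy_article("A1"), toy_article("A2")], "version": "1.1"}
b = {"data": [toy_article("B1")], "version": "1.1"}
ab = {"data": a["data"] + b["data"], "version": "1.1"}

counts_a, counts_b, counts_ab = squad_counts(a), squad_counts(b), squad_counts(ab)

# A correct merge keeps every article, so each total is the sum of the inputs.
assert all(counts_ab[k] == counts_a[k] + counts_b[k] for k in counts_ab)
print(counts_ab)
```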
husseinmozannar commented 3 years ago

thanks for this Yousef! I put a notice and fixed it