qcri / LLMeBench

Benchmarking Large Language Models

Parsing NER task #61

Open fdalvi opened 1 year ago

fdalvi commented 1 year ago

Sometimes, the outputs are like:

{
  "label": "B-LOC B-LOC O B-PERS I-PERS O O O O B-PERS I-PERS O O O O O O O O O O O O O O O O O O B-LOC B-LOC O O O O O O O O O O O O B-PERS O O O O O O O O O O O O O O O O O O O O B-LOC B-LOC O O O O O O O",
  "model_output": "Output: [('الصالحية', 'LOC'), ('المفرق', 'LOC'), ('-', 'O'), ('غيث', 'PER'), ('الطراونة', 'PER'), ('-', 'O'), ('أمر', 'O'), ('جلالة', 'O'), ('الملك', 'PER'), ('عبدالله', 'PER'), ('الثاني', 'PER'), ('أمس', 'O'), ('بتنفيذ', 'O'), ('حزمة', 'O'), ('من',... ('التحديات', 'O'), ('التي', 'O'), ('يواجهها', 'O'), ('أبناء', 'O'), ('الصالحية', 'LOC'), ('ونايفة', 'LOC'), ('خصوصا', 'O'), ('فيما', 'O'), ('يتعلق', 'O'), ('بمشكلتي', 'O'), ('الفقر', 'O'), ('والبطالة', 'O'), ('.', 'O')]",
}

Should we count LOC as "B-LOC"? And for consecutive tokens of the same type, should the first be "B-" and the second "I-"? That is not always correct: the first two tokens in the example above are both "B-LOC" in the gold labels. Up for discussion, @firojalam @baselmousi.
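To make the question concrete, here is a minimal sketch of the naive conversion, assuming the model response always contains `('token', 'TAG')` pairs as in the example above. The names `parse_pairs`, `to_bio`, and `ALIASES` are hypothetical and not part of LLMeBench; the `PER` to `PERS` mapping is an assumption based only on this one sample.

```python
import re

# Matches pairs such as ('token', 'LOC') in the raw model response.
# Assumption: tokens and tags are single-quoted and contain no apostrophes.
PAIR_RE = re.compile(r"\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")

# Hypothetical alias map: the model writes 'PER' while the gold labels use 'PERS'.
ALIASES = {"PER": "PERS"}


def parse_pairs(model_output):
    """Return a list of (token, tag) tuples found in the response text."""
    return PAIR_RE.findall(model_output)


def to_bio(tags):
    """Naive rule: first tag of a run gets 'B-', repeats of the same type get 'I-'."""
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            bio.append("O")
        elif tag == prev:
            bio.append("I-" + tag)
        else:
            bio.append("B-" + tag)
        prev = tag
    return bio


# First five pairs from the example response above.
sample = (
    "Output: [('الصالحية', 'LOC'), ('المفرق', 'LOC'), ('-', 'O'), "
    "('غيث', 'PER'), ('الطراونة', 'PER')]"
)
tags = [ALIASES.get(tag, tag) for _, tag in parse_pairs(sample)]
predicted = to_bio(tags)
gold = "B-LOC B-LOC O B-PERS I-PERS".split()

# Positions where the naive rule disagrees with the gold labels.
mismatches = [i for i, (p, g) in enumerate(zip(predicted, gold)) if p != g]
```

On this sample the naive rule yields `I-LOC` for the second token where the gold label is `B-LOC`, so `mismatches` contains index 1. Two adjacent single-token entities of the same type cannot be recovered from untagged output with this rule.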

baselmousi commented 1 year ago

Thanks for bringing this up. I will prepare output files that compare the gold labels with the post-processed responses for both gpt-3.5 and gpt-4. Considering 5 labels instead of 9 will improve the results quite a bit.
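One way to read "5 labels instead of 9" is to drop the "B-"/"I-" prefixes from the gold labels before scoring, so that only the entity type (or "O") is compared. The sketch below assumes that reading; `strip_bio_prefix` is a hypothetical helper, not an existing LLMeBench function.

```python
def strip_bio_prefix(label):
    """Map 'B-LOC' or 'I-LOC' to 'LOC'; leave 'O' and unprefixed tags unchanged."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


# First five gold labels from the example in this issue.
gold = "B-LOC B-LOC O B-PERS I-PERS".split()
collapsed_gold = [strip_bio_prefix(label) for label in gold]

# Model tags for the same tokens, with 'PER' already normalized to 'PERS'
# (that normalization is an assumption based on the sample output).
model_tags = ["LOC", "LOC", "O", "PERS", "PERS"]

# Token-level agreement once boundaries are ignored.
matches = sum(g == m for g, m in zip(collapsed_gold, model_tags))
```

With the prefixes removed, all five tokens in this sample agree, whereas the prefixed comparison is ambiguous on the second token.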