google-research / google-research

Google Research
https://research.google
Apache License 2.0

bug in VRDU evaluation code #1882

Open minouei-kl opened 8 months ago

minouei-kl commented 8 months ago

https://github.com/google-research/google-research/blob/411cc95cb398c33046ab92028476bb38776965da/vrdu/benchmark_utils.py#L655

In the VRDU dataset evaluation, the code mishandles missing ground truth. The check for its absence should precede the other conditions so that precision, recall, and F1 are correctly set to -1, but the current implementation fails to do so. I have prepared a patch for this, if I may create a pull request: https://github.com/minouei-kl/google-research/tree/minouei-kl-patch-1
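To illustrate the ordering described above, here is a minimal sketch. It is not the code from benchmark_utils.py; the function name `precision_recall_f1`, its count-based arguments, and the choice to score an empty extraction as 0 are all assumptions made for illustration. The only point it shows is that the missing-ground-truth check runs before any other branch.

```python
def precision_recall_f1(num_matched, num_extracted, num_ground_truth):
    """Hypothetical helper: returns (precision, recall, f1) for one entity type."""
    # Check for missing ground truth FIRST, so that no later branch can
    # overwrite the -1 sentinel with 0 or with a computed score.
    if num_ground_truth == 0:
        return -1.0, -1.0, -1.0

    # Assumption: ground truth exists but nothing was extracted -> all zeros.
    if num_extracted == 0:
        return 0.0, 0.0, 0.0

    precision = num_matched / num_extracted
    recall = num_matched / num_ground_truth
    if precision + recall == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With this ordering, a document that has extractions but no ground truth for an entity gets -1 for all three scores rather than a precision of 0.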

minouei-kl commented 8 months ago

There is another issue with the evaluation of the line_items. Each line_item contains multiple keys, such as ('channel', 'program_start_date', 'program_end_date', 'program_desc', 'sub_amount'), but the order of these keys in the dataset is inconsistent. For instance, in the sample "08b27cc0-dcaf-2f19-0c24-975a6a9c6e45.pdf" the keys appear in two different orders within the same document. The function get_matching_result_per_doc disregards a match between the ground truth and the extraction if the order of the keys does not match. For example:

gt_entity_item: (('sub_amount', 'channel', 'program_start_date', 'program_end_date', 'program_desc'), [(...)]) (('$450.00\n', (...), [...]), ('WWJ ', (...), [...]), ('05/27/20 ', (...), [...]), ('06/02/20 ', (...), [...]), ('The Late Show\n', (...), [...]))

ex_entity_item: ('channel', 'program_start_date', 'program_end_date', 'program_desc', 'sub_amount') (['WWJ'], ['05/27/20'], ['06/02/20'], ['The Late Show'], ['$450.00'])

Rejecting a match on key order alone is a weak method of assessment. This can be avoided by sorting the keys in both the ground truth and the extraction before matching. I hacked my way around it in this branch: https://github.com/minouei-kl/google-research/tree/minouei-kl-patch-2
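A minimal sketch of the sorting idea, not the code in the patch branch: the helper `sort_line_item` is a hypothetical name, and the values are simplified to plain, already-stripped strings (the real ground-truth values are tuples that also carry elided fields, and the extracted values are lists). It only shows that pairing each key with its value and sorting by key makes the two orderings comparable.

```python
def sort_line_item(keys, values):
    """Hypothetical helper: reorder a line item so its keys are sorted.

    Keys and values are zipped first, so each value stays attached to
    its own key while the pairs are sorted by key name.
    """
    pairs = sorted(zip(keys, values), key=lambda pair: pair[0])
    sorted_keys = tuple(key for key, _ in pairs)
    sorted_values = tuple(value for _, value in pairs)
    return sorted_keys, sorted_values


# Simplified versions of the two items from the example above.
gt_keys = ("sub_amount", "channel", "program_start_date",
           "program_end_date", "program_desc")
gt_values = ("$450.00", "WWJ", "05/27/20", "06/02/20", "The Late Show")

ex_keys = ("channel", "program_start_date", "program_end_date",
           "program_desc", "sub_amount")
ex_values = ("WWJ", "05/27/20", "06/02/20", "The Late Show", "$450.00")

same_item = sort_line_item(gt_keys, gt_values) == sort_line_item(ex_keys, ex_values)
```

Here `same_item` is true even though the two key tuples differ in order, whereas comparing the raw tuples directly would report a mismatch.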

minouei-kl commented 7 months ago

The function group_repeated_entities_into_nested_entities is completely wrong. It is disappointing that researchers managed to write such messy code. This function accepts a list of entity names and their values and groups the line_items by their occurrence. It works when all the keys are present, but if a line_item in the middle of the list contains only two items, the function puts them in the wrong group, and all the groups afterwards are wrong as well.
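The failure mode can be illustrated with a small sketch. This is not the repository's group_repeated_entities_into_nested_entities; `group_by_occurrence` is a hypothetical stand-in that assumes "grouping by occurrence" means the i-th occurrence of each key goes into the i-th group. Under that assumption, a line item with a missing key shifts every later value of that key into the wrong group.

```python
from collections import defaultdict


def group_by_occurrence(entities):
    """Hypothetical occurrence-based grouping (assumed behaviour, for illustration).

    The i-th time a key is seen, its value is placed in group i,
    regardless of which line item it actually belongs to.
    """
    seen = defaultdict(int)  # how many times each key has appeared so far
    groups = []
    for name, value in entities:
        index = seen[name]
        seen[name] += 1
        while len(groups) <= index:
            groups.append({})
        groups[index][name] = value
    return groups


# Three line items; the middle one (channel "B") has no sub_amount.
flat = [
    ("channel", "A"), ("sub_amount", "1"),
    ("channel", "B"),
    ("channel", "C"), ("sub_amount", "3"),
]
groups = group_by_occurrence(flat)
```

In this example the second group ends up pairing channel "B" with sub_amount "3", which belongs to channel "C", and the third group is left without its amount. A grouping that is robust to missing keys would need some other signal for where one line item ends and the next begins.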