thu-coai / ConvLab-2

ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems

evaluate vs test #222

IreneSucameli opened this issue 2 years ago (status: Open)

IreneSucameli commented 2 years ago

Hi, the difference between the evaluate.py and test.py scripts in the NLU folder is not clear to me. What exactly do they evaluate? The results they produce are completely different even when they take the same test set as input.

zqwerty commented 2 years ago

evaluate.py uses the unified interface inherited from NLU, e.g., class BERTNLU(NLU). Each NLU model should provide such a class so that different models can be compared given the same inputs. test.py is the test script for BERTNLU only, which may use different preprocessing. However, the difference should not be large.
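
For reference, a minimal sketch of what that unified interface looks like (the names follow the convlab2/nlu module, but treat the exact signature and the example output as assumptions):

```python
from abc import ABC, abstractmethod

class NLU(ABC):
    """Base interface every NLU model implements so models are comparable."""

    @abstractmethod
    def predict(self, utterance, context=list()):
        """Map an utterance (plus dialog context) to a list of dialog acts,
        each of the form [intent, domain, slot, value]."""
        raise NotImplementedError

class BERTNLU(NLU):
    """BERT-based NLU exposing the same predict() entry point."""

    def predict(self, utterance, context=list()):
        # Model-specific tokenization and inference would go here, followed
        # by recover_intent() to map logits back to dialog acts.
        return [["Inform", "Restaurant", "Food", "italian"]]
```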

IreneSucameli commented 2 years ago

Ok, so evaluate.py is used to compare the performance of different NLU models, while if I want to test only BERTNLU I should use test.py? It is not clear to me why test.py calls the functions is_slot_da, calculate_F1, and recover_intent while evaluate.py does not. On what basis is the overall performance computed in evaluate.py if neither slots nor intents are recovered? Thanks

zqwerty commented 2 years ago

Yes, evaluate.py is used to compare the performance of different NLU models. It will be slower than test.py since it uses batch_size=1. If you only want to test BERTNLU (e.g., to tune some hyper-parameters) and are not comparing it with other NLU models, you can use test.py for verification. The difference should not be large.

evaluate.py does call recover_intent in BERTNLU: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/nlu.py#L106. And calculate_F1 is called by both evaluate.py and test.py: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/postprocess.py#L13

is_slot_da decides whether a dialog act (intent, domain, slot, value) is non-categorical, meaning the value appears in the sentence and BERTNLU extracts it with the slot-tagging method (e.g., informing the name of a restaurant). If is_slot_da is False, we instead use the [CLS] representation for binary classification to judge whether such a dialog act exists (e.g., requesting the name of a restaurant). We evaluate the two kinds of dialog acts separately and report slot F1 and intent F1 respectively. However, these metrics may not apply to other NLU models, such as generative models, so they are not included in evaluate.py.
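
To make the distinction concrete, here is a hypothetical simplification of what such a check could look like (the real is_slot_da in postprocess.py may use different conditions):

```python
def is_slot_da(da):
    """Hypothetical sketch: True if the act's value must be tagged in the
    utterance (non-categorical, e.g. Inform name="pizza hut"), False if it
    is judged by binary classification on [CLS] (e.g. Request name=?)."""
    intent, domain, slot, value = da
    # Request-style acts carry no real value to extract from the sentence.
    return intent not in {"Request", "general"} and value not in {"", "?", "none"}
```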

In evaluate.py we directly evaluate the dialog act F1, comparing two lists of (intent, domain, slot, value) tuples.
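
A minimal sketch of that tuple-level comparison for a single utterance (the helper name here is illustrative, not the repo's):

```python
def count_da_matches(predicted, gold):
    """Count TP/FP/FN between predicted and gold dialog acts, where each
    act is an (intent, domain, slot, value) tuple."""
    gold_left = list(gold)
    tp = 0
    for da in predicted:
        if da in gold_left:
            tp += 1
            gold_left.remove(da)  # each gold act can match at most once
    fp = len(predicted) - tp      # predicted acts with no gold match
    fn = len(gold) - tp           # gold acts the model missed
    return tp, fp, fn
```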

IreneSucameli commented 2 years ago

Hi, thanks for your reply. Could you please specify whether in test.py the Recall, Precision, and F1 scores are micro-averaged?

zqwerty commented 2 years ago

Yes, they are. TP, FP, and FN are accumulated over the whole test set.
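
In other words, micro-averaging here means summing the per-utterance counts first and then computing a single score. An illustrative sketch (not the repo's calculate_F1, though the accumulation follows what is described above):

```python
def micro_prf(per_utterance_counts):
    """per_utterance_counts: iterable of (tp, fp, fn) triples, one per test
    utterance; returns micro-averaged precision, recall, and F1."""
    TP = sum(c[0] for c in per_utterance_counts)
    FP = sum(c[1] for c in per_utterance_counts)
    FN = sum(c[2] for c in per_utterance_counts)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```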