It is quite time-consuming to run the evaluation in the simulator. If you could provide the F1 scores and accuracies, it would be more convenient to compare performance during training. By the way, I always get 70-80% F1 and 0% accuracy with the last trained model on both the seen and unseen validation splits. Is this consistent with your experiments? Maybe a mistake in my code causes this. Please help me.
Hi @wqshmzh,
We did not measure F1 scores or accuracy in our experiments, so we do not have these values.
The code that computes the F1 score and accuracy comes from the original ALFRED repo, so if you used `seq2seq_im_mask.compute_metric`, the numbers you see are unlikely to be caused by a mistake in your code.
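For intuition, here is a minimal sketch contrasting token-level F1 with exact-match accuracy over predicted action sequences. This is illustrative only, not the exact ALFRED implementation; `pred_actions` and `gold_actions` are hypothetical names:

```python
from collections import Counter

def f1_and_acc(pred_actions, gold_actions):
    """Token-level F1 and exact-match accuracy for one episode (illustrative).

    Shows why a model can score a high F1 (many overlapping tokens)
    yet 0% accuracy (no predicted sequence matches the ground truth exactly).
    """
    # Overlap counts each token up to the number of times it appears in both.
    overlap = sum((Counter(pred_actions) & Counter(gold_actions)).values())
    precision = overlap / len(pred_actions) if pred_actions else 0.0
    recall = overlap / len(gold_actions) if gold_actions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    acc = float(pred_actions == gold_actions)  # exact sequence match
    return f1, acc

# One wrong token out of three: F1 ~0.67, accuracy 0.0.
print(f1_and_acc(["LookDown", "MoveAhead", "PickupObject"],
                 ["LookDown", "MoveAhead", "PutObject"]))
```

If the accuracy reported by the repo is an exact-match metric over whole action sequences (an assumption here), then 70-80% F1 alongside 0% accuracy would not be surprising on long ALFRED trajectories.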
We empirically observe that a higher F1 score or accuracy does not necessarily lead to a higher success rate (SR) on ALFRED tasks, because these metrics do not take into account the error-recovery capability of the agent. So it might be better to run the evaluation in the simulator on a subset of the validation episodes, instead of relying on the F1 and accuracy metrics.
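If simulator evaluation on the full validation set is too slow, one option is to evaluate on a fixed random subset. A minimal sketch, assuming the standard ALFRED splits layout (a JSON dict mapping split names to lists of task entries); the paths, sample size, and flag name below are assumptions to check against your checkout:

```python
import json
import random

# Build a smaller splits file so simulator evaluation finishes quickly.
# Assumes the standard ALFRED splits layout: a JSON dict mapping split
# names ("valid_seen", "valid_unseen", ...) to lists of task entries.
# Paths and sample size are placeholders -- adjust to your setup.
random.seed(0)  # fixed seed so the subset is comparable across checkpoints

with open("data/splits/oct21.json") as f:
    splits = json.load(f)

for split in ("valid_seen", "valid_unseen"):
    n = min(100, len(splits[split]))
    splits[split] = random.sample(splits[split], n)

with open("data/splits/oct21_subset.json", "w") as f:
    json.dump(splits, f)

# Then point the evaluation script at the subset file, e.g. via its
# --splits argument (check the flags exposed by the script you run).
```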
Hope it helps.
Closing this due to inactivity. Feel free to reopen this if you have any questions.