Closed romainf28 closed 1 year ago
Hello @romainf28, thank you for your question!
We do plan to add interoperability between inseq and ferret soon, and I'll have forthcoming work addressing plausibility evaluation for sequence generation models.
In general, I think faithfulness evaluation can be naturally extended to sequence generation models in the way you mention, although I recommend using metrics that account for the magnitude of attribution scores, such as Soft-Comprehensiveness and Soft-Sufficiency (Zhao and Aletras, 2023). In contrast, I think the evaluation of plausibility should focus specifically on phenomena (i.e. specific tokens in the generation) for which there is a human-understandable cue in the preceding context. AUPRC and MRR would be good choices in this latter setting.
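To make the plausibility suggestion concrete, here is a minimal sketch of how AUPRC and a reciprocal rank could be computed for a single generated token. It assumes you already have one attribution score per context token and a binary mask marking the human-annotated cue tokens. The function names and the example values are hypothetical and are not part of the inseq or ferret APIs.

```python
from typing import Sequence


def reciprocal_rank(scores: Sequence[float], cue_mask: Sequence[int]) -> float:
    """Return 1 / rank of the highest-ranked cue token (0.0 if there is no cue).

    Ties are broken by token position, since Python's sort is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if cue_mask[idx]:
            return 1.0 / rank
    return 0.0


def average_precision(scores: Sequence[float], cue_mask: Sequence[int]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve."""
    n_cues = sum(1 for m in cue_mask if m)
    if n_cues == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if cue_mask[idx]:
            hits += 1
            total += hits / rank  # precision at each rank where a cue is retrieved
    return total / n_cues


# Hypothetical example: four context tokens, cues at positions 1 and 3.
scores = [0.5, 0.3, 0.15, 0.05]
cue_mask = [0, 1, 0, 1]
rr = reciprocal_rank(scores, cue_mask)
ap = average_precision(scores, cue_mask)
```

MRR is then the mean of `reciprocal_rank` over all annotated examples, and the per-example average precision values can be averaged in the same way.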
Hope this helps!
@romainf28 you might also want to check out our Discord server! Join link: https://discord.gg/V5VgwwFPbu. ferret authors are also there, so it would be a great place to discuss such matters! :slightly_smiling_face:
Let me know if this answers your question, so that I can proceed to close the issue!
Thank you for your help, @gsarti! You can close the issue. I will join the Discord server!
Question
Thanks for providing a very useful library for applying feature attributions to seq2seq models. I was wondering how you plan to integrate ferret evaluation metrics into Inseq. For instance, will you compute AOPC comprehensiveness for every generated token and then average the scores? Or are you planning to design completely new metrics for evaluating feature attributions on seq2seq models?
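For reference, the averaging scheme described in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: `prob_fn` is a hypothetical callable returning the model's probability of the generated token at a given step when conditioned on a (possibly reduced) list of source tokens, and none of these names come from inseq or ferret.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (source_tokens, generation_step) -> probability of
# the token that was generated at that step.
ProbFn = Callable[[Sequence[str], int], float]


def aopc_comprehensiveness(
    tokens: Sequence[str],
    scores: Sequence[float],
    step: int,
    prob_fn: ProbFn,
    fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> float:
    """Mean probability drop after removing the top-attributed source tokens."""
    full_prob = prob_fn(tokens, step)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    drops: List[float] = []
    for frac in fractions:
        k = max(1, round(frac * len(tokens)))  # remove at least one token
        removed = set(order[:k])
        reduced = [tok for i, tok in enumerate(tokens) if i not in removed]
        drops.append(full_prob - prob_fn(reduced, step))
    return sum(drops) / len(drops)


def sequence_comprehensiveness(
    tokens: Sequence[str],
    step_scores: Sequence[Sequence[float]],
    prob_fn: ProbFn,
    fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> float:
    """Average the per-token AOPC comprehensiveness over all generated tokens."""
    if not step_scores:
        raise ValueError("step_scores must contain one score vector per generated token")
    per_step = [
        aopc_comprehensiveness(tokens, s, t, prob_fn, fractions)
        for t, s in enumerate(step_scores)
    ]
    return sum(per_step) / len(per_step)
```

Here `step_scores[t]` holds the source-token attributions for the t-th generated token. The sketch removes tokens outright and perturbs only the source side; masking instead of deleting, or also perturbing the already generated prefix, are design choices it leaves open.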