neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

multi-lingual evaluation? #44

Open pfliu-nlp opened 2 years ago

neubig commented 2 years ago

As first steps toward working on multilingual evaluation, one should:

  1. Read the tutorials on implementing new tasks, features, and formats.

  2. Get system outputs for different multilingual systems. Here are some potential sources:

    • Translation outputs from the WMT shared tasks. These outputs are often available through the WMT metrics task.
    • Summarization outputs from the XL-Sum dataset. @pfliu-nlp can help provide these.
    • Various analysis tasks from XTREME. These are already imported into ExplainaBoard 1.0, so we can download the data from there. http://explainaboard.nlpedia.ai/leaderboard/xtreme/
  3. Run the ExplainaBoard SDK over these tasks and generate reports (see the CLI sketch at the end of this comment).

  4. Compare the reports across languages. See if we can extract any interesting insights about how the trends in these reports vary across languages (see the report-comparison sketch at the end of this comment).

    • If so, then dig deeper into these insights, or write analysis/visualization code to make these comparisons easier.
    • If not, then we can improve the functionality of the ExplainaBoard SDK so that it extracts the features we need for these comparisons.
  5. More systematically, we might also try correlating performance on particular fine-grained analysis categories with a few factors (see the correlation sketch at the end of this comment):

    • Available training data, for example the size of crawled web corpora such as OSCAR, or of Wikipedia.
    • Linguistic features of the languages, or linguistic similarity between the transfer and test languages, along the lines of the analysis in papers on choosing transfer languages or on NLP performance prediction.
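
For step 3, here is a minimal sketch of looping the ExplainaBoard CLI over per-language system outputs. The flag names (`--task`, `--system_outputs`), the assumption that the JSON report is written to stdout, and the file layout (`system_outputs/xnli.<lang>.tsv`) are all assumptions to check against the installed explainaboard version.

```python
"""Sketch: generate one ExplainaBoard report per language.

Assumptions (verify against your installed explainaboard version):
  * the CLI is invoked as `explainaboard --task <task> --system_outputs <file>`
    and prints a JSON report to stdout;
  * system outputs live in system_outputs/xnli.<lang>.tsv (hypothetical layout).
"""
import subprocess
from pathlib import Path

LANGS = ["en", "de", "sw", "ur"]  # example language set
OUT_DIR = Path("reports")
OUT_DIR.mkdir(exist_ok=True)

for lang in LANGS:
    sys_out = Path("system_outputs") / f"xnli.{lang}.tsv"  # hypothetical path
    # Run the CLI and capture the JSON report it prints to stdout.
    result = subprocess.run(
        ["explainaboard", "--task", "text-classification",
         "--system_outputs", str(sys_out)],
        capture_output=True, text=True, check=True,
    )
    report_path = OUT_DIR / f"xnli.{lang}.json"
    report_path.write_text(result.stdout)
    print(f"wrote {report_path}")
```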
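
For step 4, a report-comparison sketch that lines up bucket-level scores from the generated reports so cross-lingual differences are easy to eyeball. The report schema fields used here (`fine_grained`, `bucket_name`, `value`) and the feature name `sentence_length` are assumptions; adjust them to the reports you actually produce.

```python
"""Sketch: compare fine-grained (bucket-level) results across languages.

The JSON schema of an ExplainaBoard report differs across versions; the field
names below ("fine_grained", "bucket_name", "value") are assumptions.
"""
import json
from pathlib import Path


def load_buckets(report_path: Path) -> dict:
    """Return {feature -> {bucket_name -> score}} from one report (assumed schema)."""
    report = json.loads(report_path.read_text())
    return {
        feature: {b["bucket_name"]: b["value"] for b in buckets}
        for feature, buckets in report.get("fine_grained", {}).items()
    }


# Map each language code (taken from the file name) to its bucketed scores.
reports = {p.stem.split(".")[-1]: load_buckets(p) for p in Path("reports").glob("xnli.*.json")}

feature = "sentence_length"  # hypothetical fine-grained feature name
for lang, buckets in sorted(reports.items()):
    row = ", ".join(f"{name}: {score:.3f}" for name, score in buckets.get(feature, {}).items())
    print(f"{lang:>4} | {row}")
```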
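
For step 5, a correlation sketch using Spearman rank correlation between per-language scores in one analysis bucket and available training-data size. The corpus sizes and scores below are made-up placeholders for illustration, not real OSCAR/Wikipedia statistics.

```python
"""Sketch: correlate per-language bucket scores with training-data availability.

All numbers below are placeholders; in practice `scores` would come from the
reports above and `corpus_size` from OSCAR/Wikipedia statistics per language.
"""
from scipy.stats import spearmanr

corpus_size = {"en": 2.0e9, "de": 8.0e8, "sw": 1.2e7, "ur": 3.0e7}  # tokens (made up)
scores = {"en": 0.91, "de": 0.84, "sw": 0.62, "ur": 0.68}           # bucket accuracy (made up)

langs = sorted(corpus_size)
rho, p = spearmanr([corpus_size[l] for l in langs], [scores[l] for l in langs])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f}) between corpus size and bucket accuracy")
```

The same pattern extends to linguistic-similarity features (e.g., distance between the transfer and test languages), swapping `corpus_size` for whichever per-language feature is being tested.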