This makes two fixes to make the GlobalBench NER model work properly:
1. The language defaults to English, and a warning is logged, when a language isn't properly specified in DataLab.
2. When multiple evaluation metrics with the same name appear in a single system, we default to taking the maximum value. This is the case for NER, where we have example-level and token-level F1, which are actually identical.
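The two fixes could be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the function names (`resolve_language`, `aggregate_metrics`), the `"eng"` language code, and the metric input format are all assumptions.

```python
import logging
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


def resolve_language(dataset_language: Optional[str]) -> str:
    """Fall back to English (hypothetical code "eng") when DataLab
    does not properly specify a language, logging a warning."""
    if not dataset_language:
        logger.warning("Language not specified in DataLab; defaulting to 'eng'.")
        return "eng"
    return dataset_language


def aggregate_metrics(metrics: List[Tuple[str, float]]) -> Dict[str, float]:
    """When the same metric name appears more than once in a system
    (e.g. example-level and token-level F1 for NER), keep the maximum."""
    best: Dict[str, float] = {}
    for name, value in metrics:
        best[name] = max(value, best.get(name, float("-inf")))
    return best
```

For identical duplicates (as with the two F1 variants for NER), taking the maximum is equivalent to simply deduplicating, but it also behaves sensibly if the values ever diverge slightly.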