Closed: pfliu-nlp closed this issue 2 years ago
@pfliu-nlp @neubig It is better to avoid any task-specific selection from `get_default_tokenizer`: it should return a task-agnostic, language-specific tokenizer, so that unnecessary responsibility is not pushed onto the tasks. I think the `Processor` is the one that should know which tokenizer is applicable.
```python
class FooProcessor(Processor):
    def get_overall_stats(self):
        ...
        tokenizer = self.get_tokenizer(lang)

    def get_tokenizer(self, lang):
        if lang == xx:
            # The task-specific choice lives in the Processor.
            return XXTokenizer()
        else:
            # Otherwise fall back to the task-agnostic default.
            return tokenizer.get_default_tokenizer(lang)
```
Hi @neubig, great.
Regarding `is_chinese_lang_code`: do we need some code like

```python
if is_chinese_lang_code(lang):
    return SacreBleuTokenizer(variety="zh")
```

It seems a little strange here.
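For concreteness, a minimal sketch of what such a helper could look like. This is hypothetical, not the actual ExplainaBoard implementation; the set of codes is an assumption based on the members of the ISO 639-3 "zho" macrolanguage:

```python
# Hypothetical language-code set; assumed from the ISO 639-3 "zho"
# macrolanguage, not taken from the actual codebase.
CHINESE_MACRO_FAMILY = {
    "zh",   # Chinese (macrolanguage)
    "cmn",  # Mandarin
    "yue",  # Cantonese
    "wuu",  # Wu
    "cdo",  # Min Dong
    "nan",  # Min Nan
    "hak",  # Hakka
}


def is_chinese_lang_code(lang: str) -> bool:
    """Return True if `lang` names a member of the Chinese macrolanguage."""
    # Normalize tags like "zh-CN" or "zh_Hans" down to the primary subtag.
    primary = lang.lower().replace("_", "-").split("-")[0]
    return primary in CHINESE_MACRO_FAMILY
```

With this shape the check stays a pure language-code lookup, so the oddness is only about where the `if` lives, not about how the check works.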
@odashi
"It is better to avoid any task-specific selection form"
This is a good point.
do we need some code like
Yes, that's what I was thinking.
"It is better to avoid any task-specific selection form"
Yes, I agree with this too, let's do it that way!
@neubig Do you assume that all languages in `CHINESE_MACRO_FAMILY` could be tokenized by `SacreBleuTokenizer(variety="zh")`? It seems that the SacreBLEU toolkit doesn't support this.
I guess it's a question of whether the SacreBLEU Chinese tokenizer or the SacreBLEU general-purpose tokenizer is better for segmenting Min Dong Chinese or Wu Chinese. Probably the SacreBLEU Chinese tokenizer, but I'm not a fluent Chinese speaker so I may be wrong :)
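For intuition on why the Chinese tokenizer is the likely better fit: as I understand it, SacreBLEU's `zh` tokenizer essentially splits each CJK character into its own token and falls back to whitespace splitting elsewhere, which also works for varieties written in Han characters. A rough stdlib-only approximation of that behavior (not SacreBLEU's actual code, which additionally handles CJK extension blocks and punctuation):

```python
def segment_cjk(text: str) -> list[str]:
    """Rough sketch: emit each CJK character as its own token,
    and whitespace-split everything else."""
    tokens: list[str] = []
    buf: list[str] = []  # accumulates non-CJK characters
    for ch in text:
        # Basic CJK Unified Ideographs block only; a simplification.
        if "\u4e00" <= ch <= "\u9fff":
            if buf:
                tokens.extend("".join(buf).split())
                buf = []
            tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.extend("".join(buf).split())
    return tokens
```

Since this segmentation keys on the script rather than the particular Chinese variety, it would treat Wu or Min Dong text the same way as Mandarin.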
Thank you all for the comments!
Currently, the tokenizer for Chinese on other tasks (such as aspect-based sentiment classification) hasn't been turned on. This is just one way to fix this issue, feel free to make other suggestions.
Relevant issue: https://github.com/neulab/explainaboard_web/issues/298