Closed: pfliu-nlp closed this issue 2 years ago
@pfliu-nlp @neubig It is better to avoid any task-specific selection from `get_default_tokenizer`: it should return a task-agnostic, language-specific tokenizer, so that unnecessary responsibility is not pushed onto the tasks. I think the `Processor` is the one that should know which tokenizer is applicable.
```python
class FooProcessor(Processor):
    def get_overall_stats(self):
        ...
        tokenizer = self.get_tokenizer(lang)

    def get_tokenizer(self, lang):
        if lang == xx:
            # The task-specific choice lives in the Processor.
            return XXTokenizer()
        else:
            # Otherwise fall back to the task-agnostic default.
            return tokenizer.get_default_tokenizer(lang)
```
Hi @neubig, great.
Regarding `is_chinese_lang_code`: do we need some code like

```python
if is_chinese_lang_code(lang):
    return SacreBleuTokenizer(variety="zh")
```

It seems a little strange here.
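For concreteness, a minimal sketch of what such a helper could look like. This is hypothetical, not the actual ExplainaBoard implementation; the set of codes is an assumption based on the members of the ISO 639-3 "zho" macrolanguage:

```python
# Hypothetical language-code set; assumed from the ISO 639-3 "zho"
# macrolanguage, not taken from the actual codebase.
CHINESE_MACRO_FAMILY = {
    "zh",   # Chinese (macrolanguage)
    "cmn",  # Mandarin
    "yue",  # Cantonese
    "wuu",  # Wu
    "cdo",  # Min Dong
    "nan",  # Min Nan
    "hak",  # Hakka
}


def is_chinese_lang_code(lang: str) -> bool:
    """Return True if `lang` names a member of the Chinese macrolanguage."""
    # Normalize tags like "zh-CN" or "zh_Hans" down to the primary subtag.
    primary = lang.lower().replace("_", "-").split("-")[0]
    return primary in CHINESE_MACRO_FAMILY
```

With this shape the check stays a pure language-code lookup, so the oddness is only about where the `if` lives, not about how the check works.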
@odashi
"It is better to avoid any task-specific selection form"
This is a good point.
do we need some code like
Yes, that's what I was thinking.
"It is better to avoid any task-specific selection form"
Yes, I agree with this too, let's do it that way!
@neubig Do you assume that all languages in `CHINESE_MACRO_FAMILY` could be tokenized by `SacreBleuTokenizer(variety="zh")`? It seems that the SacreBLEU toolkit doesn't support this.
I guess it's a question of whether the SacreBLEU Chinese tokenizer or the SacreBLEU general-purpose tokenizer is better for segmenting Min Dong Chinese or Wu Chinese. Probably the SacreBLEU Chinese tokenizer, but I'm not a fluent Chinese speaker so I may be wrong :)
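For intuition on why the Chinese tokenizer is the likely better fit: as I understand it, SacreBLEU's `zh` tokenizer essentially splits each CJK character into its own token and falls back to whitespace splitting elsewhere, which also works for varieties written in Han characters. A rough stdlib-only approximation of that behavior (not SacreBLEU's actual code, which additionally handles CJK extension blocks and punctuation):

```python
def segment_cjk(text: str) -> list[str]:
    """Rough sketch: emit each CJK character as its own token,
    and whitespace-split everything else."""
    tokens: list[str] = []
    buf: list[str] = []  # accumulates non-CJK characters
    for ch in text:
        # Basic CJK Unified Ideographs block only; a simplification.
        if "\u4e00" <= ch <= "\u9fff":
            if buf:
                tokens.extend("".join(buf).split())
                buf = []
            tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.extend("".join(buf).split())
    return tokens
```

Since this segmentation keys on the script rather than the particular Chinese variety, it would treat Wu or Min Dong text the same way as Mandarin.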
Thank you all for the comments!
Currently, the tokenizer for Chinese on other tasks (such as aspect-based sentiment classification) hasn't been turned on. This is just one way to fix this issue, feel free to make other suggestions.
Relevant issue: https://github.com/neulab/explainaboard_web/issues/298