ufal / conll2017

CoNLL 2017 Shared Task Proposal: UD End-to-End parsing

Chinese word segmentation #13

Closed dan-zeman closed 8 years ago

dan-zeman commented 8 years ago

Just to clarify (I am not going to put it in the proposal but we will have to decide it later):

Are we going to require that people do word segmentation in Chinese (and Japanese, Thai, etc., if these languages are added to UD)? It would be in line with our End-to-End philosophy, but it is obviously harder than learning that "aux" = "à les".

UDPipe is probably not going to help here, right, @foxik? I think the Chinese data has neither SpaceAfter=No annotations nor multi-word tokens, so UDPipe would need an option to treat every sentence as one huge multi-word token. But even then I suspect the accuracy will not be great unless it does something Chinese-specific.
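
For reference, both properties are straightforward to check on a CoNLL-U file: SpaceAfter=No sits in the MISC column (column 10), and multi-word tokens are the lines whose ID is a range such as `3-4`. A minimal sketch, with a hypothetical file name:

```python
# Rough check of segmentation markers in a CoNLL-U file: SpaceAfter=No lives in
# the MISC column (column 10); multi-word tokens have a range ID like "3-4".
def count_segmentation_markers(path):
    space_after_no = mwt_ranges = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.split("\t")
            if "-" in cols[0]:              # ID range => multi-word token
                mwt_ranges += 1
            if "SpaceAfter=No" in cols[9]:  # MISC column
                space_after_no += 1
    return space_after_no, mwt_ranges

# Hypothetical file name; any UD .conllu treebank works.
print(count_segmentation_markers("zh-ud-train.conllu"))
```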

foxik commented 8 years ago

Personally, I would require word segmentation for Chinese (and similar languages) too.

My idea was to train the Chinese UDPipe tokenizer as if there were SpaceAfter=No on every token (there will be a switch for this in UDPipe), and of course to use a much higher character embedding dimension. I can tell you the resulting numbers "soon" (within a month).
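
Until that switch exists, the idea can be approximated by rewriting the training data so that every regular token carries SpaceAfter=No in its MISC column. A minimal sketch of such a filter (not the actual UDPipe option; it just rewrites CoNLL-U on stdin/stdout):

```python
import sys

# Mark every regular token line with SpaceAfter=No, so a tokenizer trained on the
# result behaves as if no whitespace separated the tokens.
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        print(line)                                  # keep blank lines and comments
        continue
    cols = line.split("\t")
    if "-" not in cols[0] and "." not in cols[0]:    # skip MWT ranges and empty nodes
        misc = cols[9]
        if "SpaceAfter=No" not in misc:
            cols[9] = "SpaceAfter=No" if misc == "_" else misc + "|SpaceAfter=No"
    print("\t".join(cols))
```

Used e.g. as `python add_space_after_no.py < zh-ud-train.conllu > zh-ud-train.nospace.conllu` (script and file names are illustrative only).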

dan-zeman commented 8 years ago

OK, sounds good. I agree that not requiring it would be a step back from the end-to-end approach.