vncorenlp / VnCoreNLP

A Vietnamese natural language processing toolkit (NAACL 2018)

Tone marks are changed in Vietnamese #39

Closed tdt98 closed 3 years ago

tdt98 commented 3 years ago

Dear @datquocnguyen, thank you for sharing your great work. I just observed an abnormality in VnCoreNLP's word segmentation. With the input "Hòa", I received "Hoà", which means the "`" tone mark is shifted by one character. Could you please fix this problem or provide a solution in the future? Thank you in advance.

datquocnguyen commented 3 years ago

Hi, this is not an abnormality, i.e. it's not an issue or problem to be fixed. We use a normalization step to handle the varying outputs of different typing methods on different OSs. See https://github.com/vncorenlp/VnCoreNLP/blob/687822d3b40dc9002d7205b9067c6817fa40ed34/src/main/java/vn/corenlp/wordsegmenter/Utils.java#L105 Regarding your question, you can simply write a short post-processing script to reverse that normalization step on VnCoreNLP's output.
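
A post-processing script of the kind suggested above might look like the following minimal Python sketch. It is not taken from VnCoreNLP: the function name `restore_old_style` and the pair table `_NEW_TO_OLD` are hypothetical, and the table only assumes that the normalization moves the tone mark in the vowel pairs oa, oe and uy (as in "Hòa" becoming "Hoà"); the authoritative list is in the linked Utils.java. The sketch only rewrites a pair that ends a syllable and is not preceded by "q", so syllables such as "hoàn" and "quý" are left untouched.

```python
import re
import unicodedata

# ASSUMPTION: illustrative pair table, not copied from Utils.java.
# Maps new-style placement (tone on the second vowel) back to
# old-style placement (tone on the first vowel).
_PAIRS = {
    "oà": "òa", "oá": "óa", "oả": "ỏa", "oã": "õa", "oạ": "ọa",
    "oè": "òe", "oé": "óe", "oẻ": "ỏe", "oẽ": "õe", "oẹ": "ọe",
    "uỳ": "ùy", "uý": "úy", "uỷ": "ủy", "uỹ": "ũy", "uỵ": "ụy",
}
# Store everything in precomposed (NFC) form so lookups are reliable.
_NEW_TO_OLD = {
    unicodedata.normalize("NFC", new): unicodedata.normalize("NFC", old)
    for new, old in _PAIRS.items()
}

# A run of letters only; the "_" that joins syllables in segmented
# output is excluded, so each syllable is handled separately.
_SYLLABLE = re.compile(r"[^\W\d_]+")


def _restore_syllable(match):
    syllable = match.group(0)
    head, tail = syllable[:-2], syllable[-2:]
    old = _NEW_TO_OLD.get(tail.lower())
    # Skip unknown endings, and skip "qu..." where "u" belongs to the onset.
    if old is None or head.lower().endswith("q"):
        return syllable
    # Carry over the original upper/lower case, character by character.
    restored = "".join(o.upper() if t.isupper() else o for o, t in zip(old, tail))
    return head + restored


def restore_old_style(text):
    """Move tone marks in syllable-final oa/oe/uy back to the first vowel."""
    text = unicodedata.normalize("NFC", text)
    return _SYLLABLE.sub(_restore_syllable, text)


if __name__ == "__main__":
    print(restore_old_style("Hoà_bình"))
```

Note that such a reversal always produces the old-style placement, so it cannot recover the original spelling if the input text already used the new style.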

datquocnguyen commented 2 years ago

Normalizing Vietnamese tone-mark placement to the old typing style: https://gist.github.com/nguyenvanhieuvn/72ccf3ddf7d179b281fdae6c0b84942b