Some VN words were not parsed correctly

pomy66 commented 6 years ago

Dear Mr. PhuongLH,

Thank you for sharing this great resource for Vietnamese language NLP. I tested running the VITK with the test data below:

Quan điểm của Bộ Công Thương là nếu để làm thủy điện thì đây là dự án nhỏ nhằm tận dụng tài nguyên nước và được Chính phủ cho phép làm thì Bộ Công Thương ủng hộ, không để lãng phí nguồn nước
Trước đó, ngày 5/5, Vụ trưởng Vụ Giám sát và Thẩm định đầu tư cũng khẳng định, dự án giao thông thủy xuyên Á trên sông Hồng kết hợp thủy điện mới ở mức sơ khai, ý tưởng đề xuất.
Ông Nguyễn Xuân Tự
Nguyễn Xuân Tự
Theo đề xuất của chủ đầu tư, mục tiêu của dự án là sẽ mở ra một tuyến vận tải thông suốt trên sông Hồng.

The results were as follows:

Quan_điểm của Bộ Công_Thương la ̀ nếu để làm thuỷ_điện thì đây là dự_án nhỏ nhă ̀ m tận_dụng tài_nguyên nước và được Chính_phủ cho phép làm thì Bộ Công_Thương ủng_hộ , không để lãng_phí nguồn nước
Trước đó , nga ̀ y 5/5 , Vụ_trưởng Vụ_Giám_sát và Thẩm_định đầu_tư cu ̃ ng khă ̉ ng đi ̣ nh , dự_án giao_thông thuỷ xuyên Á trên sông Hồng kết_hợp thuỷ_điện mới ở mức sơ_khai , ý_tưởng đề_xuất .
Ông_Nguyễn_Xuân_Tự
Nguyễn_Xuân_Tự
Theo đê ̀ xuâ ́ t cu ̉ a chu ̉ đâ ̀ u tư , mu ̣ c tiêu của dự_án là sẽ mở ra một tuyến vận_tải thông_suốt trên sông Hồng .

Most of the text was parsed impressively; however, some words were parsed in a strange manner, e.g.:

- ngày 5/5 --> nga ̀ y 5/5
- khẳng định --> khă ̉ ng đi ̣ nh
- đề xuất --> đê ̀ xuâ ́ t
- mục tiêu --> mu ̣ c tiêu

Kindly advise whether I have misconfigured anything, or whether I need to perform any further steps before running the toolkit in order to improve the output of the program.

Thanks again for your kind support.

Regards, Pomy66

phuonglh commented 6 years ago

Hi Pomy66,

I think that the problem comes from the wrong encoding of the input text. Please make sure that your raw text input is in UTF-8 encoding (pre-composed form). You may need to normalize your input first.

Cheers,

Phuong

pomy66 commented 6 years ago

Dear Mr. Phuong,

Thanks for your prompt response.

I'm quite sure that the file was saved in UTF-8 format. I have attached my input file (news1a.txt) for you to check.

Could you please elaborate on what you mean by "normalize your input"?

Regards, Pomy66

news1a.txt

phuonglh commented 6 years ago

Please have a look at the section Unicode Normalization Forms of this page:

http://vietunicode.sourceforge.net/main.html

In particular:

"Precomposed characters are easier to handle and look better on displays and in print. They should be preferred over combining character sequences where available. NFC is the preferred way of encoding text in Unicode under Linux. The W3C Character Model for the World Wide Web also uses NFC for XML and related standards.

In computer programming context, the string length function in many modern programming languages, such as Java or C#, can return an unexpected number of characters for non-NFC strings. For instance, the length function operation on "ệ" returns 2 (if "ê"+"." or "ẹ"+"^") or 3 (if "e"+"^"+"." or "e"+"."+"^", being fully decomposed), which is correct, dependent of the case, but does not look consistent with the appearance of the string. When the string "ệ" is in NFC format, the length operation would consistently resolve to 1."
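The length behaviour described in the quote can be reproduced with a short Python sketch. This is only an illustration: Python's len counts code points, and the variable names below are made up for the example.

```python
import unicodedata

# Three encodings of the same visible character "ệ".
precomposed = "\u1ec7"       # single precomposed code point
partly = "\u00ea\u0323"      # "ê" followed by a combining dot below
fully = "e\u0323\u0302"      # "e" + combining dot below + combining circumflex

# The three strings look the same but have different lengths.
lengths = [len(s) for s in (precomposed, partly, fully)]

# After NFC normalization all three collapse to the same one-character string.
all_nfc = {unicodedata.normalize("NFC", s) for s in (precomposed, partly, fully)}

print(lengths)       # [1, 2, 3]
print(len(all_nfc))  # 1
```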

I suspect that although the text is in UTF-8, it contains non-NFC strings, therefore the accents are processed incorrectly.
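One possible way to check for and repair non-NFC text before passing it to the tokenizer is sketched below, using Python's standard unicodedata module. The helper names (is_nfc, to_nfc, normalize_file) and the idea of writing to a separate output file are assumptions made for this example; they are not part of VITK.

```python
import unicodedata

def is_nfc(text):
    """Return True if the text is already in pre-composed (NFC) form."""
    return unicodedata.normalize("NFC", text) == text

def to_nfc(text):
    """Compose base letters and combining marks into pre-composed characters."""
    return unicodedata.normalize("NFC", text)

def normalize_file(src_path, dst_path):
    """Read a UTF-8 file, convert its content to NFC, and write it as UTF-8."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_nfc(text))

# "ngày" written with a separate combining grave accent (U+0300),
# similar to the form that produced "nga ̀ y" in the output above.
decomposed = "nga\u0300y"
composed = to_nfc(decomposed)

print(is_nfc(decomposed), is_nfc(composed))  # False True
```

Running something like normalize_file on the raw input and feeding the resulting file to the tokenizer should avoid the split accents, assuming the decomposed form is indeed the cause.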

pomy66 commented 6 years ago

Thanks! The missing pre-composed (NFC) form was definitely causing the problem. I tested with an article from another source (vnexpress.net) and it seems to work just fine!

However, I encountered another problem with a long word, e.g.

Sở Kế hoạch và Đầu tư TP HCM --> Sở Kế_hoạch và Đầu_tư TP HCM

whereas we would expect

Sở_Kế_hoạch_và_Đầu_tư_TP_HCM or Sở_Kế_hoạch_và_Đầu_tư TP_HCM

Could you please advise whether there is any place where we can influence this behavior of the program?

Regards, Pomy66

news4.txt

part-00000.txt

phuonglh commented 6 years ago

The word tokenizer does not recognize named entities; it outputs only separate words.

If you want to capture named entities of multiple words, you may want to try NER:

VitkNER: https://github.com/phuonglh/ai.vitk.ner
Or, for a nice demo: http://nnvlp.org/

pomy66 commented 6 years ago

Dear Mr. Phuong,

The demo is truly impressive. Thanks for your good advice regarding NER. I am also curious about the tokenization of the following phrase.

Ông Phạm Mạnh Thắng - Phó tổng giám đốc Vietcombank

The tokenizer returned:

Ông Phạm_Mạnh_Thắng - Phó_tổng giám_đốc Vietcombank

but Chunking & NER returned:

[NP Ông Phạm_Mạnh_Thắng] - [NP Phó_tổng_giám_đốc Vietcombank]
Ông [PER Phạm_Mạnh_Thắng] - [ORG Phó_tổng_giám_đốc Vietcombank]

Does this mean that I need to run the NER first to extract those named entities, and then re-run the tokenizer to build TF-IDF for analyzing the article? (This assumes that I need the program to recognize "Phó_tổng_giám_đốc" as a single word, not as "Phó_tổng" and "giám_đốc".)

Regards, Pomy66
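One possible approach, not confirmed anywhere in this thread, is to take the word units directly from the chunking/NER output instead of re-running the tokenizer, since that output already keeps "Phó_tổng_giám_đốc" together. The sketch below assumes the bracketed format shown above ("[TAG word word]"); the function name and the choice to drop the "-" separator are hypothetical.

```python
import re
from collections import Counter

# Matches an opening marker such as "[PER " or "[ORG ", or a closing "]".
MARKER = re.compile(r"\[[A-Z]+ |\]")

def tokens_from_tagged(line):
    """Strip the bracket markers and return the whitespace-separated words."""
    return MARKER.sub(" ", line).split()

tagged = "Ông [PER Phạm_Mạnh_Thắng] - [ORG Phó_tổng_giám_đốc Vietcombank]"
tokens = tokens_from_tagged(tagged)

# Term counts for this line, ignoring the "-" separator; counts like these
# could serve as the term-frequency part of a TF-IDF computation.
counts = Counter(t for t in tokens if t != "-")
print(tokens)
```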

phuonglh / vn.vitk

Some VN words were not parsed correctly #21