Closed by pomy66 6 years ago
Hi Pomy66,
I think that the problem comes from the wrong encoding of the input text. Please make sure that your raw text input is in UTF-8 encoding (pre-composed form). You may need to normalize your input first.
Cheers,
Phuong
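For readers wondering what "normalize your input" looks like in practice, here is a minimal sketch using Python's standard `unicodedata` module. It is not part of the toolkit; the helper names `to_nfc` and `normalize_file` are made up for illustration, and it assumes the input file really is UTF-8.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in NFC (pre-composed) form."""
    return unicodedata.normalize("NFC", text)


def normalize_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file and write an NFC-normalized UTF-8 copy.

    Hypothetical helper: the paths are whatever the caller supplies.
    """
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_nfc(text))
```

Running `normalize_file` on the raw input and feeding the normalized copy to the tokenizer would be one way to rule out the encoding form as the cause.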
Dear Mr. Phuong,
Thanks for your prompt response.
I'm quite sure that the file was saved in UTF-8 format. I have attached my input file (news1a.txt) for you to check.
Could you please elaborate on what you mean by "normalize your input"?
Regards, Pomy66
Please have a look at the section Unicode Normalization Forms of this page:
http://vietunicode.sourceforge.net/main.html
In particular:
"Precomposed characters are easier to handle and look better on displays and in print. They should be preferred over combining character sequences where available. NFC is the preferred way of encoding text in Unicode under Linux. The W3C Character Model for the World Wide Web also uses NFC for XML and related standards.
In computer programming context, the string length function in many modern programming languages, such as Java or C#, can return an unexpected number of characters for non-NFC strings. For instance, the length function operation on "ệ" returns 2 (if "ê"+"." or "ẹ"+"^") or 3 (if "e"+"^"+"." or "e"+"."+"^", being fully decomposed), which is correct, dependent of the case, but does not look consistent with the appearance of the string. When the string "ệ" is in NFC format, the length operation would consistently resolve to 1."
I suspect that although the text is in UTF-8, it contains non-NFC strings, therefore the accents are processed incorrectly.
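The length behavior described in the quote can be reproduced with a short Python sketch (Python is used here only for illustration; the quote talks about Java and C#). The `is_nfc` helper is a made-up name, and the four strings are the encodings of "ệ" that the quote lists.

```python
import unicodedata

# Four ways to encode "ệ"; all of them render identically.
FORMS = {
    "precomposed (NFC)": "\u1ec7",       # ệ as one code point
    "ê + dot below": "\u00ea\u0323",     # ê plus a combining dot below
    "ẹ + circumflex": "\u1eb9\u0302",    # ẹ plus a combining circumflex
    "fully decomposed": "e\u0323\u0302", # e plus two combining marks
}


def is_nfc(text: str) -> bool:
    """True if the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


for name, s in FORMS.items():
    nfc = unicodedata.normalize("NFC", s)
    # Print the raw length, whether the string is NFC, and its length after NFC.
    print(f"{name}: len={len(s)}, is_nfc={is_nfc(s)}, len_after_nfc={len(nfc)}")
```

A check like `is_nfc` applied to the input text is one way to test the suspicion above before running the tokenizer.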
Thanks! The text that was not in pre-composed (NFC) form was definitely causing the problem. I tested with an article from another source (vnexpress.net) and it seems to work just fine!
However, I encountered another problem with a long phrase, e.g.
Sở Kế hoạch và Đầu tư TP HCM --> Sở Kế_hoạch và Đầu_tư TP HCM
whereas we would expect
Sở_Kế_hoạch_và_Đầu_tư_TP_HCM or Sở_Kế_hoạch_và_Đầu_tư TP_HCM
Could you please advise whether there is any place where we can influence this behavior of the program?
Regards, Pomy66
The word tokenizer does not recognize named entities. It outputs only separate words.
If you want to capture named entities of multiple words, you may want to try NER:
Dear Mr. Phuong,
The demo is truly impressive. Thanks for your good advice regarding NER. I am also curious about the tokenization of the following phrase:
Ông Phạm Mạnh Thắng - Phó tổng giám đốc Vietcombank
The tokenizer returned:
Ông Phạm_Mạnh_Thắng - Phó_tổng giám_đốc Vietcombank
but Chunking & NER returned
[NP Ông Phạm_Mạnh_Thắng] - [NP Phó_tổng_giám_đốc Vietcombank]
Ông [PER Phạm_Mạnh_Thắng] - [ORG Phó_tổng_giám_đốc Vietcombank]
Does this mean that I need to run NER first to exclude those named entities, and then rerun the tokenizer to build the TF-IDF used to analyze the article? (I need the program to recognize "Phó_tổng_giám_đốc" as a single word, not as "Phó_tổng" and "giám_đốc".)
Regards, Pomy66
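One possible approach, sketched here under assumptions rather than as the toolkit's own method, is to post-process the bracketed NER output quoted above instead of re-running the tokenizer. The function `merge_entities` is a made-up name, the bracket format `[LABEL tokens...]` is copied from the output in this thread, and the default label set (`PER`, `ORG`, `LOC`) is an assumption: only `PER` and `ORG` actually appear here.

```python
import re
from typing import List, Sequence

# Matches one bracketed span such as "[ORG Phó_tổng_giám_đốc Vietcombank]".
SPAN = re.compile(r"\[(\w+) ([^\]]+)\]")


def merge_entities(tagged: str,
                   labels: Sequence[str] = ("PER", "ORG", "LOC")) -> List[str]:
    """Turn bracketed output into a flat token list.

    Spans whose label is in `labels` are joined into a single
    underscore-connected token; other spans keep their tokens as they are.
    """
    def replace(match: "re.Match[str]") -> str:
        label, body = match.group(1), match.group(2)
        if label in labels:
            return "_".join(body.split())
        return body

    return SPAN.sub(replace, tagged).split()


tokens = merge_entities(
    "Ông [PER Phạm_Mạnh_Thắng] - [ORG Phó_tổng_giám_đốc Vietcombank]"
)
print(tokens)
```

Note that this joins the whole `ORG` span into one token, so "Phó_tổng_giám_đốc" and "Vietcombank" end up together. If the title should stay a separate term for TF-IDF, the tokens inside the chunker's brackets can be used as they are by passing an empty `labels` tuple.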
Dear Mr. PhuongLH,
Thank you for sharing this great resource for Vietnamese NLP. I tested VITK with the test data below:
The results were as follows:
Most of the text was parsed impressively; however, some parts were parsed in a strange manner, e.g.
Kindly advise whether I have misconfigured anything, or whether I need to perform any further steps before running the toolkit in order to improve its output.
Thanks again for your kind support.
Regards, Pomy66