Text normalization module

rain1024 commented 2 years ago

Build Vietnamese text normalization module

🔖 Technical Report in Overleaf - WIP (Feel free to comment and contribute 👍)

1.3.5a4

[x] Research more about rules for Punctuation standardization

1.3.5a3 (2022/08/12)

[x] Build and update underthesea API
[x] Update README.md
[x] Benchmark with other tools
[x] Write technical report
[x] Release underthesea version 1.3.5a3
[x] Update colab document

rain1024 commented 2 years ago

:bookmark: I'm writing the technical report of this experiment on Overleaf. Please feel free to comment.

taidnguyen commented 2 years ago

This is a very interesting and useful module!

I just have a few recommendations for the report:

Line 41: Text Normalization Tasks

Line 26: Wikipedia, the Vietnamese alphabet

Line 40: I think it would be nice to mention some diacritic rules from the Notes section:

Then discuss why this task is necessary, such as: "Today, many Vietnamese NLP models are trained using online news, social media, and spoken texts. These noisy data sources frequently contain misspelled texts that make sense to the reader but violate the official dictionary rules. Even in higher-quality sources, such as official documents, there are diacritic ambiguities that are difficult to resolve. Specifically, Nguyen et al. (2019) describe a class of non-standard words (NSWs) that do not necessarily follow official rules. This includes loan words from other languages, abbreviations, measurements, and compound words. Misspellings and inconsistent spellings of the same word have a negative effect on data quality and hence on the performance of many downstream tasks. In the pre-training stage, text normalization is important to obtain better results."

Line 44: Describe Character normalization and examples

Line 46: Good examples

Line 62: Wikipedia and period at the end

Line 69: shows our process

Line 100: Where is the Underthesea benchmark?

Line 115: Already mentioned in Figure 1 caption

Line 117: Consider a paragraph/section to discuss the dataset, since this is a contribution to Vietnamese NLP.

Line 122: Period

Figures: Table 3-5 are not discussed.

Conclusion: From my understanding, the module is purely rule-based? It fixes misspelled text that exists in the data but does not disambiguate or expand any text (e.g., TP -> thành phố). This could be something to discuss.
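To make the distinction concrete, here is a minimal sketch of the two behaviours. It is not the underthesea implementation: the rule table, the abbreviation table, and the function names are all hypothetical, chosen only to illustrate the difference between fixing spelling with rules and expanding abbreviations.

```python
import unicodedata


def nfc(text):
    """Compose characters so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical rule table (assumption): tokens with a misplaced tone mark
# mapped to their standard spelling. Keys and values are stored in NFC form.
TOKEN_RULES = {nfc(k): nfc(v) for k, v in {
    "baỏ": "bảo",
    "lựơng": "lượng",
    "nghịêm": "nghiệm",
}.items()}

# Hypothetical abbreviation table (assumption). Expansion is a separate step
# from spelling normalization and is off by default in this sketch.
ABBREVIATIONS = {nfc(k): nfc(v) for k, v in {
    "TP": "thành phố",
}.items()}


def normalize(text, expand=False):
    """Apply character normalization, then token rules, then optional expansion."""
    tokens = []
    for token in text.split():
        token = nfc(token)                      # character normalization
        token = TOKEN_RULES.get(token, token)   # rule-based spelling fix
        if expand:
            token = ABBREVIATIONS.get(token, token)  # abbreviation expansion
        tokens.append(token)
    return " ".join(tokens)
```

With `expand=False`, a token such as `TP` passes through unchanged, which matches the behaviour described above. With `expand=True`, it is replaced by the dictionary entry; note that a plain lookup like this cannot disambiguate abbreviations that have more than one reading.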

Great work! Let me know how I can help.

rain1024 commented 2 years ago

@taidnguyen Thanks for your kind comments.

I found that we can work together directly on Overleaf.

Please follow the link https://www.overleaf.com/1939692517xvrzhqqqthhd and put your comments there.

Thanks,

taidnguyen commented 2 years ago

@rain1024 Done!

rain1024 commented 2 years ago

@taidnguyen I just edited the paper following your suggestions. Please resolve the comments that you think are done.

undertheseanlp / underthesea

Text normalization module #534