Closed rain1024 closed 1 year ago
:bookmark: I'm writing the technical report for this experiment on Overleaf. Please feel free to comment.
This is a very interesting and useful module!
I just have a few recommendations for the report:
Line 41: Text Normalization Tasks
Line 26: Wikipedia, the Vietnamese alphabet
Line 40: I think it would be nice to mention some diacritic rules from the Notes section:
Then discuss why this task is necessary, such as: "Today, many Vietnamese NLP models are trained on online news, social media, and spoken texts. These noisy data sources frequently contain misspelled text that makes sense to the reader but violates the official dictionary rules. Even in higher-quality sources, such as official documents, there exist diacritic ambiguities that are difficult to resolve. Specifically, Nguyen et al. (2019) describe a class of non-standard words (NSWs) that do not necessarily follow official rules. This class includes loan words from other languages, abbreviations, measurements, and compound words. Misspellings and inconsistent spellings of the same word have a negative effect on data quality and hence on the performance of many downstream tasks. In the pre-training stage, text normalization is important for obtaining better results."
Line 44: Describe Character normalization and examples
Line 46: Good examples
Line 62: Wikipedia and period at the end
Line 69: shows our process
Line 100: Where is the Underthesea benchmark?
Line 115: Already mentioned in Figure 1 caption
Line 117: Consider a paragraph/section to discuss the dataset, since this is a contribution to Vietnamese NLP.
Line 122: Period
Figures: Tables 3-5 are not discussed.
Conclusion: From my understanding, the module is purely rule-based? It fixes misspelled text that already exists in the data but does not disambiguate or expand any text (e.g., TP -> thành phố). This could be something to discuss.
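To make the distinction in the Conclusion note concrete, here is a minimal sketch of the two kinds of step. It is not the module's actual code: `normalize_characters`, `expand_abbreviations`, and the one-entry `ABBREVIATIONS` table are hypothetical names introduced only for illustration, and Unicode NFC composition stands in for whatever character rules the module really applies.

```python
import re
import unicodedata

# Hypothetical lookup table, for illustration only (not taken from the module).
ABBREVIATIONS = {"TP": "thành phố"}


def normalize_characters(text: str) -> str:
    """Rule-based step: compose base letters and combining diacritics
    into precomposed code points (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)


def expand_abbreviations(text: str, table=ABBREVIATIONS) -> str:
    """Expansion step: replace whole-word abbreviations by table lookup.
    This is context-free, so it cannot disambiguate an abbreviation
    that has more than one possible reading."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


if __name__ == "__main__":
    decomposed = "tha\u0300nh"  # "a" followed by a combining grave accent
    print(normalize_characters(decomposed) == "th\u00e0nh")
    print(expand_abbreviations("TP Hà Nội"))
```

The first function changes only how characters are encoded, while the second changes the words themselves. A plain lookup like this would expand every standalone "TP" the same way, which is why disambiguation would probably need context beyond fixed rules.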
Great work! Let me know how I can help.
@taidnguyen Thanks for your kind comments.
I found that we can work together directly on Overleaf.
Please follow the link https://www.overleaf.com/1939692517xvrzhqqqthhd and put your comments there.
Thanks,
@rain1024 Done!
@taidnguyen I just edited the paper following your suggestions. Please resolve the comments that you think are done.
Build Vietnamese text normalization module
1.3.5a4
1.3.5a3 (2022/08/12)