Much work has been done on gender prediction for English names using probability models or traditional machine learning methods. Unlike English and other alphabetic languages, Chinese characters are logosyllabic. Previous approaches work well for Indo-European languages in general and English in particular; however, their performance deteriorates on Asian languages such as Chinese, Japanese, and Korean. In this work, we focus on Simplified Chinese characters and present a novel approach that incorporates phonetic information (Pinyin) to enhance Chinese word embeddings trained with a BERT model. We compare our method with several previous methods, namely Naive Bayes, GBDT, and Random Forest, using fastText word embeddings as features. Quantitative and qualitative experiments demonstrate the superiority of our model, which achieves 93.45\% test accuracy. In addition, we have released two large-scale gender-labeled datasets (one with over one million first names and the other with over six million full names) used in this study for the community.
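As a rough illustration of the Naive Bayes baseline mentioned above (not the released code, which uses richer features such as fastText embeddings), a minimal character-level Naive Bayes gender classifier could look like the sketch below; the toy names and labels are invented for the example.

```python
from collections import Counter, defaultdict
import math

def featurize(name):
    # Character unigrams as features; a real system would use richer
    # features (e.g. Pinyin, character n-grams, or embeddings).
    return list(name)

def train_nb(data):
    # data: list of (name, gender) pairs.
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for name, gender in data:
        class_counts[gender] += 1
        for f in featurize(name):
            feat_counts[gender][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab

def predict(model, name):
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for gender in class_counts:
        # Log prior plus Laplace-smoothed log likelihoods.
        lp = math.log(class_counts[gender] / total)
        denom = sum(feat_counts[gender].values()) + len(vocab)
        for f in featurize(name):
            lp += math.log((feat_counts[gender][f] + 1) / denom)
        if lp > best_lp:
            best, best_lp = gender, lp
    return best

# Toy usage with invented labels:
model = train_nb([("丽", "F"), ("芳", "F"), ("伟", "M"), ("强", "M")])
print(predict(model, "丽"))
```

This is only a conceptual sketch of the baseline family the paper compares against; the actual experiments use the released large-scale datasets and stronger feature representations.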
This paper was published during my graduate studies at Beihang University. I cannot go back to the university because of COVID-19. Luckily, I found some of the original data and uploaded it to Google Drive and Baidu Drive (pass: a2m0). I will make it available when possible.
The code will be available later.
The paper was accepted by Springer last year. This year, Elian CARSENAT emailed me and pointed out a web link mistake in my paper, so the corrected version is provided here. Thanks, Elian CARSENAT.