rime / rime-cantonese

Rime Cantonese input schema | 粵語拼音輸入方案
https://jyutping.net/
Creative Commons Attribution 4.0 International

Could nouns be imported from Cantonese Wikipedia? #35

Closed alex-the-man closed 4 years ago

alex-the-man commented 4 years ago

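The dictionary is currently missing a lot of words. Many article titles on Cantonese Wikipedia are valid nouns. Would it be a good idea to import a large number of these titles as nouns, or would that pollute the dictionary? If it would not, I can look into this further.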

laubonghaudoi commented 4 years ago

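Importing entries from Cantonese Wikipedia may have a side effect: it can disturb the frequency ranking of the existing words. This matters most with abbreviated input (typing only the initial of each syllable), where the imported uncommon words may be ranked ahead of common ones and get in the way of normal use. I suggest you try it first: import all the nouns into jyut6ping3.phrase.yaml and use it yourself to see whether any major problems come up. If there are none, open a new pull request and we will merge it. Thank you very much for contributing.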
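A minimal sketch of preparing rows for such a trial import, assuming the dictionary file takes tab-separated rows of text, Jyutping code, and an optional weight. The helper name `format_dict_rows` and the sample entries are hypothetical, and the readings were typed by hand for illustration; a real import would need a reliable source of readings for each title.

```python
from typing import Iterable, Optional, Tuple


def format_dict_rows(entries: Iterable[Tuple[str, str, Optional[int]]]) -> str:
    """Render (text, code, weight) triples as tab-separated rows.

    The weight column is omitted when weight is None.
    """
    rows = []
    for text, code, weight in entries:
        fields = [text, code]
        if weight is not None:
            fields.append(str(weight))
        rows.append("\t".join(fields))
    return "\n".join(rows) + "\n"


# Hypothetical sample entries, not taken from the actual dictionary.
sample = [
    ("維基百科", "wai4 gei1 baak3 fo1", None),
    ("香港", "hoeng1 gong2", 5),
]
print(format_dict_rows(sample), end="")
```

The output could be appended to a local copy of the dictionary for personal testing before any pull request is opened.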

alex-the-man commented 4 years ago

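Could the imported entries be set to the lowest frequency? I will also check whether the wiki has any kind of ranking. Terms that are too obscure can be left out of the import.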
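A rough sketch of leaving out obscure terms, assuming some popularity score is available for each title (for example page-view counts, however they are obtained). The function name `filter_by_popularity`, the threshold, and the numbers are all hypothetical.

```python
from typing import Dict, List


def filter_by_popularity(views: Dict[str, int], min_views: int) -> List[str]:
    """Keep titles whose score is at least min_views, most popular first."""
    kept = [title for title, score in views.items() if score >= min_views]
    return sorted(kept, key=lambda title: -views[title])


# Made-up scores for illustration only.
views = {"香港": 900, "維基百科": 400, "某冷門條目": 3}
print(filter_by_popularity(views, min_views=100))
```

Where to set the threshold is a judgment call; it would probably need tuning against how the resulting candidates feel in actual typing.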

laubonghaudoi commented 4 years ago

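Setting the imported entries to the lowest frequency will not help, because the vast majority of entries in the current dictionary table are already at the lowest frequency, so your imported entries would simply tie with them. Feel free to try it and see how it works; if it is fine, open a PR. If it is not, then after importing all the entries we would have to recount the frequencies for the whole table, which is a lot of work. (This is something we have always wanted to do, but nobody has had the time or energy. If you have the time and energy to help us with it, you are very welcome.)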
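To give a sense of what recounting frequencies involves, here is a naive sketch that counts how often each entry occurs in a corpus. The function name `count_terms` and the sample corpus are hypothetical, and plain substring counting is an assumption made for brevity: it counts a short term again whenever it appears inside a longer one, so a real recount would likely need proper word segmentation.

```python
from typing import Dict, Iterable


def count_terms(terms: Iterable[str], corpus: str) -> Dict[str, int]:
    """Count non-overlapping substring occurrences of each term in the corpus."""
    return {term: corpus.count(term) for term in terms}


# Tiny made-up corpus for illustration only.
corpus = "我今日去香港大學,聽日都去香港大學,後日去香港。"
counts = count_terms(["香港", "香港大學", "維基百科"], corpus)
for term, n in counts.items():
    print(term, n)
```

Here 香港 is counted three times even though two of those occurrences are part of 香港大學, which illustrates why a simple count can inflate the frequency of short words.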

laubonghaudoi commented 4 years ago

@alex-the-man Is there any progress on importing the Wikipedia entries? If there is no response, I will close this issue.

alex-the-man commented 4 years ago

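You can close it for now. I'm writing some Python scripts to compute word frequencies, and I will import the words once that is done. If I have questions about word frequencies, how can I reach you? A TG group, for example? Thanks.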

laubonghaudoi commented 4 years ago

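Thanks for all your hard work. There is no need to close it; once you are done, you can just post updates here. We are always in the Telegram group, so you can @ one of the admins there at any time.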

tanxpyox commented 4 years ago

> You can close it for now. I'm writing some Python scripts to compute word frequencies

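@alex-the-man Word frequencies are now handled in essay-cantonese, so you can just push the word list over as it is. (If you can, could you also take a look at Wikipedia's licensing terms to check whether doing this would cause us any problems?)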

CC: I think @William8915 is the person here most familiar with Cantonese Wikipedia. If that is OK, shall I assign this issue to you?

laubonghaudoi commented 4 years ago

@William8915 @alex-the-man Are the two of you still working on this? If not, I will close this issue for now.