Closed · AlienKevin closed this issue 2 years ago
@AlienKevin The data here is all pulled from the upstream word lists; could you check there instead? https://github.com/CanCLID/rime-cantonese-upstream Your regex ^(.*)(\n\1)+$ doesn't seem to work there: it finds 0 results.
Do you mean the problem of the same pronunciation/word having several written forms? We haven't had time to address that yet, because it requires manually designating one standard written form, and since there is no public consensus on what that standard form should be, it will have to be settled by later discussion.
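One likely reason the regex finds nothing upstream: ^(.*)(\n\1)+$ needs multiline mode and only matches identical lines that sit directly next to each other, so it misses duplicates that are separated or split across files. A small Python sketch (toy data, not the real CSVs) showing the difference:

```python
import re
from collections import Counter

# Toy data: the duplicate lines are NOT adjacent, mimicking entries
# that are split across phrase_fragment.csv and word.csv.
text = "到未啊,dou3 mei6 aa3\n早晨,zou2 san4\n到未啊,dou3 mei6 aa3"

# The regex needs re.MULTILINE and only matches *adjacent* repeats,
# so it finds nothing here (non-capturing group keeps findall clean):
adjacent = re.findall(r'^(.*)(?:\n\1)+$', text, flags=re.MULTILINE)
print(adjacent)  # []

# Counting lines catches duplicates wherever they appear:
dupes = [line for line, n in Counter(text.splitlines()).items() if n > 1]
print(dupes)  # ['到未啊,dou3 mei6 aa3']
```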
@laubonghaudoi I've checked the upstream word lists and they look fine, but for some reason this repo's jyut6ping3.words.dict.yaml does have duplicates. By duplicates I don't mean entries with multiple written forms; I mean entries where both the written form and the pronunciation are exactly identical:
Thanks for explaining. We currently use the upstream word lists as the sole data source, so if this is happening, something must have gone wrong in the pull script and produced these duplicate entries. Could it be a bug in fetch_upstream.py?
The problem does seem to be in fetch_upstream.py (link to problematic segment). phrase_fragment.csv and word.csv have 859 entries in common. The code loops over both files and does words_list.append((char, jyutping)). Since words_list is not a set, the duplicate entries are all added to the list.
Shouldn't phrase_fragment.csv and word.csv be free of overlapping entries?
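If the overlap upstream is ever intentional, another option is to dedupe on the fetch side. A minimal sketch, assuming entries are (char, jyutping) tuples as in fetch_upstream.py, that drops exact repeats while preserving order (the helper and sample data are illustrative, not the real code):

```python
def dedupe(entries):
    """Drop exact (char, jyutping) repeats, keeping first occurrence and order."""
    seen = set()
    out = []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            out.append(entry)
    return out

# Toy words_list with one cross-file duplicate:
words_list = [('到未啊', 'dou3 mei6 aa3'),
              ('早晨', 'zou2 san4'),
              ('到未啊', 'dou3 mei6 aa3')]
print(dedupe(words_list))
# [('到未啊', 'dou3 mei6 aa3'), ('早晨', 'zou2 san4')]
```

A plain set would also work if ordering of the generated dictionary didn't matter, but the seen-set loop keeps the upstream ordering intact.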
Right. If that's the case, then it's a problem in our upstream word lists, and we should add a validation script that checks for duplicate entries across files. I can go back and manually review those few hundred entries and delete them. As for the validation script, could @ayaka14732 or @AlienKevin help write it?
I can try writing a script.
Here's a script that finds all duplicated lines within each CSV file as well as across all the files. It writes a TSV file containing each duplicated line followed by a list of two or more filename:line-number locations where that line appears.
A sample output line looks like:
到未啊,dou3 mei6 aa3 [phrase_fragment.csv:1830, word.csv:11234]
Here's the script. It can be placed in the scripts folder. When executed at the project root, it outputs a file called duplicates.tsv.
from glob import iglob

line_to_locations = {}
for filename in iglob('*.csv'):
    with open(filename) as f:
        assert next(f).startswith('char,jyutping'), 'Invalid CSV header'
        # Data lines start at line 2, after the header.
        for line_num, line in enumerate(f, 2):
            location = f'{filename}:{line_num}'
            if line in line_to_locations:
                line_to_locations[line].append(location)
            else:
                line_to_locations[line] = [location]

with open('duplicates.tsv', 'w') as f:
    for line, locations in line_to_locations.items():
        if len(locations) > 1:
            locations_str = ', '.join(locations)
            f.write(f'{line.strip()}\t[{locations_str}]\n')
Thanks a lot. I'll go back and delete the extra copies of these entries. Also, once the validation script is done, could you PR it directly to the upstream word lists?
OK, I'll submit a PR. One question: what's the difference between phrase_fragment and word? Why do they need to be two separate files?
I've just deleted the duplicate entries: https://github.com/CanCLID/rime-cantonese-upstream/commit/a80ee507406660331e6e7f9d0c79b0ef4db36ed5 They are split into two files because some entries are vocabulary words while others are sentences or text fragments, and the two are used quite differently, so they are kept separate.
Oh, I see.
Problem description
I found 859 entries, such as 吊頸都要唞下氣 and 多隻香爐多隻鬼, that are duplicated with identical Chinese characters and Jyutping pronunciations in jyut6ping3.words.dict.yaml. You can find all the duplicated lines by searching with the regex ^(.*)(\n\1)+$. I don't know whether this duplication is intended behavior?
Suggested change
Remove the duplicated entries.