Error generating the zip code dictionary with the November 29 edition of the zip code data

GoogleCodeExporter commented 9 years ago

What version of the product are you using? On what operating system?
mozc-1.12.1599.102
PCLinuxOS
python-2.7.5

Please provide any additional information below.
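When I build mozc with the November 29, 2013 edition of the zip code data, the zip code dictionary generation script fails with the error below. (The KEN_ALL.CSV used is the variant in which geminate and contracted sounds in the reading data are written in small kana.)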

~/rpm/BUILD/mozc-1.12.1599.102
+ cd data/dictionary_oss
+ python ../../dictionary/gen_zip_code_seed.py --zip_code=KEN_ALL.CSV --jigyosyo=JIGYOSYO.CSV
Traceback (most recent call last):
  File "../../dictionary/gen_zip_code_seed.py", line 270, in <module>
    sys.exit(main())
  File "../../dictionary/gen_zip_code_seed.py", line 261, in main
    ProcessZipCodeCSV(options.zip_code)
  File "../../dictionary/gen_zip_code_seed.py", line 91, in ProcessZipCodeCSV
    for entry in ReadZipCodeEntries(tokens[2], tokens[6], tokens[7], tokens[8]):
  File "../../dictionary/gen_zip_code_seed.py", line 189, in ReadZipCodeEntries
    for town in ParseTownName(level3)]
  File "../../dictionary/gen_zip_code_seed.py", line 204, in ParseTownName
    % level3.encode('utf-8'))
AssertionError: failed to be merged 大桑町（ア、イ、ヰ、ウ、上野、ヲ、オ乙、鐘搗山、上川原、上猫下、
error: Bad exit status from /var/tmp/rpm-tmp.1TepSE (%prep)

Searching KEN_ALL.CSV for the string "大桑町（ア、イ、ヰ、ウ、上野、ヲ、オ乙、鐘搗山、上川原、上猫下、" matched the following row at line 51673.

17201,"92181","9218046","ｲｼｶﾜｹﾝ","ｶﾅｻﾞﾜｼ","ｵｵｸﾜﾏﾁ(ｱ､ｲ､ｲ､ｳ､ｳｴﾉ､ｵ､ｵｵﾂ､ｶﾈﾂｷﾔﾏ､ｶﾐｶﾜﾗ､ｶﾐﾈｺｼﾀ､","石川県","金沢市","大桑町（ア、イ、ヰ、ウ、上野、ヲ、オ乙、鐘搗山、上川原、上猫下、",1,0,0,1,1,5
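The row above opens a full-width parenthesis in the town-name field and never closes it. To look for other rows of this kind, one could scan that column for unmatched parentheses. The following is a minimal Python 3 sketch and not part of Mozc: `find_unbalanced_rows` and the sample rows are made up for illustration, and the column index 8 is taken from the `tokens[8]` argument shown in the traceback.

```python
import csv
import io

OPEN_PAREN = "（"   # full-width parentheses, as in the kanji town-name column
CLOSE_PAREN = "）"


def find_unbalanced_rows(csv_text, town_column=8):
    """Return (row_number, town_name) for rows with unmatched parentheses."""
    hits = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if len(row) <= town_column:
            continue  # skip short or empty rows
        town = row[town_column]
        if town.count(OPEN_PAREN) != town.count(CLOSE_PAREN):
            hits.append((row_number, town))
    return hits


# Hypothetical, shortened rows in the same column layout (not real data).
SAMPLE = "\n".join([
    '00000,"000","0000001","","","","県","市","甲町",0,0,0,0,0,0',
    '00000,"000","0000002","","","","県","市","乙町（ア、イ、",0,0,0,0,0,0',
    '00000,"000","0000002","","","","県","市","ウ、エ）",0,0,0,0,0,0',
])

for row_number, town in find_unbalanced_rows(SAMPLE):
    print(row_number, town)
```

For the real file, the text would have to be decoded first. This sketch assumes the downloaded KEN_ALL.CSV is Shift_JIS, so something like `open("KEN_ALL.CSV", encoding="cp932").read()` would supply `csv_text`.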

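As a test, I built with a KEN_ALL.CSV from which this row had been deleted, and no error occurred.
The build with the October 31 edition of the zip code data did not produce the error, so the problem may lie in the November 29 edition of the data itself, but I am reporting it here anyway.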
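The deletion test can be reproduced without opening the file in an editor. This is a minimal Python 3 sketch under stated assumptions: `drop_line` is a hypothetical helper, not something in Mozc, and it works on raw bytes so that the file's encoding and line endings are left untouched. Note that it discards data and is only useful for isolating the failing row.

```python
def drop_line(data: bytes, line_number: int) -> bytes:
    """Return a copy of data without the given 1-based line."""
    lines = data.splitlines(keepends=True)  # keep the original line endings
    if not 1 <= line_number <= len(lines):
        raise ValueError("line %d is out of range" % line_number)
    del lines[line_number - 1]
    return b"".join(lines)


# Tiny stand-in for the real file contents.
example = b"row1\r\nrow2\r\nrow3\r\n"
print(drop_line(example, 2))
```

Applied to the case here, the bytes of KEN_ALL.CSV would be read with `open(path, "rb")`, passed through `drop_line(data, 51673)`, and written to a separate copy that is then used for the build.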

Original issue reported on code.google.com by superhor...@gmail.com on 3 Dec 2013 at 5:56

Merged into: #272

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

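Thank you for the information.
I was able to reproduce the problem with the November 29, 2013 and December 27, 2013 editions of the zip code data distributed by Japan Post Co., Ltd., using Mozc r178.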

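For the time being, please work around it in one of the following ways:
a. Keep using the October 31, 2013 edition of the zip code data.
b. Instead of the primary data distributed by Japan Post Co., Ltd., obtain and use cleaned-up secondary data (after checking its license terms). (Example: http://zipcloud.ibsnet.co.jp )
I expect that gen_zip_code_seed.py in Mozc will eventually handle this as well, but there is no schedule for that yet.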

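The zip code data published by Japan Post Co., Ltd. contains a number of records that are hard to process mechanically, so at present it is difficult to guarantee that problems like this will not recur. Searching for "ken_all.csv" should turn up many sites that discuss this problem.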
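One often-discussed difficulty is that a long town name is split across consecutive rows that share a zip code, so a parenthesis opened in one row is closed in a later one. The following Python 3 sketch shows one way such rows could be rejoined. It is not the logic in gen_zip_code_seed.py; `merge_split_towns` is a hypothetical name, and only the kanji town-name column is merged here (the kana reading column would need the same treatment).

```python
def merge_split_towns(rows, zip_column=2, town_column=8):
    """Join rows whose town name continues until its parentheses are closed."""
    merged = []
    pending = None  # row with a still-open full-width parenthesis
    for row in rows:
        row = list(row)
        if pending is not None:
            if row[zip_column] != pending[zip_column]:
                raise ValueError(
                    "unclosed parenthesis before zip code %s" % row[zip_column])
            # Append this fragment to the town name being assembled.
            pending[town_column] += row[town_column]
            row = pending
        town = row[town_column]
        if town.count("（") > town.count("）"):
            pending = row  # still open, wait for the next row
        else:
            merged.append(row)
            pending = None
    if pending is not None:
        raise ValueError("unclosed parenthesis at end of input")
    return merged


# Hypothetical two-column rows: zip code, town name.
rows = [
    ["0000001", "甲町"],
    ["0000002", "乙町（ア、イ、"],
    ["0000002", "ウ、エ）"],
]
for row in merge_split_towns(rows, zip_column=0, town_column=1):
    print(row)
```

Raising an error instead of guessing when the zip code changes mid-parenthesis keeps unexpected input visible, which seems preferable for data that changes every month.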

Original comment by yukawa@google.com on 31 Dec 2013 at 9:47

GoogleCodeExporter commented 9 years ago

Thank you for the comment.

So the KEN_ALL.CSV distributed by Japan Post Co., Ltd. has several known problems.
For now, I am thinking of using pre-processed zip code data.
Thank you for suggesting the workarounds.

Original comment by superhor...@gmail.com on 9 Jan 2014 at 9:52

GoogleCodeExporter commented 9 years ago

I re-filed this as Issue 272 and fixed it in r483. I would appreciate it if you could verify the fix.

Original comment by yukawa@google.com on 17 Jan 2015 at 4:11

Changed state: Duplicate

GoogleCodeExporter commented 9 years ago

I have confirmed that with r483 the zip code dictionary is generated without problems from the "raw data".
Thank you for taking care of this.

Original comment by superhor...@gmail.com on 18 Jan 2015 at 3:18

ogata0916 / mozc

Error generating the zip code dictionary with the November 29 edition of the zip code data #205