textlint-ja / textlint-rule-no-doubled-joshi

A textlint rule that checks for the same particle (joshi) appearing more than once in a sentence
MIT License

Allow repeated particles when they are of different types #6

Closed takahashim closed 8 years ago

takahashim commented 8 years ago

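Given 『ターミナルで「test」と入力すると』, the rule reports the error 「一文に二回以上利用されている助詞 "と" がみつかりました」 (the particle "と" was found two or more times in one sentence). However, the first 「と」 is a case particle (格助詞) and the second 「と」 is a conjunctive particle (接続助詞). I would like the repetition to be allowed in cases like this.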
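The false positive can be reproduced in miniature. The sketch below assumes (this is not taken from the rule's source) that the check at the time grouped particles by surface form alone. The token objects are hand-written stand-ins for tokenizer output, and `countBySurface` is a hypothetical helper name.

```javascript
// Hand-written stand-ins for the particle tokens in
// "ターミナルで「test」と入力すると" (not real tokenizer output).
const particles = [
  { surface_form: "で", pos: "助詞" },
  { surface_form: "と", pos: "助詞" }, // first と: case particle
  { surface_form: "と", pos: "助詞" }, // second と: conjunctive particle
];

// Hypothetical helper: count particles keyed by surface form only.
function countBySurface(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    if (token.pos !== "助詞") continue;
    counts.set(token.surface_form, (counts.get(token.surface_form) ?? 0) + 1);
  }
  return counts;
}

// Any surface form seen twice or more is reported, whatever its grammatical role.
const doubled = [...countBySurface(particles)]
  .filter(([, count]) => count >= 2)
  .map(([surface]) => surface);
console.log(doubled); // [ 'と' ]
```

Under that assumption both 「と」 tokens land in the same bucket, which matches the error described above.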

azu commented 8 years ago

It looks like we should check down to the part-of-speech subcategory (pos_detail_1). (It would be nice if adding sample test cases were easier…)

$ npm i -g kuromoji-cli
$ kuromoji "ターミナルで「test」と入力すると"
[
    {
        "word_id": 434620,
        "word_type": "KNOWN",
        "word_position": 1,
        "surface_form": "ターミナル",
        "pos": "名詞",
        "pos_detail_1": "一般",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "ターミナル",
        "reading": "ターミナル",
        "pronunciation": "ターミナル"
    },
    {
        "word_id": 2594250,
        "word_type": "KNOWN",
        "word_position": 6,
        "surface_form": "で",
        "pos": "助詞",
        "pos_detail_1": "格助詞",
        "pos_detail_2": "一般",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "で",
        "reading": "デ",
        "pronunciation": "デ"
    },
    {
        "word_id": 2613610,
        "word_type": "KNOWN",
        "word_position": 7,
        "surface_form": "「",
        "pos": "記号",
        "pos_detail_1": "括弧開",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "「",
        "reading": "「",
        "pronunciation": "「"
    },
    {
        "word_id": 120,
        "word_type": "UNKNOWN",
        "word_position": 8,
        "surface_form": "test",
        "pos": "名詞",
        "pos_detail_1": "固有名詞",
        "pos_detail_2": "組織",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "*"
    },
    {
        "word_id": 2611700,
        "word_type": "KNOWN",
        "word_position": 12,
        "surface_form": "」",
        "pos": "記号",
        "pos_detail_1": "括弧閉",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "」",
        "reading": "」",
        "pronunciation": "」"
    },
    {
        "word_id": 2595020,
        "word_type": "KNOWN",
        "word_position": 13,
        "surface_form": "と",
        "pos": "助詞",
        "pos_detail_1": "格助詞",
        "pos_detail_2": "引用",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "と",
        "reading": "ト",
        "pronunciation": "ト"
    },
    {
        "word_id": 2567130,
        "word_type": "KNOWN",
        "word_position": 14,
        "surface_form": "入力",
        "pos": "名詞",
        "pos_detail_1": "サ変接続",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "入力",
        "reading": "ニュウリョク",
        "pronunciation": "ニューリョク"
    },
    {
        "word_id": 3168910,
        "word_type": "KNOWN",
        "word_position": 16,
        "surface_form": "する",
        "pos": "動詞",
        "pos_detail_1": "自立",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "サ変・スル",
        "conjugated_form": "基本形",
        "basic_form": "する",
        "reading": "スル",
        "pronunciation": "スル"
    },
    {
        "word_id": 2594810,
        "word_type": "KNOWN",
        "word_position": 18,
        "surface_form": "と",
        "pos": "助詞",
        "pos_detail_1": "接続助詞",
        "pos_detail_2": "*",
        "pos_detail_3": "*",
        "conjugated_type": "*",
        "conjugated_form": "*",
        "basic_form": "と",
        "reading": "ト",
        "pronunciation": "ト"
    }
]

azu commented 8 years ago

I opened a PR with the implementation: #7.

azu commented 8 years ago

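The implementation simply adds each token's pos_detail_1 to the key used for comparison. However, pos_detail_1, pos_detail_2, and pos_detail_3 look like slots that are filled in order (1, 2, 3) whenever a finer classification exists, so there may be cases where looking only at pos_detail_1 gives odd results. (I assume the fill order is always fixed, so using only 1 should be fine, but if we also want to look at 2 and 3, the key would probably have to become a Map-like, order-independent structure.)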
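The comparison described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `createKey` and `findDoubledParticles` are hypothetical names, and the `depth` parameter is an assumption added only to show what also looking at pos_detail_2 and pos_detail_3 would involve. The tokens are trimmed copies of the particle entries in the kuromoji output earlier in the thread.

```javascript
// Particle tokens trimmed from the kuromoji output shown earlier in the thread.
const tokens = [
  { surface_form: "で", pos: "助詞", pos_detail_1: "格助詞", pos_detail_2: "一般", pos_detail_3: "*" },
  { surface_form: "と", pos: "助詞", pos_detail_1: "格助詞", pos_detail_2: "引用", pos_detail_3: "*" },
  { surface_form: "と", pos: "助詞", pos_detail_1: "接続助詞", pos_detail_2: "*", pos_detail_3: "*" },
];

// Hypothetical helper: build a comparison key from the surface form plus the
// first `depth` pos_detail_N fields. depth = 1 corresponds to the approach
// described in the comment (surface form + pos_detail_1).
function createKey(token, depth = 1) {
  const details = [token.pos_detail_1, token.pos_detail_2, token.pos_detail_3];
  return [token.surface_form, ...details.slice(0, depth)].join(":");
}

// Hypothetical helper: return the keys that occur at least twice among particles.
function findDoubledParticles(tokenList, depth = 1) {
  const counts = new Map();
  for (const token of tokenList) {
    if (token.pos !== "助詞") continue;
    const key = createKey(token, depth);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts].filter(([, n]) => n >= 2).map(([key]) => key);
}

console.log(tokens.map((t) => createKey(t))); // [ 'で:格助詞', 'と:格助詞', 'と:接続助詞' ]
console.log(findDoubledParticles(tokens)); // [] (no key repeats, so nothing is reported)
```

Joining the fields in a fixed order relies on the slots always being filled in the same order, which is exactly the caveat raised in the comment; an order-independent key would need a different structure.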

takahashim commented 8 years ago

#7 Thank you!

The pos_detail_1 field in kuromoji.js reportedly comes from MeCab's IPADIC (http://stp-the-wld.blogspot.jp/2015/01/javascriptkuromojijs.html ), which appears to be based on the IPA part-of-speech system.

If that is right, looking at pos_detail_1 alone should be enough for particles.

azu commented 8 years ago

@takahashim I see. Thank you.

azu commented 8 years ago

Merged and released as 3.2.0.

takahashim commented 8 years ago

Thank you very much!

naskya commented 1 year ago

$A \coloneqq B + C$ と置くと、 $f(A) = 0$ が成り立つ。

When I wrote a sentence like this, I still got flagged at the 「と置くと」 part, so I am mentioning it for reference. (I think this kind of expression comes up often.)

azu commented 1 year ago

@naskya This may have a different cause, so it would help if you could open a separate issue.

The math notation might have something to do with it, so a reproduction in plain text would be helpful.

B+Cと置くと、A=Bが成り立つ。

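With this input, the two 「と」 differ in part-of-speech subcategory 1 (conjunctive particle vs. case particle), so I could not reproduce the problem: https://azu.github.io/morpheme-match/?text=B+C%E3%81%A8%E7%BD%AE%E3%81%8F%E3%81%A8%E3%80%81A=B%E3%81%8C%E6%88%90%E3%82%8A%E7%AB%8B%E3%81%A4%E3%80%82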
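A link like the one above can be generated for any plain-text reproduction. This sketch assumes only what the URL itself shows, namely that the page reads the sentence from a `text` query parameter; `buildMorphemeMatchUrl` is a hypothetical helper name. Note that `encodeURIComponent` escapes `+` as `%2B`, so the generated link differs slightly from the hand-written one; this avoids the ambiguity that some query parsers decode a literal `+` as a space.

```javascript
// Base URL taken from the link in the comment above.
const BASE_URL = "https://azu.github.io/morpheme-match/";

// Hypothetical helper: build a shareable link for a plain-text reproduction.
function buildMorphemeMatchUrl(text) {
  return `${BASE_URL}?text=${encodeURIComponent(text)}`;
}

const sentence = "B+Cと置くと、A=Bが成り立つ。";
const url = buildMorphemeMatchUrl(sentence);
console.log(url);

// Round trip: parsing the query string recovers the original sentence.
console.log(new URL(url).searchParams.get("text") === sentence); // true
```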

`$A \coloneqq B + C$ と置くと、 $f(A) = 0$ が成り立つ。` could not be reproduced in a test case either.

```json
[
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 1, "surface_form": "$", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 120, "word_type": "UNKNOWN", "word_position": 2, "surface_form": "A", "pos": "名詞", "pos_detail_1": "固有名詞", "pos_detail_2": "組織", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 3, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 4, "surface_form": "\\", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 120, "word_type": "UNKNOWN", "word_position": 5, "surface_form": "coloneqq", "pos": "名詞", "pos_detail_1": "固有名詞", "pos_detail_2": "組織", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 13, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 120, "word_type": "UNKNOWN", "word_position": 14, "surface_form": "B", "pos": "名詞", "pos_detail_1": "固有名詞", "pos_detail_2": "組織", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 15, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 16, "surface_form": "+", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 17, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 120, "word_type": "UNKNOWN", "word_position": 18, "surface_form": "C", "pos": "名詞", "pos_detail_1": "固有名詞", "pos_detail_2": "組織", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 19, "surface_form": "$", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 20, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 92760, "word_type": "KNOWN", "word_position": 21, "surface_form": "と", "pos": "助詞", "pos_detail_1": "格助詞", "pos_detail_2": "引用", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "と", "reading": "ト", "pronunciation": "ト" },
  { "word_id": 3830190, "word_type": "KNOWN", "word_position": 22, "surface_form": "置く", "pos": "動詞", "pos_detail_1": "自立", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "五段・カ行イ音便", "conjugated_form": "基本形", "basic_form": "置く", "reading": "オク", "pronunciation": "オク" },
  { "word_id": 92550, "word_type": "KNOWN", "word_position": 24, "surface_form": "と", "pos": "助詞", "pos_detail_1": "接続助詞", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "と", "reading": "ト", "pronunciation": "ト" },
  { "word_id": 90910, "word_type": "KNOWN", "word_position": 25, "surface_form": "、", "pos": "記号", "pos_detail_1": "読点", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "、", "reading": "、", "pronunciation": "、" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 26, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 27, "surface_form": "$", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 100, "word_type": "UNKNOWN", "word_position": 28, "surface_form": "f", "pos": "名詞", "pos_detail_1": "一般", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 29, "surface_form": "(", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 100, "word_type": "UNKNOWN", "word_position": 30, "surface_form": "A", "pos": "名詞", "pos_detail_1": "一般", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 31, "surface_form": ")", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 32, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 33, "surface_form": "=", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 34, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 90, "word_type": "UNKNOWN", "word_position": 35, "surface_form": "0", "pos": "名詞", "pos_detail_1": "数", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 80, "word_type": "UNKNOWN", "word_position": 36, "surface_form": "$", "pos": "名詞", "pos_detail_1": "サ変接続", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 10, "word_type": "UNKNOWN", "word_position": 37, "surface_form": " ", "pos": "記号", "pos_detail_1": "空白", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "*" },
  { "word_id": 92920, "word_type": "KNOWN", "word_position": 38, "surface_form": "が", "pos": "助詞", "pos_detail_1": "格助詞", "pos_detail_2": "一般", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "が", "reading": "ガ", "pronunciation": "ガ" },
  { "word_id": 2844470, "word_type": "KNOWN", "word_position": 39, "surface_form": "成り立つ", "pos": "動詞", "pos_detail_1": "自立", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "五段・タ行", "conjugated_form": "基本形", "basic_form": "成り立つ", "reading": "ナリタツ", "pronunciation": "ナリタツ" },
  { "word_id": 90940, "word_type": "KNOWN", "word_position": 43, "surface_form": "。", "pos": "記号", "pos_detail_1": "句点", "pos_detail_2": "*", "pos_detail_3": "*", "conjugated_type": "*", "conjugated_form": "*", "basic_form": "。", "reading": "。", "pronunciation": "。" }
]
```

naskya commented 1 year ago

@azu Sorry. I wrote a careless comment when I was not thinking clearly. It was not a valid report, so I retract it.

I normally use textlint only to proofread LaTeX documents, and I had forgotten that I use textlint-plugin-latex2e alongside it (so in a sense I am running textlint in an unusual configuration).

As you pointed out, linting this sentence without textlint-plugin-latex2e produces no error, so this does not appear to be a bug in no-doubled-joshi.
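For anyone comparing the two setups, a configuration along these lines would correspond to the one with the plugin enabled. It is only a sketch: it assumes a JavaScript config file (`.textlintrc.js`), that both packages are installed, and that they are referenced by their short names (`latex2e`, `no-doubled-joshi`) following textlint's usual convention of dropping the package-name prefix.

```javascript
// .textlintrc.js (sketch): the rule together with the LaTeX plugin.
// Assumption: textlint-plugin-latex2e and textlint-rule-no-doubled-joshi
// are installed and resolvable under these short names.
const config = {
  // Remove the plugin entry to lint the same sentence as plain text.
  plugins: ["latex2e"],
  rules: {
    "no-doubled-joshi": true,
  },
};

module.exports = config;
```

Running the same sentence with and without the `plugins` entry is a quick way to tell whether a report comes from the rule itself or from how the plugin parses the document.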

From now on, if I find a suspicious case, I will prepare a reproduction in plain text first. Thank you.