jyhsu2000 / CKIPService

Web service for ckiplab/ckiptagger

Use JSON as the response format #3

Closed johnroyer closed 2 months ago

johnroyer commented 3 months ago

Testing with curl gives the following result:

curl -X POST localhost:5005 -F 'sentence_list=湯姆克魯斯駕駛 SR-72 來逢甲夜市'
湯姆克魯斯(Nb) 駕駛(VC)  SR-72 (FW) 來(D) 逢甲(Nb) 夜市(Nc) 
(0, 5, 'PERSON', '湯姆克魯斯')
(15, 17, 'GPE', '逢甲')

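In the first line, the words are separated by a mix of half-width and full-width spaces, which I find inconvenient to process programmatically.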
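To illustrate the difficulty, here is a minimal Python sketch. It assumes, without confirmation from the service's code, that a full-width space (U+3000) separates tokens while a half-width space can appear inside a token; the sample line is a shortened, hand-built stand-in for the response, not actual output.

```python
# Hypothetical stand-in for one line of the plain-text response.
# Assumption: U+3000 (full-width space) separates tokens, while an ASCII
# space may appear inside a token such as "SR-72 (FW)".
line = "駕駛(VC)\u3000SR-72 (FW)\u3000來(D)"

# str.split() with no argument splits on any Unicode whitespace,
# so the token containing an ASCII space is broken in two.
naive_tokens = line.split()

# Splitting only on the full-width space keeps each token intact,
# but only if the separator assumption above holds.
fullwidth_tokens = line.split("\u3000")

print(naive_tokens)
print(fullwidth_tokens)
```

A client has to know which kind of space is the separator to split the line correctly, which is the sort of guesswork a structured format avoids.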

A JSON format like the following might be easier to parse:

[
    {
        "詞彙": "湯姆克魯斯",
        "類別": "Nb"
    },
    {
        "詞彙": "駕駛",
        "類別": "VC"
    }
]

The keys 詞彙 (word) and 類別 (category) need to be changed, but I'm not sure which terms the CKIP project uses for this kind of content.

jyhsu2000 commented 3 months ago

They refer to these as 詞彙 / word and 詞性 / pos, respectively.

See the CkipTagger project's README.md and its Chinese README.
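With those names, the proposed per-word structure could be built roughly as follows. This is only a sketch: build_segments is a hypothetical helper, not part of CKIPService or CkipTagger, and it assumes the words and their part-of-speech tags are available as two parallel lists.

```python
import json


def build_segments(words, pos_tags):
    """Pair each word with its part-of-speech tag under the keys 'word' and 'pos'."""
    if len(words) != len(pos_tags):
        raise ValueError("words and pos_tags must have the same length")
    return [{"word": w, "pos": p} for w, p in zip(words, pos_tags)]


# Sample data taken from the curl output earlier in this thread.
words = ["湯姆克魯斯", "駕駛"]
pos_tags = ["Nb", "VC"]

segments = build_segments(words, pos_tags)

# ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes.
print(json.dumps(segments, ensure_ascii=False, indent=4))
```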

johnroyer commented 2 months ago

Is the (15, 17, 'GPE', '逢甲') part the named entity recognition (NER) output?

The third element should be the entity type. Is there any documentation describing the first two?

Thanks.

johnroyer commented 2 months ago

After some adjustments, the JSON output looks like this:

{
  "sentences": [
    {
      "segments": [
        {
          "word": "土地公",
          "pos": "Nb"
        },
        {
          "word": "有",
          "pos": "V_2"
        },
        {
          "word": "政策",
          "pos": "Na"
        },
        {
          "word": "?",
          "pos": "QUESTIONCATEGORY"
        },
        {
          "word": "?",
          "pos": "QUESTIONCATEGORY"
        },
        {
          "word": "還是",
          "pos": "Caa"
        },
        {
          "word": "土地",
          "pos": "Na"
        },
        {
          "word": "婆",
          "pos": "Na"
        },
        {
          "word": "有",
          "pos": "V_2"
        },
        {
          "word": "政策",
          "pos": "Na"
        },
        {
          "word": "。",
          "pos": "PERIODCATEGORY"
        },
        {
          "word": ".",
          "pos": "PERIODCATEGORY"
        }
      ],
      "entities": [
        "(0, 3, 'PERSON', '土地公')"
      ]
    },
    {
      "segments": [
        {
          "word": "最多",
          "pos": "VH"
        },
        {
          "word": "容納",
          "pos": "VJ"
        },
        {
          "word": "59,000",
          "pos": "Neu"
        },
        {
          "word": "個",
          "pos": "Nf"
        },
        {
          "word": "人",
          "pos": "Na"
        },
        {
          "word": ",",
          "pos": "COMMACATEGORY"
        },
        {
          "word": "或",
          "pos": "Caa"
        },
        {
          "word": "5.9萬",
          "pos": "Neu"
        },
        {
          "word": "人",
          "pos": "Na"
        },
        {
          "word": ",",
          "pos": "COMMACATEGORY"
        },
        {
          "word": "再",
          "pos": "D"
        },
        {
          "word": "多",
          "pos": "D"
        },
        {
          "word": "就",
          "pos": "D"
        },
        {
          "word": "不行",
          "pos": "VH"
        },
        {
          "word": "了",
          "pos": "T"
        },
        {
          "word": ".",
          "pos": "PERIODCATEGORY"
        },
        {
          "word": "這",
          "pos": "Nep"
        },
        {
          "word": "是",
          "pos": "SHI"
        },
        {
          "word": "環評",
          "pos": "Na"
        },
        {
          "word": "的",
          "pos": "DE"
        },
        {
          "word": "結論",
          "pos": "Na"
        },
        {
          "word": ".",
          "pos": "PERIODCATEGORY"
        }
      ],
      "entities": [
        "(4, 10, 'CARDINAL', '59,000')",
        "(14, 18, 'CARDINAL', '5.9萬')"
      ]
    }
  ]
}

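I'm not sure whether this matches what CKIP tagger originally intended.

I still can't find any documentation of the named entity recognition output format, so I'd appreciate your help. Thanks.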

jyhsu2000 commented 2 months ago

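In the NER output format, e.g. (15, 17, 'GPE', '逢甲'), the first two elements are the start and end positions at which the entity appears in the input sentence (湯姆克魯斯駕駛 SR-72 來逢甲夜市). The original project does not seem to name or define them specifically. For the exact logic, see src/api.py, _get_entity_set.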
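The examples in this thread are consistent with the start being inclusive and the end exclusive, like a Python slice. A small sketch to check that reading follows; span_matches is a hypothetical helper, and the sentence is assumed to contain half-width spaces as in the curl command above.

```python
# Input sentence from the curl example (half-width spaces assumed).
sentence = "湯姆克魯斯駕駛 SR-72 來逢甲夜市"

# Entity tuples copied from the plain-text response: (start, end, type, text).
entities = [(0, 5, "PERSON", "湯姆克魯斯"), (15, 17, "GPE", "逢甲")]


def span_matches(sentence, entity):
    """Return True if sentence[start:end] equals the entity text."""
    start, end, _entity_type, text = entity
    return sentence[start:end] == text


results = [span_matches(sentence, e) for e in entities]
print(results)
```

If the offsets were counted differently (for example, with an inclusive end), the slices would not reproduce the entity text.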


The JSON structure looks fine; the entities just need to be broken down further.
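One possible way to break the entities down is sketched here. The key names start, end, type and text are hypothetical (the thread does not settle on names), and entity_to_dict is an illustrative helper rather than code from CKIPService. It accepts either the original tuple or the stringified form shown in the JSON above.

```python
import ast
import json


def entity_to_dict(entity):
    """Turn a (start, end, type, text) tuple, or its string form, into a dict."""
    if isinstance(entity, str):
        # literal_eval parses the tuple literal without executing arbitrary code.
        entity = ast.literal_eval(entity)
    start, end, entity_type, text = entity
    # Key names below are assumptions chosen for illustration.
    return {"start": start, "end": end, "type": entity_type, "text": text}


# Stringified entities copied from the JSON output in this thread.
raw_entities = ["(4, 10, 'CARDINAL', '59,000')", "(14, 18, 'CARDINAL', '5.9萬')"]

entities = [entity_to_dict(e) for e in raw_entities]
print(json.dumps(entities, ensure_ascii=False, indent=2))
```

Converting the tuples before they are turned into strings would avoid the parsing step altogether; the string branch is included only because that is the form the current output uses.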