Phrase tokenized as single token

epaminond commented 5 years ago

I was trying to follow the instructions on the official website:

For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.

For me it gets tokenized as a single word:

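const kuromoji = require("kuromoji");

kuromoji.builder({ dicPath: "dict/" }).build((err, tokenizer) => {
  if (err) throw err; // e.g. dictionary files could not be loaded from dict/
  console.log(tokenizer.tokenize("関西国際空港"));
});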

// => [ { word_id: 1271160,
//        word_type: 'KNOWN',
//        word_position: 1,
//        surface_form: '関西国際空港',
//        pos: '名詞',
//        pos_detail_1: '固有名詞',
//        pos_detail_2: '組織',
//        pos_detail_3: '*',
//        conjugated_type: '*',
//        conjugated_form: '*',
//        basic_form: '関西国際空港',
//        reading: 'カンサイコクサイクウコウ',
//        pronunciation: 'カンサイコクサイクーコー' } ]

@takuyaa, is this an issue with the dictionary? Is there a way to convert the IPADIC dictionary to .dat format?

takuyaa commented 5 years ago

@epaminond Thanks for your feedback! That is not a feature of kuromoji.js (JavaScript) but of Kuromoji (Java). It is not yet supported in kuromoji.js, but it could be supported in the future.
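
Until something like this is available in kuromoji.js, one possible workaround for the search use case is to index character bigrams of each token alongside the token itself, so that a query for 空港 can still match 関西国際空港. The snippet below is only a rough sketch of that idea and is not part of kuromoji.js: `toBigrams` and `indexTerms` are hypothetical helper names, and the input is assumed to be an array of objects with a `surface_form` field, as in the output shown above.

```javascript
// Hypothetical helpers, NOT part of kuromoji.js.

// Split a string into overlapping two-character grams.
function toBigrams(text) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const grams = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}

// Build index terms from tokens shaped like kuromoji.js output:
// keep each surface form and add its bigrams.
function indexTerms(tokens) {
  const terms = new Set();
  for (const token of tokens) {
    terms.add(token.surface_form);
    for (const gram of toBigrams(token.surface_form)) {
      terms.add(gram);
    }
  }
  return terms;
}

// Assumed input: the single token from the report above.
const tokens = [{ surface_form: "関西国際空港" }];
const terms = indexTerms(tokens);
console.log(terms.has("空港")); // true
```

Note that this is not morphological decompounding: bigrams that cross word boundaries, such as 西国, are indexed too, so it trades some precision for recall.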

epaminond commented 5 years ago

OK, I've made use of https://github.com/leungwensen/tiny-segmenter for now.

takuyaa commented 5 years ago

That's a great idea for some use cases. I'll close this issue.

takuyaa / kuromoji.js

Phrase tokenized as single token #30