typesense / typesense

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
https://typesense.org
GNU General Public License v3.0

Question: Japanese Support? #86

Status: Open. ekianjo opened this issue 4 years ago.

ekianjo commented 4 years ago

Description

This is not an issue, but since this does not seem to be addressed in the documentation or guides, I would like to know whether Japanese is supported, and whether using a different language would cause issues with search.

kishorenc commented 4 years ago

@ekianjo You can index non-ASCII text and search on it without any issues. Here's an example from the tests involving some Tamil, which happens to be my mother tongue.

Of course, typo correction is not going to work, since it assumes an edit-distance model that might not be suitable for Japanese (correct me if I am wrong). If there is something we can do to help here, I would be interested to hear it.

ekianjo commented 4 years ago

Thanks - how do you calculate the edit distance at the moment? (Sorry, I have not checked the code yet.) Also, what may be relevant for Japanese is that you can write things in hiragana, katakana, and kanji - and usually you'd want the search to recognize that a search term could be written in different scripts while having the same meaning. This is probably more complex to handle...

kishorenc commented 4 years ago

@ekianjo For edit distance we calculate the Damerau–Levenshtein distance between the tokens in a query and the indexed tokens from the records.
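
For reference, here's a minimal sketch of the restricted Damerau–Levenshtein (optimal string alignment) distance in Node.js -- just an illustration of the general algorithm, not Typesense's actual implementation:

// Restricted Damerau–Levenshtein (optimal string alignment) distance:
// edits are insertion, deletion, substitution, and adjacent transposition.
function damerauLevenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 },
                       () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,         // deletion
        d[i][j - 1] + 1,         // insertion
        d[i - 1][j - 1] + cost   // substitution
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(damerauLevenshtein('search', 'saerch')); // 1 (one transposition)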

usually you'd want the search to recognize that a search term could be written in different characters while having the same meaning

I will do some research on how we can handle that.

tkanq commented 3 years ago

I tested Japanese search, and it turns out that it's not working.

I don't have any knowledge of full-text search, but I guess that because Japanese and other languages like Chinese and Korean don't have spaces between meaningful words, the text needs to be segmented first (which requires an implementation of morphological analysis, I guess...).

Here is an example of a morphological analysis library for the Japanese language: https://github.com/WorksApplications/Sudachi

kishorenc commented 3 years ago

@tkanq

We have since added support for segmentation of Japanese text in our RC builds. Here's a quick snippet on handling Japanese text in the latest Docker RC build typesense/typesense:0.21.0.rc13:

curl -k "http://localhost:8108/collections" -X POST -H "Content-Type: application/json" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
        "name": "titles", "num_memory_shards": 4,
        "fields": [
          {"name": "title", "type": "string", "locale": "ja" },
          {"name": "points", "type": "int32" }
        ],
        "default_sorting_field": "points"
      }'

# Index a document

curl "http://localhost:8108/collections/titles/documents" -X POST \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{"points":113,"title":"ア退屈であ", "id": "0"}'

# Search for it

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections/titles/documents/search/\?q\=屈治\&query_by\=title

This should work -- can you please try it out and let me know? We also support conversion between hiragana/kanji -- so you can index text in either form and query in either form.
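
For illustration, here's the same round trip via the typesense-js client -- a sketch only, and the hiragana reading たいくつ for 退屈 is my assumption for demonstration purposes:

const Typesense = require('typesense')

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: '8108', protocol: 'http' }],
  apiKey: process.env.TYPESENSE_API_KEY,
  connectionTimeoutSeconds: 2
})

// The document was indexed with the kanji title "ア退屈であ"; if the
// hiragana/kanji conversion works as described, searching with the
// hiragana reading should find the same document.
client.collections('titles').documents().search({
  q: 'たいくつ',     // hiragana reading of 退屈 (illustrative assumption)
  query_by: 'title'
}).then(res => console.log(res.hits.map(h => h.document.title)))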

I will be happy to work with you to fix any issues you might encounter.

tkanq commented 3 years ago

@kishorenc I'm really happy to hear that! I'll try it later and get back to you soon.

tkanq commented 3 years ago

I did a small test that just creates a few documents and runs a simple search, and found that the queried word and the returned "matched_tokens" are not the same or even close.

"q": "配管" vs

          "matched_tokens": [
            "",
            "品多"
          ],

Here is my code in node.js and the result:

const Ts = require('typesense')

const client = new Ts.Client({
  nodes: [{
    host: 'localhost',
    port: '8108',
    protocol: 'http'
  }],
  apiKey: 'xyz',
  connectionTimeoutSeconds: 2
})

const test = async function(){
  await client.collections().create({
    name: "item",
    fields: [
      { name: 'title', type: 'string', locale: 'ja'},
      { name: 'description', type: 'string', locale: 'ja'},
      { name: 'registeredAt', type: 'int64'},
    ],
    default_sorting_field: 'registeredAt'
  }).then(response => {
    console.log(`Collection schema successfully registered: \n${JSON.stringify(response)}`);
  })

  await client.collections('item').documents().create({
    title: '電動ドリル/電動ドライバー/インパクトドライバーの新品だよ',
    description: '2020年の後半に買いました。\nまだまだ新品に近いですし、完動品なので、是非誰かに使ってもらいたいです。',
    registeredAt: 20210601110203,
  }).then(response => {
    console.log(`\n\nDocument successfully registered: \n${JSON.stringify(response)}`);
  })

  await client.collections('item').documents().create({
    title: '腰道具一式 配管工用 新古品多数',
    description: '前の仕事で使っていたものですが、職業が変わったので出品します。これから配管工になる方にオススメです!',
    registeredAt: 20210601110203,
  }).then(response => {
    console.log(`\n\nDocument successfully registered: \n${JSON.stringify(response)}`);
  })

  // Search with typo tolerance disabled
  await client.collections('item').documents().search({
    q: '配管',
    query_by: 'title',
    sort_by: 'registeredAt:desc',
    prefix: true,
    num_typos: 0,
    typo_tokens_threshold: 0,
    drop_tokens_threshold: 0
  }).then(response => {
    console.log(`\n\nresponse: ${JSON.stringify(response, null, 2)}`);
  })

}
test()

Output:

Collection schema successfully registered:
{"created_at":1623194687,"default_sorting_field":"registeredAt","fields":[{"facet":false,"index":true,"name":"title","optional":false,"type":"string"},{"facet":false,"index":true,"name":"description","optional":false,"type":"string"},{"facet":false,"index":true,"name":"registeredAt","optional":false,"type":"int64"}],"name":"item","num_documents":0,"num_memory_shards":4}

Document successfully registered:
{"description":"2020年の後半に買いました。\nまだまだ新品に近いですし、完動品なので、是非誰かに使ってもらいたいです。","id":"0","registeredAt":20210601110203,"title":"電動ドリル/電動ドライバー/インパクトドライバーの新品だよ"}

Document successfully registered:
{"description":"前の仕事で使っていたものですが、職業が変わったので出品します。これから配管工になる方にオススメです!","id":"1","registeredAt":20210601110203,"title":"腰道具一式 配管工用 新古品多数"}
data: {
  "facet_counts": [],
  "found": 1,
  "hits": [
    {
      "document": {
        "description": "前の仕事で使っていたものですが、職業が変わったので出品します。これから配管工になる方にオススメです!",        
        "id": "1",
        "registeredAt": 20210601110203,
        "title": "腰道具一式 配管工用 新古品多数"
      },
      "highlights": [
        {
          "field": "title",
          "matched_tokens": [
            "",
            "品多"
          ],
          "snippet": "腰道具一式 配管工用 <mark></mark><mark>品多</mark>"
        }
      ],
      "text_match": 50291456
    }
  ],
  "out_of": 2,
  "page": 1,
  "request_params": {
    "collection_name": "item",
    "per_page": 10,
    "q": "配管"
  },
  "search_time_ms": 4
}

Also, using a website that shows how text is tokenized by MeCab/Kakasi/ChaSen, I threw in the same "title" of the second item above, and it turned out that the result from Kakasi on that website and the one I got from Typesense (as "matched_tokens") are slightly different. (This may not be relevant to the problem, though.)

kishorenc commented 3 years ago

@tkanq Thank you for reporting the issues you faced. I'm also using Kakasi for tokenization. The problem I'm facing is dealing with Japanese text that mixes writing systems (e.g. a mix of Kanji and Hiragana characters). Currently, Typesense normalizes the text and indexes it in Hiragana. When the query uses Kanji characters, as in this case, Typesense tries to highlight those characters in the normalized text but fails to find them, since they are indexed in Hiragana.

I will have to map the normalized query back to original form for highlighting.
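
One possible shape for that mapping, sketched with a hypothetical tokenize step that yields surface/reading pairs (e.g. from Kakasi) -- this is not Typesense's internal API:

// Hypothetical: map each normalized (hiragana) reading back to the
// surface form it came from, so a hiragana-normalized query token can
// be highlighted against the original mixed-script text.
function buildHighlightMap(tokens) {
  const map = new Map();
  for (const { surface, reading } of tokens) {
    map.set(reading, surface);
  }
  return map;
}

// Illustrative surface/reading pairs -- assumptions for the sketch.
const tokens = [
  { surface: '配管', reading: 'はいかん' },
  { surface: '工用', reading: 'こうよう' },
];
console.log(buildHighlightMap(tokens).get('はいかん')); // '配管' -> wrap in <mark>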

tkanq commented 3 years ago

@kishorenc My use case is not auto-completion but just a simple search of items, and in this case we don't need to index all the characters in Hiragana.

But for auto-completion purposes, it would be better to have the characters converted into Hiragana or Romaji, because:

On PC: in most cases, we type Romaji first to write Hiragana, then we (or the IME) convert it into a mixture of Kanji/Hiragana/Katakana.

On mobile phones: in most cases, we type Hiragana directly, then we (or the IME) convert it into a mixture of Kanji/Hiragana/Katakana.

So maybe an option for choosing which form gets indexed (mixed / Hiragana only / Romaji only) would be nicer.

ueda19850603 commented 3 years ago

Hello. I am Japanese, and I am using Typesense. My main use case is supplementing Firestore with the full-text search it lacks.

I can't search Japanese well with Typesense. It doesn't seem to be able to search correctly because Japanese doesn't separate words with spaces the way English does. Since the text can be searched when it is separated by spaces, I decomposed it into overlapping two-character chunks and saved that. Example:

Before chopping: 今日はいい天気です。

After chopping: 今日 日は はい いい い天 天気 気で です す。

This technique is called a 'bigram'.

Store the bigram-decomposed characters in Typesense, and when searching, also bigram the keyword entered by the user. Search accuracy may be low because this completely ignores context, but you can instantly find matching documents in long sentences. This level of accuracy works fine in my use case.
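
For concreteness, here's a minimal sketch of that bigram decomposition in Node.js (my own illustration, not code from the blog post below):

// Split text into overlapping character 2-grams, joined by spaces so
// Typesense can tokenize them like space-separated words.
function toBigrams(text) {
  const chars = [...text];          // spread keeps multi-byte chars intact
  if (chars.length < 2) return text;
  const grams = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams.join(' ');
}

console.log(toBigrams('今日はいい天気です。'));
// => '今日 日は はい いい い天 天気 気で です す。'

Apply toBigrams() to each field value before indexing, and to the user's keyword before searching.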

If you read Japanese, please refer to the Typesense blog post I wrote: https://nipo.sndbox.jp/develop-blog/typesense

tkanq commented 3 years ago

@ueda19850603 I'm leaving this article here for anyone else considering Firestore + bigram search: https://qiita.com/oukayuka/items/d3cee72501a55e8be44a

R0ci0-V3rd3al commented 3 years ago

It does not work.