jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.93k stars 2.22k forks

πŸ€” Is it possible to configure multilanguage search? #2619

Closed · hmmhmmhm closed this issue 3 years ago

hmmhmmhm commented 3 years ago

Can it be used in non-English-speaking countries?

For example, can artificial intelligence search be performed on Korean data?

I wonder whether a separate tokenizer is needed for multilingual support and, if so, how it can be configured.

davidbp commented 3 years ago

Yes, it can be used with any type of data you want.

Note that a Jina Document can store any string in its text attribute (which might be Korean, Spanish, English...). As you mention, a language-specific tokenizer is important, and you would use it as a preprocessing step in most sentence-to-vector pipelines.
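For instance, a minimal sketch (assuming the Jina 2.x Document API that was current when this issue was opened):

from jina import Document

# the text attribute can hold a string in any language; the tokenizer only
# matters later, when the text is turned into a vector
doc = Document(text=u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.')
print(doc.text)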

Let me show an example of probably the simplest vector representation for search: TF-IDF.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

from jina import Executor, requests, DocumentArray

class TFIDFTextEncoder(Executor):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # fit the vectorizer once at startup; here you should load a Korean corpus instead
        dataset = fetch_20newsgroups()
        # korean_tokenizer is defined below
        tfidf_vectorizer = TfidfVectorizer(tokenizer=korean_tokenizer)
        tfidf_vectorizer.fit(dataset.data)
        self.tfidf_vectorizer = tfidf_vectorizer

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        iterable_of_texts = docs.get_attributes('text')
        # transform returns one TF-IDF row vector per input text
        embedding_matrix = self.tfidf_vectorizer.transform(iterable_of_texts)

        for i, doc in enumerate(docs):
            doc.embedding = embedding_matrix[i]

from konlpy.tag import Kkma

k = Kkma()

def korean_tokenizer(text):
    # split Korean text into morphemes with KoNLPy's Kkma POS tagger
    tokens = [x[0] for x in k.pos(text)]
    return tokens

def naive_tokenizer(text):
    # whitespace splitting, which is too coarse for Korean
    return text.split(' ')

text = u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.'
korean_tokenizer(text)
naive_tokenizer(text)

# korean_tokenizer should return
# ['λ„€', ',', 'μ•ˆλ…•', 'ν•˜', 'μ„Έμš”', '.', 'λ°˜κ°‘', 'μŠ΅λ‹ˆλ‹€', '.']

# naive_tokenizer should return
# ['λ„€,', 'μ•ˆλ…•ν•˜μ„Έμš”.', 'λ°˜κ°‘μŠ΅λ‹ˆλ‹€.']
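Putting the two snippets together, a quick usage sketch (assuming the Jina 2.x Flow API from around the time of this issue; the endpoint name is just an example):

from jina import Flow, Document

# plug the TF-IDF encoder defined above into a Flow
f = Flow().add(uses=TFIDFTextEncoder)

with f:
    # after this call every Document carries a TF-IDF embedding
    f.post(on='/index', inputs=[Document(text=u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.')])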

If, instead of TfidfVectorizer, you want to use an embedding method that handles Korean tokenization itself, such as KoSentenceBert, you are free to do so. You would just need to wrap an embedder = SentenceTransformer() in a class KoreanTextEncoder(Executor) whose encode method calls embedder.encode().
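A minimal sketch of that wrapper, assuming the sentence-transformers package (the model name below is a placeholder, not a specific released checkpoint):

from jina import Executor, requests, DocumentArray
from sentence_transformers import SentenceTransformer

class KoreanTextEncoder(Executor):

    def __init__(self, model_name: str = 'a-korean-sentence-bert-model', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # placeholder model name: substitute any Korean-capable checkpoint
        # that SentenceTransformer can load, e.g. a KoSentenceBert model
        self.embedder = SentenceTransformer(model_name)

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        texts = docs.get_attributes('text')
        # the model handles Korean tokenization internally
        embeddings = self.embedder.encode(texts)
        for doc, embedding in zip(docs, embeddings):
            doc.embedding = embedding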

hmmhmmhm commented 3 years ago

@davidbp Ooooohhhh..!!!! I didn't expect to get such detailed code, thank you so much!! πŸ‘πŸ»

CatStark commented 3 years ago

Good to hear it solved your question @hmmhmmhm, I'll close this issue then.