jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.93k stars 2.22k forks

πŸ€” Is it possible to configure multilanguage search? #2619

Closed · hmmhmmhm closed this issue 3 years ago

hmmhmmhm commented 3 years ago

Can it be used in non-English-speaking countries?

For example, can artificial intelligence search be performed on Korean data?

I wonder whether a separate tokenizer is needed for multilingual support and, if so, how it can be configured.

davidbp commented 3 years ago

Yes, it can be used with any type of data you want.

Note that a Jina Document can store any string in its text attribute (which might be Korean, Spanish, English...). As you mention, a language-specific tokenizer is important, and you would use it as a preprocessing step in most sentence-to-vector pipelines.
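For instance, a minimal sketch (assuming the Jina 2.x Document API that was current when this issue was opened):

from jina import Document

# the text attribute can hold a string in any language; the tokenizer only
# matters later, when the text is turned into a vector
doc = Document(text=u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.')
print(doc.text)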

Let me show an example of probably the simplest vector representation for search: TF-IDF.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

from jina import Executor, requests, DocumentArray

class TFIDFTextEncoder(Executor):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # fit the vectorizer once at startup; here you should load a Korean corpus instead
        dataset = fetch_20newsgroups()
        # korean_tokenizer is defined below
        tfidf_vectorizer = TfidfVectorizer(tokenizer=korean_tokenizer)
        tfidf_vectorizer.fit(dataset.data)
        self.tfidf_vectorizer = tfidf_vectorizer

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        iterable_of_texts = docs.get_attributes('text')
        # transform returns one TF-IDF row vector per input text
        embedding_matrix = self.tfidf_vectorizer.transform(iterable_of_texts)

        for i, doc in enumerate(docs):
            doc.embedding = embedding_matrix[i]

from konlpy.tag import Kkma

k = Kkma()

def korean_tokenizer(text):
    # split Korean text into morphemes with KoNLPy's Kkma POS tagger
    tokens = [x[0] for x in k.pos(text)]
    return tokens

def naive_tokenizer(text):
    # whitespace splitting, which is too coarse for Korean
    return text.split(' ')

text = u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.'
korean_tokenizer(text)
naive_tokenizer(text)

# korean_tokenizer should return
# ['λ„€', ',', 'μ•ˆλ…•', 'ν•˜', 'μ„Έμš”', '.', 'λ°˜κ°‘', 'μŠ΅λ‹ˆλ‹€', '.']

# naive_tokenizer should return
# ['λ„€,', 'μ•ˆλ…•ν•˜μ„Έμš”.', 'λ°˜κ°‘μŠ΅λ‹ˆλ‹€.']
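Putting the two snippets together, a quick usage sketch (assuming the Jina 2.x Flow API from around the time of this issue; the endpoint name is just an example):

from jina import Flow, Document

# plug the TF-IDF encoder defined above into a Flow
f = Flow().add(uses=TFIDFTextEncoder)

with f:
    # after this call every Document carries a TF-IDF embedding
    f.post(on='/index', inputs=[Document(text=u'λ„€, μ•ˆλ…•ν•˜μ„Έμš”. λ°˜κ°‘μŠ΅λ‹ˆλ‹€.')])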

If, instead of TfidfVectorizer, you want to use an embedding method that handles Korean tokenization itself, such as KoSentenceBert, you are free to do so. You would just need to wrap an embedder = SentenceTransformer() in a class KoreanTextEncoder(Executor) whose encode method calls embedder.encode().
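A minimal sketch of that wrapper, assuming the sentence-transformers package (the model name below is a placeholder, not a specific released checkpoint):

from jina import Executor, requests, DocumentArray
from sentence_transformers import SentenceTransformer

class KoreanTextEncoder(Executor):

    def __init__(self, model_name: str = 'a-korean-sentence-bert-model', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # placeholder model name: substitute any Korean-capable checkpoint
        # that SentenceTransformer can load, e.g. a KoSentenceBert model
        self.embedder = SentenceTransformer(model_name)

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        texts = docs.get_attributes('text')
        # the model handles Korean tokenization internally
        embeddings = self.embedder.encode(texts)
        for doc, embedding in zip(docs, embeddings):
            doc.embedding = embedding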

hmmhmmhm commented 3 years ago

@davidbp Ooooohhhh..!!!! I didn't expect to get such detailed code, thank you so much!! πŸ‘πŸ»

CatStark commented 3 years ago

Good to hear it solved your question @hmmhmmhm, I'll close this issue then.