Yes, it can be used with any type of data you want. Note that a Jina `Document` can store any string (which might be Korean, Spanish, English, ...) in its `text` attribute. As you mention, having a language-specific tokenizer is important, and you would use it as a preprocessing step in most sentence-to-vector pipelines.
Let me show an example with probably the simplest vector representation for search, TF-IDF:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

from jina import Executor, requests, DocumentArray


class TFIDFTextEncoder(Executor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        dataset = fetch_20newsgroups()  # here you should load a Korean corpus instead
        tfidf_vectorizer = TfidfVectorizer(tokenizer=korean_tokenizer)
        tfidf_vectorizer.fit(dataset.data)
        self.tfidf_vectorizer = tfidf_vectorizer

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        iterable_of_texts = docs.get_attributes('text')
        embedding_matrix = self.tfidf_vectorizer.transform(iterable_of_texts)
        for i, doc in enumerate(docs):
            doc.embedding = embedding_matrix[i]
```
```python
from konlpy.tag import Kkma

k = Kkma()

def korean_tokenizer(text):
    # keep only the surface form of each (token, POS tag) pair
    tokens = [x[0] for x in k.pos(text)]
    return tokens

def naive_tokenizer(text):
    return text.split(' ')

text = u'네, 안녕하세요. 반갑습니다.'

korean_tokenizer(text)
# korean_tokenizer should return:
# ['네', ',', '안녕', '하', '세요', '.', '반갑', '습니다', '.']

naive_tokenizer(text)
# naive_tokenizer should return:
# ['네,', '안녕하세요.', '반갑습니다.']
```
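In case it helps, here is a minimal usage sketch of how such an Executor could be plugged into a Flow (assuming the Jina 2.x Flow API; the `/index` endpoint name is just an illustration):

```python
from jina import Flow, Document

# minimal usage sketch, assuming the Jina 2.x Flow API
f = Flow().add(uses=TFIDFTextEncoder)

with f:
    # after this call each Document's .embedding is filled by the encoder above
    f.post(on='/index', inputs=[Document(text='안녕하세요. 반갑습니다.')])
```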
If instead of `TfidfVectorizer` you want to use an embedding method that handles the Korean tokenization itself, such as KoSentenceBert, you are free to use it. You would just need to wrap an `embedder = SentenceTransformer()` in a class `KoreanTextEncoder(Executor)` with an `encode` method that calls `embedder.encode()`.
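For reference, a minimal sketch of that wrapper could look like this (the model name passed to `SentenceTransformer` is a placeholder, not a specific recommendation; any Korean sentence-transformers checkpoint would do):

```python
from jina import Executor, requests, DocumentArray
from sentence_transformers import SentenceTransformer

class KoreanTextEncoder(Executor):
    def __init__(self, model_name: str = 'a-korean-sbert-model', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # model_name is a placeholder; pass the Korean model you want to use
        self.embedder = SentenceTransformer(model_name)

    @requests
    def encode(self, docs: DocumentArray, *args, **kwargs):
        texts = docs.get_attributes('text')
        # SentenceTransformer.encode returns one dense vector per input text
        embeddings = self.embedder.encode(texts)
        for doc, embedding in zip(docs, embeddings):
            doc.embedding = embedding
```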
@davidbp Ooooohhhh..!!!! I didn't expect to get this detailed code, but thank you so much!! 🙏🏻
Good to hear it solved your question @hmmhmmhm, I'll close this issue then.
Can it be serviced in non-English speaking countries?
For example, can artificial intelligence search be performed on Korean data?
I wonder if a tokenizer is needed separately for multilingual support, and how it can be configured if necessary.