keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

how to use TextVectorization to vectorize Chinese text #14225

Closed baiziyuandyufei closed 1 year ago

baiziyuandyufei commented 4 years ago

In Chinese text there is no whitespace between words, so when I use TextVectorization.adapt(train_dataset) I can only get a sentence-level vocabulary. The code I used is https://keras.io/examples/nlp/text_classification_from_scratch/

cherry247 commented 4 years ago

The Jieba library in Python is built for Chinese word segmentation. You can install it with pip install jieba. Refer to the jieba documentation for more information on how to use it and its methods.
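
For reference, a minimal sketch of what jieba segmentation looks like (not part of the original reply; the sample sentence is taken from later in this thread):

import jieba

# jieba.cut returns a generator of word tokens for a Chinese sentence.
words = list(jieba.cut("一出差就去宾馆happy了"))
print(words)            # roughly something like ['一', '出差', '就', '去', '宾馆', 'happy', '了']
print(" ".join(words))  # the space-joined form that a whitespace-based split can tokenize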

baiziyuandyufei commented 4 years ago

Which function's parameter can accept jieba.cut() directly? custom_standardization? In that function I can only insert spaces between words and then return a new string. Is there a parameter that can accept jieba.cut() directly?

baiziyuandyufei commented 4 years ago

Which function's parameter can accept jieba.segment()? custom_standardization?

baiziyuandyufei commented 4 years ago

@cherry247 As far as I know, there is no library that supports segmenting tensor data; maybe HanLP 2.x does. I want to use the 'split' parameter to segment the text. The demo code looks like this:

# coding:utf-8
import os

import tensorflow as tf
from pyhanlp.static import HANLP_DATA_PATH
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from pyhanlp import *

# path to the ChnSentiCorp sentiment corpus under HanLP's data directory
chn_senti_corp = os.path.join(HANLP_DATA_PATH, r'train/ChnSentiCorp')
batch_size = 32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    chn_senti_corp,
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    chn_senti_corp,
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
print("Number of batches in raw_train_ds: %d" % tf.data.experimental.cardinality(raw_train_ds))
print("Number of batches in raw_val_ds: %d" % tf.data.experimental.cardinality(raw_val_ds))

def preprocess(text_li):
    # Character-level split: break each string into individual Unicode characters.
    return tf.strings.unicode_split(text_li, input_encoding='UTF-8', errors="ignore")

max_features = 20000
vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode="int",
    split=preprocess,
)

text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
print("vocabulary top 50", vectorize_layer.get_vocabulary()[:50])
print("vocabulary size", len(vectorize_layer.get_vocabulary()))

integer_data = vectorize_layer([["一出差就去宾馆happy了"]])
print(integer_data)

The run output is:

vocabulary top 50 ['', '[UNK]', ',', '\r', '的', '\n', '。', '不', '是', '了', '房', '一', '店', '有', '我', '酒', '间', '很', '住', '还', ' ', '服', '务', '在', '好', '到', '个', '没', '人', '这', '上', '也', '就', '0', '大', '!', '要', '来', '点', '以', '们', '2', '说', '可', '1', '时', '都', '去', '小', '后']
vocabulary size 3255
tf.Tensor([[ 11  80  58  32  47 111 123 336 412 514 514 637   9]], shape=(1, 13), dtype=int64)

I don't know how to apply jieba through the 'split' parameter, or whether there is some other way to use jieba.
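
One possible workaround (a hedged sketch, not from the original thread): run jieba inside tf.py_function as a tf.data map step, so the text that reaches TextVectorization is already space-separated and the layer's default whitespace split can be used. It reuses raw_train_ds, max_features, and the TextVectorization import from the demo above; _jieba_join and segment_map are hypothetical helper names.

import jieba
import tensorflow as tf

def _jieba_join(texts):
    # texts is an eager string tensor of shape (batch,); decode each element,
    # segment it with jieba, and join the tokens back together with spaces.
    return tf.constant(
        [" ".join(jieba.cut(t.decode("utf-8"))) for t in texts.numpy()]
    )

def segment_map(texts, labels):
    segmented = tf.py_function(_jieba_join, [texts], tf.string)
    segmented.set_shape([None])  # py_function drops the static (batch,) shape
    return segmented, labels

segmented_train_ds = raw_train_ds.map(segment_map)

vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode="int",  # default standardize/split: lowercase, strip punctuation, split on whitespace
)
vectorize_layer.adapt(segmented_train_ds.map(lambda x, y: x))

Because the segmentation happens in the data pipeline rather than inside the layer, the same map must also be applied to the validation and test datasets, but the layer itself keeps its serializable default standardize/split settings.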

heart4lor commented 3 years ago

Hi, have you solved the problem yet? I'm facing the same issue, thanks!

sushreebarsa commented 2 years ago

@baiziyuandyufei Is this still an issue? I tried to replicate it and faced a different error; please find the gist here. Thanks!

baiziyuandyufei commented 2 years ago

@sushreebarsa The complete dataset was not downloaded. https://colab.research.google.com/gist/baiziyuandyufei/56680558c4ec263eef2f41e488d083e3/14225.ipynb

baiziyuandyufei commented 2 years ago

@sushreebarsa TF does not seem to provide a place to plug a Chinese word segmenter into the layer. Just do the segmentation directly in

# From the text_classification_from_scratch example (requires re, string, and tensorflow imports).
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )

this function: segment the text and then join the tokens back into a space-separated string. Alternatively, just use character-level input; basically all deep learning models for Chinese are based on character input anyway.
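
If you prefer to keep everything inside the layer, that suggestion can be sketched roughly as below (an assumption-laden sketch, not an official recipe: jieba_standardize is a hypothetical name, and tf.py_function inside the layer will not survive SavedModel export, so pre-segmenting the dataset as shown earlier in this thread is usually the safer route).

import re
import string

import jieba
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

def jieba_standardize(input_data):
    # Segment each string with jieba and re-join the tokens with spaces so
    # that the layer's default whitespace split can tokenize them afterwards.
    def _segment(batch):
        flat = batch.numpy().reshape(-1)
        return tf.constant([" ".join(jieba.cut(s.decode("utf-8"))) for s in flat])

    segmented = tf.py_function(_segment, [input_data], tf.string)
    segmented.set_shape([None])  # py_function loses static shape information
    lowercase = tf.strings.lower(segmented)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(string.punctuation)}]", ""
    )

vectorize_layer = TextVectorization(
    max_tokens=20000,
    standardize=jieba_standardize,
    split="whitespace",
    output_mode="int",
)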

sampathweb commented 1 year ago

@baiziyuandyufei - For Chinese text tokenization, a regex or TextVectorization alone is not going to help. Others have created open-source packages for this that you could look at and adopt for your situation. One such package is https://github.com/yishn/chinese-tokenizer, but it has not been maintained recently, so you may find others that are more up to date. Closing this issue as no action needs to be taken.
