mindspore-ai / mindspore

MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
https://gitee.com/mindspore/mindspore
Apache License 2.0

[dataset] how to implement a simple kind of tokenizer: simple_space_split #45

Open littleDrew opened 4 years ago

littleDrew commented 4 years ago

Background

Is there an existing implementation or example code for a simple tokenizer that splits text on spaces (simple_space_split)?

qianlong21st commented 4 years ago

(1) You can use PythonTokenizer to implement this. See https://gitee.com/mindspore/mindspore/blob/master/mindspore/dataset/text/transforms.py for details, and https://gitee.com/mindspore/mindspore/blob/master/tests/ut/python/dataset/test_python_tokenizer.py for a test case that uses PythonTokenizer.

(2) You can also use WhitespaceTokenizer to split sentences on whitespace:

import mindspore.dataset as ds
import mindspore.dataset.text as nlp
from mindspore import log as logger

# DATA_FILE is the path to a text file with one sentence per line.

def test_whitespace_tokenizer():
    """
    Test WhitespaceTokenizer: split each line on whitespace.
    """
    whitespace_strs = [["Welcome", "to", "Beijing!"],
                       ["北京欢迎您!"],
                       ["我喜欢English!"],
                       [""]]
    dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
    tokenizer = nlp.WhitespaceTokenizer()
    dataset = dataset.map(operations=tokenizer)
    tokens = []
    for i in dataset.create_dict_iterator():
        # Convert the raw byte output back to Python strings.
        text = nlp.to_str(i['text']).tolist()
        tokens.append(text)
    logger.info("The output tokens are: {}".format(tokens))
    assert whitespace_strs == tokens

See https://gitee.com/mindspore/mindspore/blob/master/tests/ut/python/dataset/test_tokenizer.py for details.
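For option (1), a minimal sketch of the PythonTokenizer approach: you supply a plain-Python callable that takes one line of text and returns a list of tokens, then wrap it in PythonTokenizer inside the map. The function name simple_space_split below is just illustrative; the dataset wiring is shown in comments and assumes MindSpore is installed.

```python
def simple_space_split(line):
    """Split one sentence on runs of whitespace.

    PythonTokenizer expects a non-empty token list, so an empty or
    all-whitespace line is mapped to [""] (mirroring the WhitespaceTokenizer
    behavior in the test above).
    """
    words = line.split()
    if not words:
        return [""]
    return words

# Hypothetical pipeline wiring (names follow the test case linked above):
#   import mindspore.dataset as ds
#   import mindspore.dataset.text as text
#   dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
#   dataset = dataset.map(operations=text.PythonTokenizer(simple_space_split))
```

Because the callable is ordinary Python, you can extend it freely (lowercasing, punctuation stripping, etc.) without touching C++ operators.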