mymagicpower / AIAS

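Free, commercially usable, one-stop Java AI solution that lightens your workload and speeds up product development. Project categories include a Java PyTorch training engine, AI SDKs, web applications, and more, making up a collection of over 100 projects. | Artificial Intelligence Accelerator Kit. It provides: a project collection consisting of over 100 projects, including AI SDK, web applications, desktop applications, image generation,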
http://aias.top
Apache License 2.0
775 stars 264 forks

Is there a plan to implement GPT2TokenizerFast? #16

Open zjcDM opened 1 year ago

zjcDM commented 1 year ago

Already implemented.

mymagicpower commented 1 year ago

Use this approach:

1. pom configuration

    <dependency>
        <groupId>ai.djl.huggingface</groupId>
        <artifactId>tokenizers</artifactId>
        <version>0.19.0</version>
    </dependency>

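2. Example code

# Declaration
// Note: manager and MAX_LENGTH are not shown in this snippet and must be defined elsewhere.
private static final HuggingFaceTokenizer tokenizer;
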
static {
    try {
        tokenizer =
                HuggingFaceTokenizer.builder()
                        .optManager(manager)
                        .optPadding(true)
                        .optPadToMaxLength()
                        .optMaxLength(MAX_LENGTH)
                        .optTruncation(true)
                        .optTokenizerName("openai/clip-vit-large-patch14")
                        .build();
        // sentence-transformers/msmarco-distilbert-dot-v5
        // openai/clip-vit-large-patch14
        // https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5
        // https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/tokenizer/tokenizer_config.json
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

# Usage
List<String> tokens = tokenizer.tokenize(prompt);
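
The builder options above (optPadding, optPadToMaxLength, optMaxLength, optTruncation) are there so that every encoded prompt comes out with the same length. A rough plain-Java sketch of that idea follows. It assumes right-side padding and ignores special tokens, and `PadTruncateSketch`, `padOrTruncate`, and the pad id `0` are hypothetical names and values chosen for illustration; this is not DJL's API or its actual implementation.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of DJL.
class PadTruncateSketch {

    // Cut the id sequence to maxLength, or right-pad it with padId up to maxLength.
    static long[] padOrTruncate(long[] ids, int maxLength, long padId) {
        long[] out = new long[maxLength];
        Arrays.fill(out, padId);
        // Copy at most maxLength ids; anything beyond that is dropped (truncation).
        System.arraycopy(ids, 0, out, 0, Math.min(ids.length, maxLength));
        return out;
    }

    public static void main(String[] args) {
        long[] shortIds = {5, 6, 7};
        long[] longIds = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(padOrTruncate(shortIds, 5, 0L))); // [5, 6, 7, 0, 0]
        System.out.println(Arrays.toString(padOrTruncate(longIds, 4, 0L)));  // [1, 2, 3, 4]
    }
}
```

A real tokenizer would typically also keep start and end tokens when truncating and use the model's own pad token id, so treat this only as a picture of the length handling.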

mymagicpower commented 1 year ago

https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/README.md

zjcDM commented 1 year ago

https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/README.md

Hello, it seems this does not support a custom vocabulary?