purecloudlabs / roberta-tokenizer

MIT License

RoBERTa Java Tokenizer

About


This repo contains a Java tokenizer for the RoBERTa model. The implementation mainly follows the HuggingFace Python RoBERTa tokenizer, but we also took references from other implementations, as mentioned in the code and below:

The algorithm used is byte-level Byte Pair Encoding (BPE).

https://huggingface.co/docs/transformers/tokenizer_summary#bytelevel-bpe
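The core of BPE is a greedy merge loop: the tokenizer scans the current symbol sequence for the adjacent pair with the best (lowest) rank in the merges table, fuses it, and repeats until no ranked pair remains. The sketch below is illustrative only; the class and method names are hypothetical and not this library's actual API.

```java
import java.util.*;

// Minimal sketch of the BPE merge loop (illustrative names, not the
// library's actual API): repeatedly merge the adjacent pair with the
// lowest rank in the merges table until no ranked pair remains.
public class BpeSketch {
    static List<String> bpe(List<String> symbols, Map<String, Integer> ranks) {
        List<String> word = new ArrayList<>(symbols);
        while (word.size() > 1) {
            int bestRank = Integer.MAX_VALUE, bestIdx = -1;
            for (int i = 0; i < word.size() - 1; i++) {
                Integer r = ranks.get(word.get(i) + " " + word.get(i + 1));
                if (r != null && r < bestRank) { bestRank = r; bestIdx = i; }
            }
            if (bestIdx < 0) break;                      // no merge applies
            String merged = word.get(bestIdx) + word.get(bestIdx + 1);
            word.set(bestIdx, merged);
            word.remove(bestIdx + 1);
        }
        return word;
    }

    public static void main(String[] args) {
        // Toy merges table: rank = merge priority (lower merges first)
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("l o", 0);
        ranks.put("lo w", 1);
        System.out.println(bpe(Arrays.asList("l", "o", "w"), ranks)); // [low]
    }
}
```

In the real tokenizer, the ranks come from the merges file and the symbols start as bytes, which is what makes the encoding "byte-level".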

How do I get set up?


<dependency>
    <groupId>cloud.genesys</groupId>
    <artifactId>roberta-tokenizer</artifactId>
    <version>1.0.7</version>
</dependency>

<distributionManagement>
    <repository>
      <id>ossrh</id>
      <url>https://s01.oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
    ...
</distributionManagement>
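If you build with Gradle instead of Maven, the same coordinates should resolve from Maven Central. This snippet is derived from the Maven coordinates above, not taken from the repo itself:

```groovy
dependencies {
    implementation 'cloud.genesys:roberta-tokenizer:1.0.7'
}
```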

Tests


File Dependencies


Since we want the tokenizer to initialize efficiently, we use a factory that creates the relevant resource files lazily.

For this tokenizer we need 3 data files:

Please note:

  1. All three files must be under the same directory.

  2. They must be named as mentioned above.

  3. The result of the tokenization depends on the vocabulary and merges files.
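The lazy creation mentioned above can be sketched with a memoized supplier: the expensive file load runs only on first access, and subsequent calls reuse the cached result. This is a hypothetical illustration of the pattern, not the library's actual factory class.

```java
import java.util.function.Supplier;

// Illustrative sketch (hypothetical class, not the library's actual API)
// of lazy resource creation: the loader runs once, on first get(),
// and the result is cached for all later calls.
public class LazyResource<T> {
    private final Supplier<T> loader;
    private volatile T value;

    public LazyResource(Supplier<T> loader) { this.loader = loader; }

    public T get() {
        T v = value;
        if (v == null) {
            synchronized (this) {
                if (value == null) value = loader.get(); // load exactly once
                v = value;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        LazyResource<String> vocab = new LazyResource<>(() -> {
            System.out.println("loading vocabulary...");  // runs only once
            return "vocab-data";
        });
        // Nothing is loaded until the first get(); the second reuses the cache.
        System.out.println(vocab.get());
        System.out.println(vocab.get());
    }
}
```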

Example



String baseDirPath = "base/dir/path";
RobertaTokenizerResources robertaResources = new RobertaTokenizerResources(baseDirPath);
Tokenizer robertaTokenizer = new RobertaTokenizer(robertaResources);
...
String sentence = "this must be the place";
long[] tokenizedSentence = robertaTokenizer.tokenize(sentence);
// Use Arrays.toString: printing a long[] directly only prints its reference
System.out.println(Arrays.toString(tokenizedSentence));

An example output would be: [0, 9226, 531, 28, 5, 317, 2]. The exact IDs depend on the given vocabulary and merges files; the surrounding 0 and 2 are RoBERTa's <s> and </s> special tokens.

Contribution guidelines