huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Train tokenizer on integer lists, not strings #1471

Closed rteehas closed 2 months ago

rteehas commented 6 months ago

Hi,

I was hoping to train a BPE tokenizer, but in my case I have lists of integers rather than strings. I'd essentially like to apply the merging rules to adjacent integers in these lists, rather than to subword characters. Is there a straightforward way to do this? The current setup seems to require strings.

Natooz commented 6 months ago

Bumping this, as it would make the lib easier and more straightforward to use for modalities other than text, e.g. molecules, DNA, or music.

In MidiTok we basically map each integer to a byte to "bypass" this limitation. But this is not straightforward and adds overhead.

Edit: it also only scales up to the number of Unicode code points.

lkurlandski commented 6 months ago

Second this. I'm training tokenizers on malware bytes. At the moment, I have to map the bytes to UTF-8 characters before sending them through the tokenizers library. The tokenizers should work on any sequence, not just strings.

Natooz commented 5 months ago

@Narsil @ArthurZucker how difficult do you estimate this?

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 3 months ago

This sounds kinda fun. The merges are actually already done on chars, so it's basically ints; the issue might be memory, but we can still do it. A hack in the meantime is to convert your ints to chars and train that way.
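
Something along these lines should already work with the current API. Rough, untested sketch; the offset, helper name, and toy sequences are just examples, not part of the lib:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

OFFSET = 0xE000  # start of the Unicode private use area; any unused range works

def ids_to_string(ids):
    # One character per integer ID, so the BPE merges act on adjacent IDs.
    return "".join(chr(i + OFFSET) for i in ids)

# Toy integer "documents"; in practice these would be your ID sequences.
sequences = [[5, 7, 7, 3, 2], [7, 7, 3, 2, 1], [5, 7, 7, 3, 1]]

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=300, show_progress=False)
tokenizer.train_from_iterator((ids_to_string(s) for s in sequences), trainer=trainer)

# Tokens come back as the shifted characters; ord(token_char) - OFFSET recovers the IDs.
print(tokenizer.encode(ids_to_string([5, 7, 7, 3])).tokens)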

lkurlandski commented 3 months ago

On converting bytes to chars, my initial solution looked like this:

def bytes_to_str_ascii(b: bytes) -> str:
    # latin1 maps each byte 0-255 to exactly one character, so this is lossless.
    return b.decode("latin1")

Unfortunately, this caused some weird behavior inside tokenizers itself. The strings produced this way contain a lot of backslashes (if I remember correctly, consecutive occurrences of the newline character, e.g. "\n\n", were causing the problems). I can't read Rust, so I was unable to figure out why this was happening.

The simple solution was to just map bytes to higher code points, e.g.,

# Map each byte value 0-255 to a character well above the ASCII range.
BYTE_TO_UTF8 = tuple(chr(i + 10752) for i in range(256))

def bytes_to_str_utf8(b: bytes) -> str:
    return "".join(BYTE_TO_UTF8[i] for i in b)

However, this turned out to be slow and became a bottleneck for me, so I rewrote it in Cython:

# bytes_to_str_utf8.pyx

from cpython.bytes cimport PyBytes_GET_SIZE, PyBytes_AS_STRING
from cpython.unicode cimport PyUnicode_FromKindAndData, PyUnicode_4BYTE_KIND
from libc.stdlib cimport malloc, free

# Lookup table: one Py_UCS4 slot per byte value, each shifted by 10752.
cdef Py_UCS4* BYTE_TO_UTF8 = <Py_UCS4*>malloc(256 * sizeof(Py_UCS4))

# Initialize the lookup table
cdef int i
for i in range(256):
    BYTE_TO_UTF8[i] = i + 10752

def bytes_to_str_utf8(bytes b):
    cdef Py_ssize_t length = PyBytes_GET_SIZE(b)
    cdef Py_UCS4* result = <Py_UCS4*>malloc(length * sizeof(Py_UCS4))
    cdef Py_ssize_t i
    cdef char* byte_ptr = PyBytes_AS_STRING(b)

    # Translate each byte through the lookup table into its shifted code point.
    for i in range(length):
        result[i] = BYTE_TO_UTF8[<unsigned char>byte_ptr[i]]

    # Build the Python string directly from the UCS-4 buffer, then free it.
    string = PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, result, length)
    free(result)
    return string

To work with integers instead of bytes, all you have to do is slightly tweak that code to handle values larger than 255, i.e. beyond the byte range.
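
For illustration, roughly how that tweak could look in plain Python (the offset and names are just placeholders; the Cython version would change the same way, iterating over an int buffer instead of bytes):

OFFSET = 10752  # same offset as above; any range of unused code points works

def ints_to_str(ids: list[int]) -> str:
    # One character per integer ID; i + OFFSET must not exceed 0x10FFFF (chr's limit).
    return "".join(chr(i + OFFSET) for i in ids)

def str_to_ints(s: str) -> list[int]:
    # Inverse mapping to recover the original integer IDs from tokenizer output.
    return [ord(c) - OFFSET for c in s]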

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.