guidance-ai / guidance

A guidance language for controlling large language models.

Trouble with loading mistral via transformers #743

Open riedgar-ms opened 7 months ago

riedgar-ms commented 7 months ago

The bug

On a freshly created conda environment, attempting to load mistral-7b via Hugging Face fails.

To Reproduce

This is based on PR #741

git checkout riedgar-ms/enable-transformers-7b

conda create -n guidance-312 python=3.12
conda activate guidance-312

pip install -e .[test]
pip install accelerate llama-cpp-python

python -m pytest --selected_model transformers_mistral_7b_gpu .\tests\library\test_gen.py::test_various_regexes

I wind up with errors:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\utils.py:19: in get_model
    return get_transformers_model(model_name[13:], caching, **kwargs)
tests\utils.py:60: in get_transformers_model
    transformers_model_cache[key] = guidance.models.Transformers(
guidance\models\transformers\_transformers.py:209: in __init__
    TransformersEngine(model, tokenizer, compute_log_probs, **kwargs),
guidance\models\transformers\_transformers.py:114: in __init__
    TransformersTokenizer(model, tokenizer),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <guidance.models.transformers._transformers.TransformersTokenizer object at 0x000001741467D130>, model = 'mistralai/Mistral-7B-v0.1'
tokenizer = LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884...special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
ignore_bos_token = False

    def __init__(self, model, tokenizer, ignore_bos_token=False):
        if tokenizer is None:
            tokenizer = self._tokenizer(model)

        self._orig_tokenizer = tokenizer

        # build out the set of byte_string tokens
        byte_tokens = []
        if hasattr(tokenizer, "byte_decoder"):
            byte_decoder = tokenizer.byte_decoder

            for i in range(len(tokenizer)):
                byte_coded = bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
                byte_tokens.append(byte_coded)

        elif hasattr(tokenizer, "sp_model"):
            space_prefix = '▁'.encode()
            for i in range(len(tokenizer)):
                byte_coded = re.sub(br'<0x(..)>', lambda x: bytes.fromhex(x[1].decode()), tokenizer.sp_model.id_to_piece(i).encode())
                byte_tokens.append(byte_coded.replace(space_prefix, b" "))

        else:
            import transformers
            byte_decoder = transformers.AutoTokenizer.from_pretrained("gpt2", use_fast=False).byte_decoder # fall back to gpt2 mapping

            # some special tokens may not have their whitespace encoded...
            byte_decoder[' '] = 32
            byte_decoder['\n'] = 10
            byte_decoder['\r'] = 13
            byte_decoder['\t'] = 9
            byte_decoder['▁'] = 32

            # run a quick spot check to verify we can rebuild complex multi-token unicode symbols
            s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
            t = tokenizer
            reconstructed = b''
            for id in t(s)["input_ids"]:
>               reconstructed += bytes([byte_decoder[c] for c in t.convert_ids_to_tokens(id)])
E               KeyError: '’'

System info (please complete the following information):

riedgar-ms commented 7 months ago

Note that, AFAICT, the actions on the PR are being OOM-killed (or running out of disk space). However, that's a separate problem.

yonitjio commented 7 months ago

In my case, this happens because the Mistral tokenizer falls back to the fast tokenizer, which leaves sp_model missing; installing sentencepiece solved it for me.
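
For reference, here is a quick way to see the difference outside guidance (a minimal sketch, assuming sentencepiece is installed and the mistralai/Mistral-7B-v0.1 checkpoint is accessible):

    # The slow, SentencePiece-backed tokenizer exposes sp_model; the fast
    # tokenizer does not, which is what sends guidance down the gpt2
    # byte_decoder fallback branch.
    from transformers import AutoTokenizer

    slow = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)  # needs sentencepiece
    fast = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True)

    print(hasattr(slow, "sp_model"))  # True
    print(hasattr(fast, "sp_model"))  # False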

But then I got an error during token cleanup.

So I modified it like this:

            # ugly hack to deal with sentencepiece craziness of space hiding after special tokens
            # TODO: figure out how to make this more robust
            diff = token_byte_positions[-1] - last_pos
            if diff > 0:
                for _ in range(diff):
                    if self.tokenizer.tokens[token_ids[0]] == b'<s>' \
                        and self.tokenizer.tokens[token_ids[1]][0:1] == b' ':
                        for i in range(1, len(token_byte_positions)):
                            token_byte_positions[i] -= 1
            assert token_byte_positions[-1] == last_pos
riedgar-ms commented 6 months ago

Hmmm.... adding sentencepiece to my pip installs is at least allowing my tests to get further. However, things are running a bit slowly, and I don't know if they will succeed yet.

yonitjio commented 6 months ago

Forgive me if I'm wrong,

The problem occurs because the default gpt2 byte encoder doesn't cover all Unicode characters.

This is the list from gpt2's bytes_to_unicode, which the byte_encoder is built from:

    list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))

From GPT2Tokenizer init function:

        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}

The string ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨ fails since it contains characters that are not in that list.
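
To make that concrete, here is a small check (a sketch, separate from the test suite) that looks up each character of the spot-check string in the gpt2 byte_decoder:

    # The gpt2 byte_decoder only has keys for the 256 characters produced by
    # bytes_to_unicode(), so characters such as '’' are not in it, and the
    # reconstruction loop in the traceback above raises KeyError.
    import transformers

    byte_decoder = transformers.AutoTokenizer.from_pretrained(
        "gpt2", use_fast=False
    ).byte_decoder

    s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
    missing = [c for c in s if c not in byte_decoder]
    print(missing)  # non-empty; the first missing character is '’'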

So, the question is, is it necessary to check this string?

riedgar-ms commented 6 months ago

The assert on that string should definitely be moved to a separate test. That might let some things work, but the underlying problem would still remain - the model can't cope with some valid unicode strings.

yonitjio commented 6 months ago

I think this should just give a warning instead.

I mean, the original issue with Mistral can already be solved by installing sentencepiece.

The gpt2 fallback is already a worst-case scenario, right? And realistically, it's not possible to support every model out there.

Just give a warning that the model has no byte decoder, or some other message to inform the user.

LuoKaiGSW commented 5 months ago

sentencepiece

Hey @yonitjio, I don't understand why installing sentencepiece would solve this problem. According to the code, it seems like it would still go to the branch that uses gpt2?

(screenshot of the relevant code)
yonitjio commented 5 months ago

If you don't install sentencepiece, the tokenizer will fall back to the fast tokenizer, which doesn't have sp_model.

See here https://github.com/guidance-ai/guidance/blob/e234c565b61ffb90dbbf81cd937a00505ef79649/guidance/models/transformers/_transformers.py#L99

LuoKaiGSW commented 5 months ago

I mean the original issue with mistral can already be solved with installing sentencepiece.

If you don't install sentencepiece, the tokenizer will fallback to fast tokenizer which doesn't have sp_model.

See here

https://github.com/guidance-ai/guidance/blob/e234c565b61ffb90dbbf81cd937a00505ef79649/guidance/models/transformers/_transformers.py#L99

I understand what you mean, but I'm currently using BloomTokenizer, and with it I can only set use_fast=True, because only tokenization_bloom_fast.py exists. As a result, the tokenizer I get has neither the byte_decoder nor the sp_model attribute. My guess is that all fast tokenizers use the same byte-to-Unicode mapping as gpt2, so gpt2's byte_decoder can be used as a substitute.
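
If that guess holds, a round-trip check along these lines should pass (a sketch; bigscience/bloom-560m is used here only as a stand-in Bloom checkpoint):

    # Byte-level BPE fast tokenizers (gpt2, Bloom) emit tokens in the
    # bytes_to_unicode alphabet, so the gpt2 byte_decoder should map every
    # character of every token back to a byte.
    import transformers

    byte_decoder = transformers.AutoTokenizer.from_pretrained(
        "gpt2", use_fast=False
    ).byte_decoder
    bloom = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")

    s = "hello ’•¶∂ƒ˙∆£ world"
    ids = bloom(s)["input_ids"]
    reconstructed = b"".join(
        bytes([byte_decoder[c] for c in bloom.convert_ids_to_tokens(i)])
        for i in ids
    )
    print(reconstructed.decode("utf-8") == s)  # expected True if the guess is right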

yonitjio commented 5 months ago

I suppose so.

But as I said before, I don't think it's realistic to support every model out there (for now?).

I can only think of one other option besides giving a warning to the user: allow a custom function for this.
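
Something along these lines, purely to illustrate the idea (the byte_tokens_fn hook and its name are hypothetical, not an existing guidance API):

    import warnings
    import transformers

    def gpt2_fallback_byte_tokens(tokenizer):
        """Hypothetical pluggable fallback: warn on unmappable tokens instead of failing."""
        byte_decoder = transformers.AutoTokenizer.from_pretrained(
            "gpt2", use_fast=False
        ).byte_decoder
        byte_tokens = []
        for i in range(len(tokenizer)):
            token = tokenizer.convert_ids_to_tokens(i)
            try:
                byte_tokens.append(bytes([byte_decoder[c] for c in token]))
            except KeyError:
                warnings.warn(f"Token {token!r} is not byte-decodable; using utf-8 bytes instead.")
                byte_tokens.append(token.encode("utf-8"))
        return byte_tokens

    # Hypothetical usage, if such a hook existed:
    # lm = guidance.models.Transformers("mistralai/Mistral-7B-v0.1", byte_tokens_fn=gpt2_fallback_byte_tokens)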