microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Allow model to generate added tokens - fix generation issue in Llama3 models #473

Closed · weiqisun closed this 4 days ago

weiqisun commented 1 month ago

Currently MII uses vocab_size to truncate the logits generated by the model, and this value is set to tokenizer.vocab_size in modeling.tokenizers.HFTokenizer.

However, in transformers the vocab_size attribute of a tokenizer only counts the base vocabulary size. It doesn't include additional tokens added to the tokenizer:

    @property
    def vocab_size(self) -> int:
        """
        `int`: Size of the base vocabulary (without the added tokens).
        """
        return self._tokenizer.get_vocab_size(with_added_tokens=False)

This means MII can never generate any of the added tokens in the vocabulary. In Llama 3, all special tokens, including the bos token, the eos token, and the other special tokens, are added tokens:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
>>> len(tokenizer)
128256
>>> len(tokenizer.vocab)
128256
>>> tokenizer.vocab_size
128000
>>> print(tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.vocab["<|eot_id|>"])
128000 128001 128009

Due to this issue, the logits of the special tokens in Llama 3 models are truncated, so neither the eos token nor the <|eot_id|> token can ever be produced by the decoding step, and generation never stops until it reaches the max_new_tokens or max_length limit. To reproduce this issue, load any Llama 3 model and try any prompt: the generation will never stop before hitting the length limit.
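A toy illustration (not MII's actual code) of why this truncation hides the stop tokens, using the Llama 3 vocabulary sizes shown above:

    import torch

    full_logits = torch.randn(1, 128256)  # full Llama 3 vocab, added tokens included
    truncated = full_logits[:, :128000]   # truncated at tokenizer.vocab_size
    # <|eot_id|> has id 128009, which no longer exists after truncation,
    # so no decoding strategy over `truncated` can ever select it:
    assert truncated.argmax(dim=-1).item() < 128000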

Instead of tokenizer.vocab_size, len(tokenizer) should be used, since it includes the added tokens:

    def __len__(self) -> int:
        """
        Size of the full vocabulary with the added tokens.
        """
        return self._tokenizer.get_vocab_size(with_added_tokens=True)
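
For context, a minimal sketch of what the fix looks like, assuming an HFTokenizer wrapper that stores the truncation size in its constructor (attribute names here are illustrative, not verbatim from the repo):

    class HFTokenizer:
        def __init__(self, tokenizer) -> None:
            self.tokenizer = tokenizer
            # len(tokenizer) counts the added tokens (128256 for Llama 3),
            # while tokenizer.vocab_size stops at the base vocabulary (128000),
            # so the logits must be truncated with the former.
            self.vocab_size = len(tokenizer)
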
weiqisun commented 1 month ago

@microsoft-github-policy-service agree

regybean commented 1 month ago

I have tested this with Llama-3-8b and it does not seem to fix the stopping issue:

    import mii

    client = mii.serve("meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel=2)
    response = client.generate(["what is the capital of france, only use one word", "Seattle is what in under 10 words"], max_new_tokens=512)
    print(response)

weiqisun commented 1 month ago

Hi @regybean, the instruction-tuned version of the Llama-3-8B model follows a specific chat template. Try this prompt:

>>> prompt = '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always give helpful responses<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhat is the capital of france, only use one word"<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
>>> client.generate(prompt, max_new_tokens=512)
[Paris]

Without this PR it will generate:

>>> client.generate(prompt, max_new_tokens=64)
[Paris. (Note: I'll try to keep it concise, while still providing accurate information!) Would you like to know more about Paris or France? I'd be happy to help! 🙏) Neighbor).) \               **Note: There was no punctuation in the previous response because I was trying to stay within]
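
As a side note, the chat-formatted prompt above does not need to be written by hand; assuming a reasonably recent transformers release, the tokenizer's chat template should produce an equivalent string:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    messages = [
        {"role": "system", "content": "You are a polite chatbot who always give helpful responses"},
        {"role": "user", "content": "what is the capital of france, only use one word"},
    ]
    # add_generation_prompt=True appends the assistant header so the model replies next
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
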
regybean commented 1 month ago

Hi @weiqisun, I tried your exact example and it did not work for me; I explicitly installed deepspeed-mii from your branch. I even tested the non-instruct version, which did not stop either. The only thing that has worked so far is the following prompt. Perhaps there is an issue with the tokeniser, or using tensor_parallel=2 is causing issues?

    client = mii.serve("NousResearch/Meta-Llama-3-8B", tensor_parallel=2)
    prompt = '"how can i bake a cake, only use 20 word. End your message with the stop token which is <|eot_id|>"'
    response = client.generate(prompt, max_new_tokens=512)
    print(response)

[edit] Llama changed their Hugging Face repo as of today, which fixed the eos token issue; a redownload fixed it :)

weiqisun commented 1 month ago

> [edit] Llama changed their Hugging Face repo as of today, which fixed the eos token issue; a redownload fixed it :)

Cool, I'm glad to hear the issue is resolved :). Right, the instruction-tuned version of the Llama 3 model uses a different eos token (<|eot_id|>) than the original eos token specified in the pre-trained base model (<|end_of_text|>). Either downloading the latest model files or setting the eos_token_id in the tokenizer/model config can fix the eos token issue.
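
For anyone who cannot simply re-download, a sketch of the config-side workaround (exact fields may vary across transformers versions):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    # Point the eos token at <|eot_id|> (id 128009) instead of <|end_of_text|>:
    tokenizer.eos_token = "<|eot_id|>"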

weiqisun commented 1 month ago

@mrwyattii @awan-10 Can you please take a look at this quick PR? Thanks!

weiqisun commented 1 month ago

@awan-10 @lekurile Thanks for approving this PR! One unit test failed due to a model download error (see the detailed error log below). This PR has nothing to do with the failed test or the model download; the failure comes from a filesystem lock issue. Do you have any idea why this happens?

==================================== ERRORS ====================================
_ ERROR at setup of test_local_model_dir[True-nofail-text-generation-None-1-auto-facebook/opt-125m-True] _

model_name = 'facebook/opt-125m', local_model = True
tmpdir = local('/tmp/pytest-of-root/pytest-0/test_local_model_dir_True_nofa0')

    @pytest.fixture(scope="function")
    def model_path(model_name, local_model, tmpdir):
        if not local_model:
            return None

        base_dir = os.getenv("HF_HOME", tmpdir)
        download_dir = os.path.join(base_dir, "mii-ci-models", model_name)
>       snapshot_download(model_name, local_dir=download_dir)

conftest.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/_snapshot_download.py:294: in snapshot_download
    thread_map(
/usr/local/lib/python3.8/dist-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
/usr/local/lib/python3.8/dist-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
/usr/local/lib/python3.8/dist-packages/tqdm/std.py:1178: in __iter__
    for obj in iterable:
/usr/lib/python3.8/concurrent/futures/_base.py:619: in result_iterator
    yield fs.pop().result()
/usr/lib/python3.8/concurrent/futures/_base.py:444: in result
    return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
    raise self._exception
/usr/lib/python3.8/concurrent/futures/thread.py:57: in run
    result = self.fn(*self.args, **self.kwargs)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/_snapshot_download.py:268: in _inner_hf_hub_download
    return hf_hub_download(
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
    return fn(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/file_download.py:1202: in hf_hub_download
    return _hf_hub_download_to_local_dir(
/usr/local/lib/python3.8/dist-packages/huggingface_hub/file_download.py:1406: in _hf_hub_download_to_local_dir
    paths = get_local_download_paths(local_dir=local_dir, filename=filename)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/_local_folder.py:138: in get_local_download_paths
    metadata_path = _huggingface_dir(local_dir) / "download" / f"{sanitized_filename}.metadata"
/usr/local/lib/python3.8/dist-packages/huggingface_hub/_local_folder.py:223: in _huggingface_dir
    with WeakFileLock(gitignore_lock):
/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_fixes.py:84: in WeakFileLock
    lock.acquire()
/usr/local/lib/python3.8/dist-packages/filelock/_api.py:295: in acquire
    self._acquire()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._unix.UnixFileLock object at 0x70ff07856fd0>

    def _acquire(self) -> None:
        ensure_directory_exists(self.lock_file)
        open_flags = os.O_RDWR | os.O_TRUNC
        if not Path(self.lock_file).exists():
            open_flags |= os.O_CREAT
>       fd = os.open(self.lock_file, open_flags, self._context.mode)
E       FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pytest-of-root/pytest-0/test_local_model_dir_True_nofa0/mii-ci-models/facebook/opt-125m/.huggingface/.gitignore.lock'

/usr/local/lib/python3.8/dist-packages/filelock/_unix.py:42: FileNotFoundError
loadams commented 4 days ago

> @awan-10 @lekurile Thanks for approving this PR! One unit test failed due to a model download error (see the detailed error log below). This PR has nothing to do with the failed test or the model download; the failure comes from a filesystem lock issue. Do you have any idea why this happens?
> [...]

@weiqisun - this appears to be resolved now. Do we need to take care of any other reviews/changes on this PR before merging?

weiqisun commented 4 days ago

> @weiqisun - this appears to be resolved now. Do we need to take care of any other reviews/changes on this PR before merging?

Thanks @loadams! I just updated the model names used in the new unit tests to get rid of the following model authentication error. The model names are now consistent with those used in the other unit tests.

Access to model mistralai/Mistral-7B-v0.1 is restricted. You must be authenticated to access it.
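
(mistralai/Mistral-7B-v0.1 is a gated repo on the Hub, so CI would have needed a token, e.g. via the HF_TOKEN environment variable; switching the tests to ungated models is the simpler fix.)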

It is good to go now.