h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0

MPT-7B, 30B RuntimeError: Placeholder storage has not been allocated on MPS device! #463

Closed. wizzard0 closed this issue 11 months ago.

wizzard0 commented 1 year ago

Trying to use MPT models with h2ogpt:

  1. python generate.py --base_model=mosaicml/mpt-7b-chat --score_model=None
  2. enter any prompt

Expected behavior: model is loaded and used.

Observed behavior:

To create a public link, set `share=True` in `launch()`.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/Users/user/dev/lib/python3.10/site-packages/transformers/generation/utils.py:1452: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on mps, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
  warnings.warn(
thread exception: (<class 'RuntimeError'>, RuntimeError('Placeholder storage has not been allocated on MPS device!'), <traceback object at 0x8fe882200>)
make stop: (<class 'RuntimeError'>, RuntimeError('Placeholder storage has not been allocated on MPS device!'), <traceback object at 0x8fe882200>)
hit stop
Traceback (most recent call last):
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/blocks.py", line 1093, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/utils.py", line 341, in async_iteration
    return await iterator.__anext__()
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/utils.py", line 334, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/user/dev/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/user/dev/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/Users/user/dev/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/Users/user/dev/lib/python3.10/site-packages/gradio/utils.py", line 317, in run_sync_iterator_async
    return next(iterator)
  File "/Users/user/dev/h2ogpt/src/gradio_runner.py", line 1428, in bot
    for res in get_response(fun1, history):
  File "/Users/user/dev/h2ogpt/src/gradio_runner.py", line 1385, in get_response
    for output_fun in fun1():
  File "/Users/user/dev/h2ogpt/src/gen.py", line 2011, in evaluate
    raise thread.exc
  File "/Users/user/dev/h2ogpt/src/utils.py", line 340, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/Users/user/dev/h2ogpt/src/gen.py", line 2114, in generate_with_exceptions
    func(*args, **kwargs)
  File "/Users/user/dev/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/user/dev/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/Users/user/dev/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/Users/user/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 270, in forward
    outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
  File "/Users/user/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 168, in forward
    tok_emb = self.wte(input_ids)
  File "/Users/user/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/custom_embedding.py", line 11, in forward
    return super().forward(input)
  File "/Users/user/dev/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/Users/user/dev/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Placeholder storage has not been allocated on MPS device!

Mathanraj-Sharma commented 1 year ago

I haven't tried this model yet; let me try it and update.

wizzard0 commented 1 year ago

Updating PyTorch to 2.1 changes the error to:

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
The model 'OptimizedModule' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', (...)

and generation works, but it is VERY slow (i.e. mpt-7B is much slower than WizardLM-30B). GPU load is 0%; only the CPU is used.
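For what it's worth, a quick way to check (not part of the original report) whether PyTorch sees the MPS backend at all, and where a loaded model actually lives:

```python
import torch

print(torch.backends.mps.is_available())  # False would explain 0% GPU load
print(torch.backends.mps.is_built())      # whether this torch build includes MPS support

# For an already-loaded Hugging Face model `model` (name assumed here),
# the device of its parameters shows whether it really landed on MPS
# or silently stayed on the CPU:
# print(next(model.parameters()).device)
```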

Mathanraj-Sharma commented 1 year ago

I was able to start loading the model on MPS after tweaking the model's config. But I get this error, since I don't have enough memory:

RuntimeError: MPS backend out of memory (MPS allocated: 17.27 GB, other allocations: 819.94 MB, max allowed: 18.13 GB). Tried to allocate 192.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the upper limit for memory allocations (may cause system failure).
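The workaround the error message itself suggests would be to set that variable before launching, e.g. (a sketch; as the message warns, disabling the limit can destabilize the system):

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 python generate.py --base_model=mosaicml/mpt-7b-chat --score_model=None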

I have a 16 GB Mac M1 (macOS 13.4.1 (c)).

But I can see the GPU load when attempting to load the model.

@wizzard0 if you have more memory than mine, please try this:
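A minimal sketch of that tweak, assuming the same override can be applied through transformers instead of hand-editing the cached config.json (the model id is from the report above; everything else is illustrative):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Override MPT's init_device instead of editing config.json by hand;
# "mps" targets Apple's Metal backend.
config = AutoConfig.from_pretrained("mosaicml/mpt-7b-chat", trust_remote_code=True)
config.init_device = "mps"

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-chat",
    config=config,
    torch_dtype=torch.float16,  # halves memory vs float32 on a 16 GB machine
    trust_remote_code=True,
)
```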

If that solves your problem, let me know.

cc: @pseudotensor

wizzard0 commented 1 year ago

Ohh, THAT's where `init_device` should be entered! 🤣 I tried to fit it into many places (the config suggests init_device "meta", so I will try that too).

wizzard0 commented 1 year ago

Um, nope. The model loads, but then when I click submit:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
thread exception: (<class 'RuntimeError'>, RuntimeError("User specified an unsupported autocast device_type 'mps'"), <traceback object at 0x28a42b440>)
make stop: (<class 'RuntimeError'>, RuntimeError("User specified an unsupported autocast device_type 'mps'"), <traceback object at 0x28a42b440>)
hit stop
Traceback (most recent call last):
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/blocks.py", line 1093, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 341, in async_iteration
    return await iterator.__anext__()
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 334, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 317, in run_sync_iterator_async
    return next(iterator)
  File "/Users/user/dev/h2o/h2ogpt/src/gradio_runner.py", line 1428, in bot
    for res in get_response(fun1, history):
  File "/Users/user/dev/h2o/h2ogpt/src/gradio_runner.py", line 1385, in get_response
    for output_fun in fun1():
  File "/Users/user/dev/h2o/h2ogpt/src/gen.py", line 2009, in evaluate
    raise thread.exc
  File "/Users/user/dev/h2o/h2ogpt/src/utils.py", line 340, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/Users/user/dev/h2o/h2ogpt/src/gen.py", line 2112, in generate_with_exceptions
    func(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 270, in forward
    outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 202, in forward
    (x, attn_weights, past_key_value) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/blocks.py", line 35, in forward
    a = self.norm_1(x)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/norm.py", line 24, in forward
    with torch.autocast(enabled=False, device_type=module_device.type):
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 222, in __init__
    raise RuntimeError(f'User specified an unsupported autocast device_type \'{self.device}\'')
RuntimeError: User specified an unsupported autocast device_type 'mps'

Mathanraj-Sharma commented 1 year ago

@wizzard0 what torch version are you using, and which h2ogpt branch are you on?

wizzard0 commented 1 year ago

init_device=meta fails similarly, though:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/Users/user/dev/h2o/lib/python3.10/site-packages/transformers/generation/utils.py:1452: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on mps, whereas the model is on meta. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('meta') before running `.generate()`.
  warnings.warn(
thread exception: (<class 'RuntimeError'>, RuntimeError('Tensor on device meta is not on the expected device mps:0!'), <traceback object at 0x2d35a2840>)
make stop: (<class 'RuntimeError'>, RuntimeError('Tensor on device meta is not on the expected device mps:0!'), <traceback object at 0x2d35a2840>)
hit stop
Traceback (most recent call last):
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/blocks.py", line 1093, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 341, in async_iteration
    return await iterator.__anext__()
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 334, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/gradio/utils.py", line 317, in run_sync_iterator_async
    return next(iterator)
  File "/Users/user/dev/h2o/h2ogpt/src/gradio_runner.py", line 1428, in bot
    for res in get_response(fun1, history):
  File "/Users/user/dev/h2o/h2ogpt/src/gradio_runner.py", line 1385, in get_response
    for output_fun in fun1():
  File "/Users/user/dev/h2o/h2ogpt/src/gen.py", line 2009, in evaluate
    raise thread.exc
  File "/Users/user/dev/h2o/h2ogpt/src/utils.py", line 340, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/Users/user/dev/h2o/h2ogpt/src/gen.py", line 2112, in generate_with_exceptions
    func(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 270, in forward
    outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 192, in forward
    (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=torch.float32, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py", line 115, in _attn_bias
    attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/_refs/__init__.py", line 5077, in masked_fill
    r = torch.where(mask, value, a)  # type: ignore[arg-type]
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 227, in _fn
    result = fn(*args, **kwargs)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 130, in _fn
    result = fn(**bound.arguments)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/_refs/__init__.py", line 1872, in where
    utils.check_same_device(pred, a, b, allow_cpu_scalar_tensors=True)
  File "/Users/user/dev/h2o/lib/python3.10/site-packages/torch/_prims_common/__init__.py", line 673, in check_same_device
    raise RuntimeError(msg)
RuntimeError: Tensor on device meta is not on the expected device mps:0!
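
In hindsight this failure is expected: the meta device records only shapes and dtypes and holds no storage, so meta-initialized weights cannot take part in real computation alongside MPS tensors. A quick illustration (not from the thread):

```python
import torch

t = torch.empty(2, 2, device="meta")  # records shape/dtype only, no storage
print(t.device)  # meta
print(t.shape)   # torch.Size([2, 2]); metadata queries are fine
# t.item(), t.tolist(), or mixing t with real mps/cpu tensors would fail,
# because there is no actual data behind a meta tensor.
```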

Mathanraj-Sharma commented 1 year ago

@wizzard0 please give me the following:

  1. The torch version you are using
  2. The h2ogpt branch you are on
  3. Your Mac specs

wizzard0 commented 1 year ago

1. Torch version:

   python -c "import torch; print(torch.__version__)"
   2.1.0.dev20230716

2. On branch main, just updated: ff674651c84ab4083b352bfd1e21fd3999a74e51
3. Apple M2 Max 96GB, Ventura 13.4.1 (c)

Mathanraj-Sharma commented 1 year ago

cc: @wizzard0 if you are on main (just updated), could you please create a fresh conda environment and try running the model as described in https://github.com/h2oai/h2ogpt/issues/463#issuecomment-1640550956

wizzard0 commented 1 year ago

@Mathanraj-Sharma yes, that's what I just did; one stacktrace is with init_device=mps, the other with init_device=meta. Both stacktraces occur after submitting the prompt.

Mathanraj-Sharma commented 1 year ago

@wizzard0 could you please also change torch_dtype in config.json to float16 and see if that helps.

Just a hunch: https://github.com/pytorch/pytorch/issues/78168#issuecomment-1137686403
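For reference, the same config-override approach sketched earlier could apply this hunch too, instead of hand-editing the cached config.json (an assumption about equivalent mechanics, not the exact steps used in the thread):

```python
from transformers import AutoConfig

# Mirror the suggested config.json edits programmatically (a sketch).
config = AutoConfig.from_pretrained("mosaicml/mpt-7b-chat", trust_remote_code=True)
config.init_device = "mps"      # as in the earlier comment
config.torch_dtype = "float16"  # the hunch from the linked pytorch issue
```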

wizzard0 commented 1 year ago

> @wizzard0 could you please also change torch_dtype in config.json to float16 and see if that helps.
>
> just a hunch pytorch/pytorch#78168 (comment)

No change; still `User specified an unsupported autocast device_type 'mps'`.

Mathanraj-Sharma commented 1 year ago

@wizzard0 as I mentioned earlier, you can load the model onto MPS by changing `init_device` in the model's config.json.

The error related to autocast comes from PyTorch: `torch.autocast` does not support MPS yet.

As a mitigation in h2ogpt, we pass a NullContext in that situation: https://github.com/h2oai/h2ogpt/blob/70a377c763abd278e57eced1a7aff7f468676ed7/src/gen.py#L1942
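For illustration, the pattern looks roughly like this (names are illustrative, not the exact h2ogpt code; see the linked line in src/gen.py for the real implementation):

```python
import contextlib
import torch

def autocast_context(device_type: str):
    # torch.autocast rejected "mps" at the time, so fall back to a no-op
    # context manager on devices it does not support.
    if device_type in ("cuda", "cpu"):
        return torch.autocast(device_type=device_type)
    return contextlib.nullcontext()  # NullContext-style fallback for mps

with autocast_context("mps"):
    pass  # the model's forward/generate call would run here
```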

The MPT model triggers the error here:

 File "/Users/user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/norm.py", line 24, in forward
    with torch.autocast(enabled=False, device_type=module_device.type):

So this needs to be fixed on MPT's side.

cc: @pseudotensor

Mathanraj-Sharma commented 11 months ago

@pseudotensor closing this since it is inactive, and the problem looks like it needs to be fixed on MPT's end.