intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that makes it easy to obtain extra performance on Intel platforms.
Apache License 2.0

VIDEO_SCHEDULER_INTERNAL_ERROR #544

Closed: pmusser closed this issue 3 months ago

pmusser commented 7 months ago

Describe the bug

After following the steps included in the blog post that came out a few days ago, I modified the code to try to interact with the google/gemma-7b model in a Jupyter notebook. Code as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

############# code changes ###############
# import ipex
import intel_extension_for_pytorch as ipex
# verify Intel Arc GPU
print(ipex.xpu.get_device_name(0))
##########################################

# load model
model_id = "google/gemma-7b"
dtype = torch.float16

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

############# code changes ###############
# move to Intel Arc GPU
model = model.eval().to("xpu")
##########################################

# generate 
with torch.inference_mode(), torch.no_grad(), torch.autocast(
        ############# code changes ###############
        device_type="xpu",
        ##########################################
        enabled=True,
        dtype=dtype
    ):
    text = "You may have heard of Schrodinger cat mentioned in a thought experiment in quantum physics. Briefly, according to the Copenhagen interpretation of quantum mechanics, the cat in a sealed box is simultaneously alive and dead until we open the box and observe the cat. The macrostate of cat (either alive or dead) is determined at the moment we observe the cat."
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    ############# code changes ###############
    # move to Intel Arc GPU
    input_ids = input_ids.to("xpu")
    ##########################################
    generated_ids = model.generate(input_ids, max_new_tokens=128)[0]
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(generated_text)

The code started executing successfully, but after the model was transferred to the GPU my display started showing artifacts (Chrome windows blanking and resizing), and shortly after that the computer went to a BSOD of VIDEO_SCHEDULER_INTERNAL_ERROR, as follows:

The computer has rebooted from a bugcheck. The bugcheck was: 0x00000119 (0x0000000000000005, 0xffffe30e54c27000, 0xffffe30e5468a030, 0x0000000000050ec1). A dump was saved in: C:\WINDOWS\MEMORY.DMP. Report Id: 96e4c860-9dc4-49b8-a14f-e02f85d20f5e.

Versions

PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: No
IPEX version: 2.1.10+xpu
IPEX commit: a12f9f650
Build type: Release

OS: Microsoft Windows 11 Pro
GCC version: N/A
Clang version: N/A
IGC version: 2024.0.2 (2024.0.2.20231213)
CMake version: version 3.28.0-msvc1
Libc version: N/A

Python version: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15930MB, max_compute_units=512, gpu_eu_count=512)
Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:
Architecture=9
CurrentClockSpeed=3600
DeviceID=CPU0
Family=107
L2CacheSize=4096
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3600
Name=AMD Ryzen 7 3700X 8-Core Processor
ProcessorType=3
Revision=28928

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.26.4
[pip3] torch==2.1.0a0+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.10+xpu pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.1.0a0+cxx11.abi pypi_0 pypi

pmusser commented 7 months ago

Just a note, I also tried with EleutherAI/gpt-j-6b and the issue happened again.

pmusser commented 7 months ago

Don't know if it's helpful, but I had the performance tab in Task Manager open with the GPU selected and recorded a video of what it was doing when it crashed. This is the last frame before the BSOD; it looks like it crashes right when dedicated GPU memory usage hits 100%. [Screenshot 2024-02-27 162606]

kta-intel commented 7 months ago

Based on that screenshot, it seems like it might be an issue with dedicated GPU memory. We can try to reproduce on our end. Were you able to run it with the Llama-2-7b-hf model from the blog? And if so, how was your memory usage during that run?
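For reference, memory usage can also be tracked from inside the notebook rather than Task Manager by reading the XPU memory counters around the generate call. This is a minimal sketch, assuming the torch.xpu memory-management helpers exposed by IPEX and the model/input_ids from the script above:

import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device and the torch.xpu.* helpers

def log_xpu_memory(tag):
    # Counters are reported in bytes; convert to MB for readability.
    allocated = torch.xpu.memory_allocated() / 1024**2
    peak = torch.xpu.max_memory_allocated() / 1024**2
    total = torch.xpu.get_device_properties(0).total_memory / 1024**2
    print(f"[{tag}] allocated: {allocated:.0f} MB, peak: {peak:.0f} MB, device total: {total:.0f} MB")

log_xpu_memory("after model.to('xpu')")
generated_ids = model.generate(input_ids, max_new_tokens=128)[0]
log_xpu_memory("after generate")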

pmusser commented 7 months ago

@kta-intel Not yet -- I meant to request access but hadn't done so. Will try ASAP (the Hugging Face site is down at present).

Incidentally, I also discovered that if I use the Arc A770 without IPEX for the zero-shot-classification pipeline with facebook/bart-large-mnli, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, or MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli, the same thing happens -- but only when I shut down the Python kernel it was running in. It otherwise runs without issue!
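The exact invocation of that zero-shot run isn't shown; one plausible shape of it is sketched below. Passing device="xpu" straight to pipeline is an assumption (it requires a transformers version that accepts device strings), and the IPEX import is still needed for torch to recognize the xpu device at all:

import torch
from transformers import pipeline
import intel_extension_for_pytorch as ipex  # needed so torch recognizes the "xpu" device

# Zero-shot classification on the Arc A770; swap in any of the MoritzLaurer models listed above.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device="xpu")
result = classifier("The driver crashed when the kernel shut down.",
                    candidate_labels=["hardware", "software", "user error"])
print(result)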

pmusser commented 7 months ago

@kta-intel Success, after a fashion -- I got llama-2-7b-hf installed and tried a few times to get it to work. The first several tries didn't; the error I got stated that protobuf was required but not installed. I had some trouble getting it installed in the environment I'd set up following the blog instructions (and recognized by Jupyter) -- ultimately I had to activate the environment, load the Jupyter notebook, open a console from within the notebook, pip install protobuf there, and then restart the kernel (a sketch of that install step is below). Now it works, except for the following error in the output, after the checkpoint shards load and before the generated text prints:

Intel(R) Arc(TM) A770 Graphics
Loading checkpoint shards: 100%
 2/2 [00:00<00:00,  7.16it/s]
~~Keyword arguments {'add_special_tokens': False} not recognized.~~
You may have heard of Schrodinger cat mentioned in a thought experiment in quantum physics. Briefly, according to the Copenhagen interpretation of quantum mechanics, the cat in a sealed box is simultaneously alive and dead until we open the box and observe the cat. The macrostate of cat (either alive or dead) is determined at the moment we observe the cat. This is called the Copenhagen interpretation of quantum mechanics.
The quantum world is so strange that it is difficult to understand. This is because the quantum world is so different from our normal everyday world. For example, the quantum world is so strange that it is difficult to understand. In the quantum world, particles can be in different places at the same time. This is called superposition. This is a very strange thing, because in our everyday world, we can only be in one place at a time.
Another strange thing about the quantum world is that particles can be in different states at the same time.

Haven't closed the kernel yet to see if it BSODs; will let you know momentarily. EDIT: closing the kernel doesn't lead to a BSOD!
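For anyone hitting the same protobuf error, a minimal sketch of installing it into the exact environment the notebook kernel is using (so the package lands in the active conda environment rather than some other Python install); the package name protobuf is the only specific taken from the error above:

import subprocess, sys

# Install protobuf with the same interpreter the notebook kernel is running on.
subprocess.check_call([sys.executable, "-m", "pip", "install", "protobuf"])

# Restart the kernel afterwards so transformers picks up the new package.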

pmusser commented 7 months ago

Looks like the missing protobuf may have been the root cause; once I installed it, I was able to run the original code successfully (well, sort of -- see below) without hitting a BSOD.

Now it looks like I just have to grapple with a lack of sufficient memory for google/gemma-7b:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 37
     35     input_ids = input_ids.to("xpu")
     36     ##########################################
---> 37     generated_ids = model.generate(input_ids, max_new_tokens=128)[0]
     38     generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
     40 print(generated_text)

File ~\.conda\envs\llm\lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py:1392, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1389 requires_attention_mask = "encoder_outputs" not in model_kwargs
   1391 if model_kwargs.get("attention_mask", None) is None and requires_attention_mask and accepts_attention_mask:
-> 1392     model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
   1393         inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id
   1394     )
   1396 # decoder-only models should use left-padding for generation
   1397 if not self.config.is_encoder_decoder:
   1398     # If `input_ids` was given, check if the last id in any sequence is `pad_token_id`
   1399     # Note: If using, `inputs_embeds` this check does not work, because we want to be more hands-off.

File ~\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py:476, in GenerationMixin._prepare_attention_mask_for_generation(self, inputs, pad_token_id, eos_token_id)
    469 def _prepare_attention_mask_for_generation(
    470     self,
    471     inputs: torch.Tensor,
    472     pad_token_id: Optional[int],
    473     eos_token_id: Optional[Union[int, List[int]]],
    474 ) -> torch.LongTensor:
    475     is_input_ids = len(inputs.shape) == 2 and inputs.dtype in [torch.int, torch.long]
--> 476     is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)
    477     if isinstance(eos_token_id, int):
    478         eos_token_id = [eos_token_id]

File ~\.conda\envs\llm\lib\site-packages\torch\_tensor.py:1059, in Tensor.__contains__(self, element)
   1054     return handle_torch_function(Tensor.__contains__, (self,), self, element)
   1055 if isinstance(
   1056     element, (torch.Tensor, Number, torch.SymInt, torch.SymFloat, torch.SymBool)
   1057 ):
   1058     # type hint doesn't understand the __contains__ result array
-> 1059     return (element == self).any().item()  # type: ignore[union-attr]
   1061 raise RuntimeError(
   1062     f"Tensor.__contains__ only supports Tensor or scalar, but you passed in a {type(element)}."
   1063 )

RuntimeError: Allocation is out of device memory on current platform.

kta-intel commented 5 months ago

Hey, sorry for the delay. Glad that the original issue was resolved. Regarding the OOM, have you tried quantizing the model to see if it's able to run? e.g. https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm/llm_optimize.html#weight-only-quantization-woq
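Short of weight-only quantization (the recipe in that link), a lighter-weight mitigation is to load the checkpoint directly in half precision, which roughly halves the fp32 weight footprint of the original script; whether gemma-7b then fits in the A770's roughly 16 GB is still not guaranteed. A minimal sketch, assuming the same transformers/IPEX stack as above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import intel_extension_for_pytorch as ipex

model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype=torch.float16 loads the weights as fp16 instead of the default fp32,
# cutting weight memory roughly in half before the move to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")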

jingxu10 commented 3 months ago

Closing due to no response for a long time. Feel free to reopen it if needed.