intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Apache License 2.0

Invalid output and errors using model = ipex.optimize(model): split master weight unsupported, Conv BatchNorm folding failed, Linear BatchNorm folding failed #302

Open nathanodle opened 1 year ago

nathanodle commented 1 year ago

Hi, I'm trying to run inference with a pretrained OFA (OFA-huge) model according to these instructions:

https://github.com/OFA-Sys/OFA/blob/feature/add_transformers/transformers.md

This runs fine on both CPU and CUDA, but using XPU results in gibberish. I also get several warnings, which go away when model = ipex.optimize(model) is commented out. Even though essentially the only change from the CPU/CUDA version is the .to('xpu') part, the model still outputs gibberish.

Warnings from model = ipex.optimize(model):

  warnings.warn(
./OFA-huge
<super: <class 'OFATokenizer'>, <OFATokenizer object>>
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/workspace/pytorch/aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:447: UserWarning: For XPU device, the split master weight is unsupported for now, so temp to disable it
  warnings.warn("For XPU device, the split master weight is unsupported for now, so temp to disable it")
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:457: UserWarning: For XPU device to save valuable device memory, temp to do optimization on inplaced model, so make inplace to be true
  warnings.warn(
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:464: UserWarning: For XPU, the weight prepack and sample input are disabled. The onednn layout is automatically chosen to use
  warnings.warn(
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:486: UserWarning: Conv BatchNorm folding failed during the optimize process.
  warnings.warn("Conv BatchNorm folding failed during the optimize process.")
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:491: UserWarning: Linear BatchNorm folding failed during the optimize process.
  warnings.warn("Linear BatchNorm folding failed during the optimize process.")

XPU output (gibberish): [' this is the ch ch chaval all the is is the word for the band that is']

CPU/CUDA output (correct): [' a black and white photo of a wolf walking through the woods at night.']

I'm running Ubuntu 22.04 with 1.13.10+xpu; the code is below:

import warnings
from PIL import Image
from torchvision import transforms
from transformers import OFATokenizer, OFAModel
import intel_extension_for_pytorch as ipex

chkpt_dir = "./OFA-huge"
path_to_image = "image.jpg"
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
resolution = 256
patch_resize_transform = transforms.Compose([
        lambda image: image.convert("RGB"),
        transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std)
    ])

tokenizer = OFATokenizer.from_pretrained(chkpt_dir)

txt = " what does the image describe?"
inputs = tokenizer([txt], return_tensors="pt").input_ids
img = Image.open(path_to_image)
patch_img = patch_resize_transform(img).unsqueeze(0)

model = OFAModel.from_pretrained(chkpt_dir, use_cache=False)
model = model.to("xpu")
patch_img = patch_img.to("xpu")
inputs = inputs.to("xpu")
model = ipex.optimize(model)

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)

print(tokenizer.batch_decode(gen, skip_special_tokens=True))
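For reference, the CPU/CUDA and XPU runs differ only in the device string and the ipex.optimize() call; a minimal sketch of the toggle (the device variable is illustrative and not part of the original script):

device = "xpu"  # "cpu" and "cuda" both produce the correct caption

model = OFAModel.from_pretrained(chkpt_dir, use_cache=False).to(device)
patch_img = patch_img.to(device)
inputs = inputs.to(device)

if device == "xpu":
    model = ipex.optimize(model)  # commenting this line out makes the warnings above go away

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))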

Image: (test photo attached to the original issue)

Thanks!

jingxu10 commented 1 year ago

Which GPU did you run on?

jingxu10 commented 1 year ago

We will look into this issue.

nathanodle commented 1 year ago

Which GPU did you run on?

Sorry, I should have mentioned that. Arc A770, latest drivers on Ubuntu.

Thank you very much for looking into this, I really appreciate it!

nathanodle commented 1 year ago

Is there an ETA for someone to look at this? Just curious, as I have a project I'm trying to validate on Arc. Thanks!

jingxu10 commented 1 year ago

We are looking into this issue and will update later. It seems some issues have been found.

leuc commented 1 year ago

similar issue while trying to run openai-whisper on A770

     from . import load_model
+    import intel_extension_for_pytorch as ipex

     model = load_model(model_name, device=device, download_root=model_dir)
+    model.eval()
+    model = model.to('xpu')
+    ipex.optimize(model)

whisper --model tiny --language en --task transcribe --device xpu ...

results in

intel_extension_for_pytorch/frontend.py:264: UserWarning: Conv BatchNorm folding failed during the optimize process.
intel_extension_for_pytorch/frontend.py:277: UserWarning: pending the optimization for LSTM

Whisper then fails to decode the tokens.

torch                       1.10.0a0+git3d5f2d4
intel-extension-for-pytorch 1.10.200+gpu
. /opt/intel/oneapi/tbb/2021.8.0/env/vars.sh
. /opt/intel/oneapi/compiler/2022.2.0/env/vars.sh
. /opt/intel/oneapi/mkl/2022.2.0/env/vars.sh
> sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.14.7.0.30_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 9 5900X 12-Core Processor             3.0 [2022.14.7.0.30_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x56a0] 3.0 [22.49.25018.23]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x56a0] 1.3 [1.3.25018]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

> uname -a
Linux 5.17.0-1020-oem #21-Ubuntu SMP PREEMPT Fri Oct 14 09:33:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

nathanodle commented 1 year ago

Update: I have also tried this with an Intel i9-11900K CPU and an A770, with the same result. The first attempt was on an AMD Threadripper. The code does not work on either platform. Is there a timeline for this issue? Thanks so much!

jingxu10 commented 1 year ago

This issue will be fixed in the next release soon.

nathanodle commented 1 year ago

Just a note: I have gotten bad results with every single model I've tried to use with XPU; it's not limited to this model. From my perspective, Arc has been unusable for almost 2 months now. I bought 6 Arc A770s for a project and this has been a waste so far.

I understand that I'm just one user and your team has its own plan. Can you give me anything to help me use these cards, though? Is there a branch I can try, or can you at least provide a release date so I know whether I should keep trying with this hardware? Thanks very much!

jingxu10 commented 1 year ago

This incorrect-output issue has been fixed in the latest code base. The next release is pending; for now, you can try compiling from source with https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh. You need to use oneAPI Base Toolkit 2023.1 and driver 602: https://dgpu-docs.intel.com/releases/stable_602_20230323.html

jingxu10 commented 1 year ago

> similar issue while trying to run openai-whisper on A770 […]

Hi, for now please try compiling the latest code from source. Please refer to the comment above.

leuc commented 1 year ago

Compilation took hours and multiple attempts, but whisper is working with the xpu-master branch and even loads the large model into the A770's 16 GB of VRAM.

$ whisper --language en --model large --device xpu some.mp3
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:484: UserWarning: Split Master Weight feature is not supported on XPU for now, disabled.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:494: UserWarning: To reduce device memory usage on XPU, optimization are done inplace, setting the inplace argument to True.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:500: UserWarning: Weight Prepack and Sample Input are both disabled on XPU. The Onednn Layout is automatically applied.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:506: UserWarning: For XPU, the optimize_lstm(replace lstm with ipex_lstm) is unsupported, so disable it
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:526: UserWarning: Conv BatchNorm folding failed during the optimize process.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:531: UserWarning: Linear BatchNorm folding failed during the optimize process.

Speed looks OK-ish, but given the warnings there is probably room for improvement.

intel_gpu_top shows 52% Render, 75% Blitter, 24% unknown.
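To put a rough number on the speed, one can time a single transcription from Python; a minimal sketch, assuming the IPEX XPU build exposes torch.xpu.synchronize() and reusing the placeholder audio path from the command above:

import time
import torch
import whisper
import intel_extension_for_pytorch as ipex

model = whisper.load_model("large").eval().to("xpu")
model = ipex.optimize(model)

start = time.time()
result = model.transcribe("some.mp3")  # placeholder path from the CLI example above
torch.xpu.synchronize()                # drain any queued XPU work before stopping the clock
print(f"elapsed: {time.time() - start:.1f}s")
print(result["text"][:120])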

whisper patch

diff --git a/whisper/transcribe.py b/whisper/transcribe.py
index ed6d820..0d9e3c8 100644
--- a/whisper/transcribe.py
+++ b/whisper/transcribe.py
@@ -429,8 +429,13 @@ def cli():
         torch.set_num_threads(threads)

     from . import load_model
+    import intel_extension_for_pytorch as ipex

     model = load_model(model_name, device=device, download_root=model_dir)
+    model.eval()
+    model = model.to(device)
+    if device == 'xpu':
+        ipex.optimize(model)

     writer = get_writer(output_format, output_dir)
     for audio_path in args.pop("audio"):

python modules

openai-whisper              20230314
intel-extension-for-pytorch 1.13.120+git5fdf9e6
torch                       1.13.0a0+git49444c3
torchaudio                  0.13.1+b90d798
torchvision                 0.14.1a0+5e8e2f1
> sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.05.25593.18]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25593]

apt packages

intel-i915-dkms 1.23.3.19.230122.18.5.17.0.1020+i38-1
intel-dpcpp-cpp-compiler-2023.1.0 2023.1.0-46347
intel-oneapi-mkl-2023.1.0         2023.1.0-46342
intel-oneapi-mkl-devel-2023.1.0   2023.1.0-46342
kernel 5.17.0-1020-oem

leuc commented 1 year ago

The above warnings go away when ipex.optimize(model) is omitted.

I found a metric to display GPU memory usage using lsgpu.

normal usage

> lsgpu -p | grep ^lmem_
lmem_avail_bytes                : 16260284416
lmem_total_bytes                : 17079205888

openai-whisper large model loaded

lmem_avail_bytes                : 4605845504
lmem_total_bytes                : 17079205888
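A related view is available from inside Python through the torch.xpu memory statistics that the IPEX XPU build registers; these report the PyTorch allocator's usage rather than total device memory (a minimal sketch; exact availability of these calls may vary between versions):

import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

# Allocator statistics for the current XPU device, in bytes.
print("allocated:", torch.xpu.memory_allocated())
print("reserved: ", torch.xpu.memory_reserved())
print("total:    ", torch.xpu.get_device_properties(0).total_memory)
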
leuc commented 1 year ago

It took hours to build, so I uploaded unofficial wheels of xpu-master here: https://github.com/leuc/intel-extension-for-pytorch/releases/tag/v1.13.120%2Bgit5fdf9e6

fredlarochelle commented 1 year ago

@leuc How much RAM does your computer have? It builds in around 20-25 minutes on my workstation, using slightly under 20 GB of memory. However, when attempting to build with a GitHub Actions workflow I made (per GitHub docs, the VM has 7 GB of memory) or on a self-hosted runner on a laptop with 8 GB of RAM, I couldn't even get a build to finish.

@jingxu10 Having something akin to a nightly beta build from Intel would be really useful here.

leuc commented 1 year ago

@fredlarochelle It wasn't a resource issue; the script just doesn't build well without conda. I may work on a PR for better portability, aimed at CI/CD and containers.

fredlarochelle commented 1 year ago

@leuc Yeah, I know about conda and the GCC 11 requirement; however, I had no luck with GCC 11 (not consistent at all) and got it working much better with GCC 9. We should probably also look into the compiler flags being used.

jingxu10 commented 1 year ago

What are the error messages? I would recommend doing the compilation in a Docker container.

leuc commented 1 year ago

> What are the error messages? I would recommend doing the compilation in a Docker container.

I addressed some build issues in PR https://github.com/intel/intel-extension-for-pytorch/pull/334

turbobuilt commented 1 year ago

I'm using a tiny test network that is just one linear layer. Using the updated build I still get:

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:484: UserWarning: Split Master Weight feature is not supported on XPU for now, disabled.
  warnings.warn("Split Master Weight feature is not supported on XPU for now, disabled.")
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:500: UserWarning: Weight Prepack and Sample Input are both disabled on XPU. The Onednn Layout is automatically applied.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:506: UserWarning: For XPU, the optimize_lstm(replace lstm with ipex_lstm) is unsupported, so disable it

I don't know how this is possible because there's no LSTM at all!


import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import math
import os
import glob
import random
import librosa
import soundfile as sf
import numpy as np

import intel_extension_for_pytorch as ipex

default_device = torch.device("xpu")

class DummyLayer(nn.Module):
    def __init__(self):
        super(DummyLayer, self).__init__()
        self.layer = nn.Linear(1, 1)

    def forward(self, src):
        src = src.unsqueeze(-1)
        src = self.layer(src)
        src = src.squeeze(-1)
        return src

model = DummyLayer()
model.to(default_device)
criterion = nn.MSELoss()
lr_factor = 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16, inplace=True)

target_sample_rate=8000
def load_file(path):
    data, sample_rate = librosa.load(path, sr=target_sample_rate)
    data = torch.from_numpy(data)
    data = data.unsqueeze(0)
    data = torch.mean(data.to(default_device), dim=0).unsqueeze(0)

    return data

train = load_file("testrecording_8k.wav")
target = load_file("testrecording_target_8k.wav")

# Training loop
num_epochs = 150000
for epoch in range(num_epochs):

    print("running")

    # batch = batch.to(memory_format=torch.channels_last)
    # target = target.to(memory_format=torch.channels_last)
    train = train.bfloat16()
    target = target.bfloat16()

    optimizer.zero_grad()
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(train)

    loss = criterion(output, target)

    print(f'Epoch: {epoch+1}/{num_epochs}, Step: {epoch+1}, Loss: {loss.item()}')

    print("output", output.cpu())
    print("target", target.cpu())
    loss.backward()
    optimizer.step()

    print(f'Epoch: {epoch+1}/{num_epochs}, Step: {epoch+1}, Loss: {loss.item()}')

    # every few steps save the output
    if (epoch+1) % 50 == 0:
        # Save the output to file
        output = torch.flatten(output, start_dim=0)

        print(output.size())
        sf.write("samples2/testrecording_8k_progress2_" + str(epoch) + ".wav", output.float().cpu().detach().numpy(), target_sample_rate)

gujinghui commented 1 year ago

@zejun-chen Is this a known issue we already fixed?

zejun-chen commented 1 year ago

> I'm using a tiny test network that is just one linear layer. Using the updated build I still get: […]

Hi @turbobuilt, thank you for using IPEX. The warning messages are emitted by model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16, inplace=True). This interface contains most of the IPEX optimizations for a model. It has an argument named level, which defaults to "O1". With "O1", most optimizations are attempted even if the model has no such layers. On XPU, some optimizations are disabled (on CPU they are enabled), for example split master weight (we will support it soon), weight prepack, and optimize_lstm, so you see warnings reporting that these optimizations are disabled for XPU.
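If you prefer to skip those optimizations entirely (and the warnings with them), the level argument can be set to "O0"; a minimal sketch on a toy model (whether skipping them is desirable depends on your workload):

import torch
from torch import nn
import intel_extension_for_pytorch as ipex

model = nn.Linear(1, 1).to("xpu")
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# level="O1" (the default) tries every applicable optimization and warns when one
# is unsupported on XPU; level="O0" applies none of them, so those warnings should not appear.
model, optimizer = ipex.optimize(
    model, optimizer=optimizer, dtype=torch.bfloat16, inplace=True, level="O0"
)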

@gujinghui This is caused by our warning messages from ipex.optimize.