intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platforms
Apache License 2.0

PyTorch inference time issue #348

Open · rubende opened this issue 1 year ago

rubende commented 1 year ago

Hello,

First of all, I would like to clarify that I am opening this issue following the recommendation received in this other oneDNN issue.

As I said in that issue, I am seeing a big difference in inference times between the same model defined directly with oneDNN and the model defined in PyTorch and prepared for CPU with the intel-extension-for-pytorch library.

For now, here is the information I can provide:

The procedure I am following with the model in Python to prepare it for CPU is this:

model = model.to("cpu")
torch.jit.enable_onednn_fusion(True)
model.eval()
model = ipex.optimize(model, dtype=torch.float32)
model = torch.jit.trace(model, torch.rand(shape))
model = torch.jit.freeze(model)
torch.jit.save(model, path)

While the CPU-ready model has an inference time of about 300 microseconds, the same model defined directly in oneDNN takes only 20 microseconds.

Is this difference normal? If not, what could be the cause? Is there any way to check if my model is using the "oneDNN Graph API" library?

Thank you.

chunyuan-w commented 1 year ago

@jingxu10 to help take a look at the provided example code.

jingxu10 commented 1 year ago

Do we need the model to reproduce the issue?

rubende commented 1 year ago

@jingxu10 From the tests I have done, I don't think it is a model-dependent problem, but here is a complete Python snippet with the model and its conversion for CPU:

model = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=0, bias=False),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=0, bias=False),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 16, kernel_size=2, stride=1, padding=0, bias=False),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 8, kernel_size=2, stride=1, padding=0, bias=False),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128, 20),
            nn.ReLU(),
            nn.Linear(20, 10),
            nn.ReLU(),
            nn.Linear(10, 5),
            nn.ReLU(),
            nn.Linear(5, 1),
            nn.Sigmoid()
        )

model = model.to("cpu")
torch.jit.enable_onednn_fusion(True)
model.eval()
model = ipex.optimize(model, dtype=torch.float32)
model = torch.jit.trace(model, torch.rand(shape))
model = torch.jit.freeze(model)
torch.jit.save(model, path)

Comparing with this tutorial, I suspect that something is going on with the oneDNN Graph API library, because I don't see it with the ldd command:

ldd pytorch_c
        linux-vdso.so.1 (0x00007ffc1dfb7000)
        libc10.so => /home/rdelgado/workspace/libtorch/lib/libc10.so (0x00007f26bad30000)
        libtorch_cpu.so => /home/rdelgado/workspace/libtorch/lib/libtorch_cpu.so (0x00007f26a355b000)
        libtorch.so => /home/rdelgado/workspace/libtorch/lib/libtorch.so (0x00007f26a3558000)
        libintel-ext-pt-cpu.so => /home/rdelgado/workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007f269c369000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f269c135000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f269bfe4000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f269bfc9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f269bdd7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f26bae40000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f269bdcd000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f269bdc7000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f269bda4000)
        libgomp-52f2fd74.so.1 => /home/rdelgado/workspace/libtorch/lib/libgomp-52f2fd74.so.1 (0x00007f269bb6f000)
        libz.so.1 => /usr/local/lib/libz.so.1 (0x00007f269bb4e000)

sanchitintel commented 1 year ago

torch.jit.enable_onednn_fusion(True) is a PyTorch API. Please don't mix it with ipex.optimize. Also, please measure performance only after the first 3 iterations (which you can think of as warmup rounds). Thanks
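To illustrate, the two paths would look something like this (a minimal sketch; model and example_input here stand in for your own module and sample input):

import torch
import intel_extension_for_pytorch as ipex

# Path A: IPEX-optimized TorchScript (do not call enable_onednn_fusion here)
model_a = ipex.optimize(model.eval(), dtype=torch.float32)
with torch.no_grad():
    model_a = torch.jit.freeze(torch.jit.trace(model_a, example_input))

# Path B: stock PyTorch with the oneDNN Graph fusion pass (no ipex.optimize)
torch.jit.enable_onednn_fusion(True)
with torch.no_grad():
    model_b = torch.jit.freeze(torch.jit.trace(model.eval(), example_input))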

The reason that you don't see oneDNN Graph with ldd is that it's linked statically, not dynamically.
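Since it is linked statically, you can check from Python instead of ldd. A minimal sketch (torch.jit.onednn_fusion_enabled and last_executed_optimized_graph exist in recent PyTorch versions; check your version):

import torch

# Confirm the oneDNN Graph fusion pass is switched on
torch.jit.enable_onednn_fusion(True)
print(torch.jit.onednn_fusion_enabled())  # should print True

# After a few warmup runs of the traced/frozen model, fused regions show up
# as fusion-group nodes in the last executed optimized graph:
# print(torch.jit.last_executed_optimized_graph())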

chunyuan-w commented 1 year ago

In addition, you'll need to add with torch.no_grad(): if it is an inference scenario.

rubende commented 1 year ago

Hello,

I modified the code:

model = model.to("cpu")
model.eval()
with torch.no_grad():
    model = ipex.optimize(model, dtype=torch.float32)
    model = torch.jit.trace(model, torch.rand(shape))
    model = torch.jit.freeze(model)

with torch.no_grad():
    for i in range(0, 100):
        outputs = model.forward(test_numeric)    # warmup
    start = time.time()
    for i in range(0, 1000):
        outputs = model.forward(test_numeric)
    end = time.time()

The inference time of this code is still higher than that of the full design in oneDNN, about 3 times slower with exactly the same architecture.

EDIT: Just to clarify, I don't see any difference in inference time with and without the line torch.jit.enable_onednn_fusion(True).

Any other suggestions?

Thank you.

chunyuan-w commented 1 year ago

Could you please provide the information below so that we can reproduce and look into this issue?

- the CPU model
- the configuration: like how many CPU cores are used when measuring the performance?
- the shape and test_numeric in the above script
- the way you run that of the full design in oneDNN

For the question regarding torch.jit.enable_onednn_fusion(True), to double-confirm that oneDNN Graph has been turned on, could you please try setting the environment variable DNNL_GRAPH_VERBOSE=1 and check if you have logs starting with onednn_graph_verbose similar to the one below:

onednn_graph_verbose,exec,cpu,100154,convolution_post_ops,aten::_convolution;aten::relu,data:NCX; filter:OIX;;,in0_f32:0:strided:variable:1x32x28x28:25088s1s896s32 in1_f32:1:strided:constant:32x32x3x3:288s9s3s1 out0_f32:9:strided:variable:1x32x28x28:25088s784s28s1,fpm:strict,dnnl_backend,0.136963
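If it is easier, the variable can also be set from the script itself; it just has to be set before torch (and with it oneDNN) is loaded. A minimal sketch:

import os

# Set before importing torch so oneDNN picks the variable up at load time
os.environ["DNNL_GRAPH_VERBOSE"] = "1"

import torch  # imported after setting the environment variable on purpose
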

rubende commented 1 year ago

the CPU model

Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz. 64 cores.

the configuration: like how many CPU cores are used when measuring the performance?

I only see one core working; I am not explicitly configuring anything either. The same applies to the full oneDNN model.

the shape and test_numeric in the above script

shape = [1, 3, 56, 56]
test_numeric = torch.rand(size=shape)

the way you run that of the full design in oneDNN

We just define the network with the oneDNN primitives and compile it with g++-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, using the flags -march=skylake -O3.

About onednn_graph_verbose, I get output from oneDNN with DNNL_VERBOSE=1, but I don't get any output from onednn_graph with the flag DNNL_GRAPH_VERBOSE=1.
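Regarding the single-core observation above, the PyTorch-side thread count can also be checked and pinned from the script itself; a minimal sketch using the standard torch threading APIs:

import torch

# How many intra-op threads PyTorch is currently using
print(torch.get_num_threads())

# Pin to a single thread so the comparison with the single-threaded
# oneDNN binary is like-for-like
torch.set_num_threads(1)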

chunyuan-w commented 1 year ago

@Xia-Weiwen to follow up on this.

Xia-Weiwen commented 1 year ago

Hi @rubende Could you share your C++ code with oneDNN and the way you build it? The more details the better. Another question: how did you measure the inference time? Thanks!

rubende commented 1 year ago

Unfortunately, I am not allowed to share the oneDNN code at this moment.

One thing I just noticed is that loading weights is having a major impact on the performance of all the models I am testing. That is, let's assume the following code:

model = nn.Sequential(
            .....
        )
checkpoint = torch.load(path)
model.load_state_dict(checkpoint)
model = model.to("cpu")
model.eval()

with torch.no_grad():
    model = ipex.optimize(model, dtype=torch.float32)
    model = torch.jit.trace(model, torch.rand(shape))
    model = torch.jit.freeze(model)

with torch.no_grad():
    for i in range(0, 100):
        outputs = model.forward(test_numeric)    # warmup
    start = time.time()
    for i in range(0, 1000):
        outputs = model.forward(test_numeric)
    end = time.time()

If I remove model.load_state_dict(checkpoint), the inference time is reduced by a factor of about 10 times.

Xia-Weiwen commented 1 year ago

Sorry but I don't understand why model.load_state_dict would have an impact. It is already done before you run inference, right?

Xia-Weiwen commented 1 year ago

Hi @rubende I tried with the code below and did not see much difference in latency whether I save/load or not:

import torch
from torch import nn
import intel_extension_for_pytorch as ipex
import time

model = nn.Sequential(
            nn.Linear(16, 128),
            nn.ReLU(),
            nn.Linear(128, 512),
            nn.ReLU(),
            nn.Linear(512, 2048),
            nn.Flatten()
        )
for save_load in [True, False]:
    if save_load:
        path = 'ipex_infer_overhead.pt'
        torch.save(model.state_dict(), path)
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint)
    model = model.to("cpu")
    model.eval()

    with torch.no_grad():
        model = ipex.optimize(model, dtype=torch.float32)
        model = torch.jit.trace(model, torch.rand(1, 16))
        model = torch.jit.freeze(model)

    x = torch.randn(1, 16)
    with torch.no_grad():
        for i in range(10):
            outputs = model(x)    # warmup
        start = time.time()
        for i in range(100):
            outputs = model(x)
        end = time.time()

    print('save_load =', save_load, ', Time =', end - start, 'sec')

I ran it with python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py. Could you give it a try? If you have any updates, please let us know. Thanks!

rubende commented 1 year ago

Hi @Xia-Weiwen. If I use your code I don't see any noticeable difference either. On the other hand, if I swap in my architecture and use my trained weights file, I do see a big difference. Note that the difference only occurs if these weights are actually the result of training:

import torch
from torch import nn
import intel_extension_for_pytorch as ipex
import torch.optim as optim
import time

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(256, 64, bias=True),
            nn.Conv1d(256, 32, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(32, 64, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(64, 128, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 256, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(256, 512, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(512, 1),
            nn.Sigmoid()
        )

        for param in self.model.parameters():
            param.grad = None

        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-2)
        self.criterion = nn.BCELoss()

    def forward(self, x):
        return self.model(x)

    def fit(self, epochs, data, labels):
        self.model.train()
        for epoch in range(epochs):
            pred = self.forward(data)
            loss = self.criterion(pred, labels)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

for save_load in [[False, False], [True, False], [True, True]]:
    original_model = Model()
    if save_load[0]:
        random_data = torch.rand((256, 256, 256))
        random_labels = torch.randint(0, 2, (256, 1)).float()
        if save_load[1]:
            path = 'model-final_weights'
        else:
            path = 'ipex_infer_overhead.pt'
            torch.save(original_model.state_dict(), path)
        checkpoint = torch.load(path)
        original_model.load_state_dict(checkpoint)
    original_model = original_model.to("cpu")
    original_model.eval()

    with torch.no_grad():
        model = ipex.optimize(original_model, dtype=torch.float32)
        model = torch.jit.trace(model, torch.rand(1, 256, 256))
        model = torch.jit.freeze(model)

    x = torch.randn(1, 256, 256)
    with torch.no_grad():
        for i in range(10):
            outputs = model(x)    # warmup
        start = time.time()
        for i in range(1000):
            outputs = model(x)
        end = time.time()

    print('save_load =', save_load[0], 'train =', save_load[1], ', Time =', end - start, 'sec')

If I run python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py I see this:

python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py
/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'intel_extension_for_pytorch.cpu.launch' found in sys.modules after import of package 'intel_extension_for_pytorch.cpu', but prior to execution of 'intel_extension_for_pytorch.cpu.launch'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
2023-05-22 10:31:25,269 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/rdelgado/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
2023-05-22 10:31:25,270 - __main__ - INFO - OMP_NUM_THREADS=4
2023-05-22 10:31:25,270 - __main__ - WARNING - Unable to find the iomp library file libiomp5.so in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/rdelgado/.local/lib/ so the LD_PRELOAD environment variable will not be set.you can use 'conda install intel-openm' to install intel openMP
2023-05-22 10:31:25,270 - __main__ - INFO - numactl -C 0-3 -m 0 /opt/workdir/rdelgado/venv_pytorch/bin/python -u test.py
save_load = False train = False , Time = 0.4497230052947998 sec
save_load = True train = False , Time = 0.4350278377532959 sec
save_load = True train = True , Time = 3.88032603263855 sec

EDIT: Solved with torch.set_flush_denormal(True).

save_load = False train = False , Time = 0.5071511268615723 sec
save_load = True train = False , Time = 0.5011007785797119 sec
save_load = True train = True , Time = 0.48726487159729004 sec
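
For context: trained weights, especially with weight decay pushing values toward zero, can end up containing denormal (subnormal) float32 values, which are very slow on x86; torch.set_flush_denormal(True) flushes them to zero. A hedged sketch to count them in a checkpoint (count_denormals is just an illustrative helper, not part of any API):

import torch

def count_denormals(state_dict):
    # Count nonzero floats smaller in magnitude than the smallest normal float32
    tiny = torch.finfo(torch.float32).tiny
    total = 0
    for tensor in state_dict.values():
        if tensor.is_floating_point():
            total += ((tensor != 0) & (tensor.abs() < tiny)).sum().item()
    return total

# checkpoint = torch.load('model-final_weights')
# print(count_denormals(checkpoint))
torch.set_flush_denormal(True)  # flush denormals to zero on supported CPUs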

Even so, I still see a difference between this model and the oneDNN model. I'm going to do some tests and see if I can give you more information.

Xia-Weiwen commented 1 year ago

Hi @rubende Where does your 'model-final_weights' come from? Could you provide a simple code snippet to generate that file? I used your code, modified it a little as below, and did not reproduce the issue.

    original_model = Model()
    if save_load[0]:
        random_data = torch.rand((256, 256, 256))
        random_labels = torch.randint(0, 2, (256, 1)).float()
        if save_load[1]:
            original_model.fit(10, random_data, random_labels) # added by me
            path = 'ipex_infer_overhead-model-final_weights'
            torch.save(original_model.state_dict(), path) # added by me
        else:
            path = 'ipex_infer_overhead.pt'
            torch.save(original_model.state_dict(), path)
        checkpoint = torch.load(path)
        original_model.load_state_dict(checkpoint)
    original_model = original_model.to("cpu")
    original_model.eval()

Thanks!