Open rubende opened 1 year ago
@jingxu10 to help take a look at the provided example code.
Do we need the model to reproduce the issue?
@jingxu10 From the tests I have done, I don't think the problem is model-dependent, but here is the complete Python code with the model and its conversion for CPU:
import torch
from torch import nn
import intel_extension_for_pytorch as ipex

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=0, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=0, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 16, kernel_size=2, stride=1, padding=0, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 8, kernel_size=2, stride=1, padding=0, bias=False),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.ReLU(),
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
    nn.Sigmoid()
)
model = model.to("cpu")
torch.jit.enable_onednn_fusion(True)
model.eval()
model = ipex.optimize(model, dtype=torch.float32)
model = torch.jit.trace(model, torch.rand(shape))  # shape: the input shape
model = torch.jit.freeze(model)
torch.jit.save(model, path)  # path: the output file for the exported model
Compared to this tutorial, I suspect that something is going on with the oneDNN Graph API library, because I don't see it in the output of the ldd command:
ldd pytorch_c
linux-vdso.so.1 (0x00007ffc1dfb7000)
libc10.so => /home/rdelgado/workspace/libtorch/lib/libc10.so (0x00007f26bad30000)
libtorch_cpu.so => /home/rdelgado/workspace/libtorch/lib/libtorch_cpu.so (0x00007f26a355b000)
libtorch.so => /home/rdelgado/workspace/libtorch/lib/libtorch.so (0x00007f26a3558000)
libintel-ext-pt-cpu.so => /home/rdelgado/workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007f269c369000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f269c135000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f269bfe4000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f269bfc9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f269bdd7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f26bae40000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f269bdcd000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f269bdc7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f269bda4000)
libgomp-52f2fd74.so.1 => /home/rdelgado/workspace/libtorch/lib/libgomp-52f2fd74.so.1 (0x00007f269bb6f000)
libz.so.1 => /usr/local/lib/libz.so.1 (0x00007f269bb4e000)
torch.jit.enable_onednn_fusion(True) is a PyTorch API. Please don't mix it with ipex.optimize.
Also, please measure performance only after the first 3 iterations (which you can think of as warm-up rounds). Thanks
The reason that you don't see oneDNN Graph with ldd is that it's linked statically, not dynamically.
In addition, you'll need to add "with torch.no_grad():" if it is an inference scenario.
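The warm-up advice above can be sketched as a small timing helper. This is only an illustration, not code from the thread: the function name benchmark, its parameters, and the stand-in workload are all hypothetical. It skips a few warm-up calls and then reports per-call latency in microseconds, which makes numbers easier to compare than a single total.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=1000):
    """Call fn() `warmup` times untimed, then time `iters` calls one by one."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the statistics
    samples_us = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - start) * 1e6)
    return {
        "mean_us": statistics.mean(samples_us),
        "median_us": statistics.median(samples_us),
        "min_us": min(samples_us),
    }


if __name__ == "__main__":
    # Stand-in workload (an assumption for this sketch). In the setting of
    # this thread it would be something like `lambda: model(test_numeric)`,
    # called inside `with torch.no_grad():`.
    stats = benchmark(lambda: sum(range(1000)), warmup=3, iters=200)
    print(stats)
```

The median is usually less sensitive than the mean to occasional slow iterations, so it is worth looking at both.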
Hello,
I modified the code:
model = model.to("cpu")
model.eval()
with torch.no_grad():
    model = ipex.optimize(model, dtype=torch.float32)
    model = torch.jit.trace(model, torch.rand(shape))
    model = torch.jit.freeze(model)
with torch.no_grad():
    for i in range(0, 100):
        outputs = model.forward(test_numeric)  # warmup
    start = time.time()
    for i in range(0, 1000):
        outputs = model.forward(test_numeric)
    end = time.time()
The inference time of this code is still higher than that of the full design in oneDNN, about 3 times slower with exactly the same architecture.
EDIT: Just to clarify, I don't see any difference in inference time with and without the line torch.jit.enable_onednn_fusion(True).
Any other suggestions?
Thank you.
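Could you please provide the information below so that we can reproduce and look into this issue?
the CPU model
the configuration: like how many CPU cores are used when measuring the performance?
the shape and test_numeric in the above script
the way you run "that of the full design in oneDNN"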
For the question regarding torch.jit.enable_onednn_fusion(True), to double-confirm that oneDNN Graph has been turned on, could you please try setting the environment variable DNNL_GRAPH_VERBOSE=1 and check whether you get logs starting with onednn_graph_verbose, similar to the one below:
onednn_graph_verbose,exec,cpu,100154,convolution_post_ops,aten::_convolution;aten::relu,data:NCX; filter:OIX;;,in0_f32:0:strided:variable:1x32x28x28:25088s1s896s32 in1_f32:1:strided:constant:32x32x3x3:288s9s3s1 out0_f32:9:strided:variable:1x32x28x28:25088s784s28s1,fpm:strict,dnnl_backend,0.136963
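The check described above can be sketched as a small script that runs a benchmark with DNNL_GRAPH_VERBOSE=1 and counts the log lines that start with onednn_graph_verbose. The helper names and the way the script path is passed are hypothetical; only the environment variable and the log prefix come from the comment above.

```python
import os
import subprocess
import sys

PREFIX = "onednn_graph_verbose"  # log prefix quoted in the comment above


def count_graph_verbose_lines(output):
    """Count the lines of `output` that start with the oneDNN Graph prefix."""
    return sum(1 for line in output.splitlines() if line.startswith(PREFIX))


def run_with_graph_verbose(script_path):
    """Run a Python script with DNNL_GRAPH_VERBOSE=1 and count matching lines."""
    env = dict(os.environ, DNNL_GRAPH_VERBOSE="1")
    result = subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
    # Check both streams, since this sketch does not assume where logs go.
    return count_graph_verbose_lines(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        n = run_with_graph_verbose(sys.argv[1])
        print(f"{n} line(s) starting with {PREFIX}")
    else:
        print("usage: pass the path of the script to run")
```

A count of zero would suggest that no oneDNN Graph partitions were executed in that run.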
the CPU model
Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz. 64 cores.
the configuration: like how many CPU cores are used when measuring the performance?
I only see one core working, and I am not explicitly configuring anything. The same is true for the full oneDNN model.
the shape and test_numeric in the above script
shape = [1, 3, 56, 56]
test_numeric = torch.rand(size=shape)
the way you run "that of the full design in oneDNN"
We just define the network with the primitives of oneDNN and compile it with g++-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, with the flags -march=skylake -O3.
About onednn_graph_verbose, I get output from oneDNN with DNNL_VERBOSE=1, but I don't get any output from onednn_graph with the flag DNNL_GRAPH_VERBOSE=1.
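Regarding the single working core mentioned above, the following sketch prints the settings that commonly limit how many cores a process may use. It is an illustration only: the function name cpu_environment is hypothetical, and the affinity query is available on Linux only.

```python
import os


def cpu_environment():
    """Collect settings that commonly decide how many cores a process uses."""
    affinity = None
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        affinity = sorted(os.sched_getaffinity(0))
    return {
        "logical_cpus": os.cpu_count(),
        "affinity": affinity,
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }


if __name__ == "__main__":
    for key, value in cpu_environment().items():
        print(f"{key} = {value}")
    # Inside a PyTorch script, torch.get_num_threads() additionally reports
    # the number of intra-op threads PyTorch will use.
```

Running it both directly and under the launcher shows whether the process is pinned to a subset of cores.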
@Xia-Weiwen to follow up on this.
Hi @rubende Could you share your C++ code with oneDNN and the way you build it? The more details the better. Also, how did you measure the inference time? Thanks!
Unfortunately, I am not allowed to share the oneDNN code at this moment.
One thing I just noticed is that loading weights is having a major impact on the performance of all the models I am testing. That is, let's assume the following code:
model = nn.Sequential(
    .....
)
checkpoint = torch.load(path)
model.load_state_dict(checkpoint)
model = model.to("cpu")
model.eval()
with torch.no_grad():
    model = ipex.optimize(model, dtype=torch.float32)
    model = torch.jit.trace(model, torch.rand(shape))
    model = torch.jit.freeze(model)
with torch.no_grad():
    for i in range(0, 100):
        outputs = model.forward(test_numeric)  # warmup
    start = time.time()
    for i in range(0, 1000):
        outputs = model.forward(test_numeric)
    end = time.time()
If I remove model.load_state_dict(checkpoint), the inference time is reduced by a factor of about 10.
Sorry, but I don't understand why model.load_state_dict would have an impact. It is already done before you run inference, right?
Hi @rubende I tried with the code below and did not see much difference in latency if I save/load or not:
import torch
from torch import nn
import intel_extension_for_pytorch as ipex
import time

model = nn.Sequential(
    nn.Linear(16, 128),
    nn.ReLU(),
    nn.Linear(128, 512),
    nn.ReLU(),
    nn.Linear(512, 2048),
    nn.Flatten()
)

for save_load in [True, False]:
    if save_load:
        path = 'ipex_infer_overhead.pt'
        torch.save(model.state_dict(), path)
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint)
    model = model.to("cpu")
    model.eval()
    with torch.no_grad():
        model = ipex.optimize(model, dtype=torch.float32)
        model = torch.jit.trace(model, torch.rand(1, 16))
        model = torch.jit.freeze(model)
    x = torch.randn(1, 16)
    with torch.no_grad():
        for i in range(10):
            outputs = model(x)  # warmup
        start = time.time()
        for i in range(100):
            outputs = model(x)
        end = time.time()
    print('save_load =', save_load, ', Time =', end - start, 'sec')
I ran it with python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py.
Could you have a try? If you have any updates, please let us know. Thanks!
Hi @Xia-Weiwen. If I use your code I don't see any noticeable difference either. On the other hand, if I change your architecture for mine and use my trained weights file, I do see a big difference. Note that the difference occurs if these weights are actually the result of training:
import torch
from torch import nn
import intel_extension_for_pytorch as ipex
import torch.optim as optim
import time


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(256, 64, bias=True),
            nn.Conv1d(256, 32, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(32, 64, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(64, 128, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 256, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(256, 512, kernel_size=2, stride=1, padding=0, bias=True),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(512, 1),
            nn.Sigmoid()
        )
        for param in self.model.parameters():
            param.grad = None
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-2)
        self.criterion = nn.BCELoss()

    def forward(self, x):
        return self.model(x)

    def fit(self, epochs, data, labels):
        self.model.train()
        for epoch in range(epochs):
            pred = self.forward(data)
            loss = self.criterion(pred, labels)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()


for save_load in [[False, False], [True, False], [True, True]]:
    original_model = Model()
    if save_load[0]:
        random_data = torch.rand((256, 256, 256))
        random_labels = torch.randint(0, 2, (256, 1)).float()
        if save_load[1]:
            path = 'model-final_weights'
        else:
            path = 'ipex_infer_overhead.pt'
            torch.save(original_model.state_dict(), path)
        checkpoint = torch.load(path)
        original_model.load_state_dict(checkpoint)
    original_model = original_model.to("cpu")
    original_model.eval()
    with torch.no_grad():
        model = ipex.optimize(original_model, dtype=torch.float32)
        model = torch.jit.trace(model, torch.rand(1, 256, 256))
        model = torch.jit.freeze(model)
    x = torch.randn(1, 256, 256)
    with torch.no_grad():
        for i in range(10):
            outputs = model(x)  # warmup
        start = time.time()
        for i in range(1000):
            outputs = model(x)
        end = time.time()
    print('save_load =', save_load[0], 'train =', save_load[1], ', Time =', end - start, 'sec')
If I run python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py, I see this:
python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 4 test.py
/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'intel_extension_for_pytorch.cpu.launch' found in sys.modules after import of package 'intel_extension_for_pytorch.cpu', but prior to execution of 'intel_extension_for_pytorch.cpu.launch'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2023-05-22 10:31:25,269 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/rdelgado/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
2023-05-22 10:31:25,270 - __main__ - INFO - OMP_NUM_THREADS=4
2023-05-22 10:31:25,270 - __main__ - WARNING - Unable to find the iomp library file libiomp5.so in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/rdelgado/.local/lib/ so the LD_PRELOAD environment variable will not be set.you can use 'conda install intel-openm' to install intel openMP
2023-05-22 10:31:25,270 - __main__ - INFO - numactl -C 0-3 -m 0 /opt/workdir/rdelgado/venv_pytorch/bin/python -u test.py
save_load = False train = False , Time = 0.4497230052947998 sec
save_load = True train = False , Time = 0.4350278377532959 sec
save_load = True train = True , Time = 3.88032603263855 sec
EDIT: Solved with torch.set_flush_denormal(True).
save_load = False train = False , Time = 0.5071511268615723 sec
save_load = True train = False , Time = 0.5011007785797119 sec
save_load = True train = True , Time = 0.48726487159729004 sec
Even so, I still see a difference between this model and the oneDNN model. I'm going to do some tests and see if I can give you more information.
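Since torch.set_flush_denormal(True) removed the slowdown, one plausible explanation (an assumption, not something confirmed in this thread) is that the trained weights contain denormal (subnormal) float32 values, which many CPUs process much more slowly than normal values. The sketch below counts such values in a list of numbers; the helper names are hypothetical, and in the setting of this thread the input could be something like tensor.flatten().tolist() for each entry of the loaded state dict.

```python
import struct

FLOAT32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32 value


def to_float32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def count_denormals(values):
    """Count values that are nonzero but below the float32 normal range."""
    count = 0
    for value in values:
        magnitude = abs(to_float32(value))
        if 0.0 < magnitude < FLOAT32_MIN_NORMAL:
            count += 1
    return count


if __name__ == "__main__":
    # Made-up values for illustration: two of them are float32 denormals.
    weights = [0.0, 0.25, 1e-40, -3e-42, 1.5e-38]
    print("denormal values:", count_denormals(weights))
```

If the count for a trained checkpoint is large while it is zero for freshly initialized weights, that would be consistent with the timings reported above.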
Hi @rubende Where does your 'model-final_weights' come from? Could you provide a simple code snippet to generate that file? I used your code and modified it a little as below and did not reproduce the issue.
original_model = Model()
if save_load[0]:
    random_data = torch.rand((256, 256, 256))
    random_labels = torch.randint(0, 2, (256, 1)).float()
    if save_load[1]:
        original_model.fit(10, random_data, random_labels)  # added by me
        path = 'ipex_infer_overhead-model-final_weights'
        torch.save(original_model.state_dict(), path)  # added by me
    else:
        path = 'ipex_infer_overhead.pt'
        torch.save(original_model.state_dict(), path)
    checkpoint = torch.load(path)
    original_model.load_state_dict(checkpoint)
original_model = original_model.to("cpu")
original_model.eval()
Thanks!
Hello,
First of all, I would like to clarify that this issue was opened following the recommendation received in this other oneDNN issue.
As I said in that issue, I am seeing a large difference in inference time between the same model defined with oneDNN and defined in PyTorch and exported for CPU with the "intel-extension-for-pytorch" library.
For now, here is the information I can provide:
The procedure I am following with the model in Python to prepare it for CPU is this:
While the CPU-ready model has an inference time of about 300 microseconds, the same model defined directly in oneDNN takes only 20 microseconds.
Is this difference normal? If not, what could be the cause? Is there any way to check if my model is using the "oneDNN Graph API" library?
Thank you.