Some bugs in WebGL and WebGPU

Aixile commented 6 years ago

The code and the model for reproducing the problem can be found here. I am using webdnn at commit f403a30da36b6741bc857c21c3ca1e65af8fbac9.

For model conversion, please use python convert_webdnn.py --chainer_model_path SmoothedGenerator_40000.npz --out models/resnet256

Also, there is a web interface in webcode/webdnn.

When I try to convert to WebGL with 8-bit compression, I get:

Generator model loaded
Start Convert
Traceback (most recent call last):

File "convert_webdnn.py", line 44, in <module>
exec_info = generate_descriptor("webgl", graph)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/webdnn-1.2.3-py3.6.egg/webdnn/backend/interface/generator.py", line 107, in generate_descriptor
return generator(graph, **kwargs)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/webdnn-1.2.3-py3.6.egg/webdnn/backend/webgl/generator.py", line 92, in generate
return WebGLDescriptorGenerator.generate(graph, **kwargs)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/webdnn-1.2.3-py3.6.egg/webdnn/backend/webgl/generator.py", line 59, in generate
constants_bytes = constant_encoder.encode(memory_layout)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/webdnn-1.2.3-py3.6.egg/webdnn/encoder/constant_encoder_eightbit.py", line 66, in encode
all_code += self._single_encode(single_data, alloc)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/webdnn-1.2.3-py3.6.egg/webdnn/encoder/constant_encoder_eightbit.py", line 72, in _single_encode
maxval = np.max(np.abs(single_data))
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2272, in amax
out=out, **kwargs)
File "/Users/aixile/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
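
The final frame of the traceback can be reproduced in isolation: NumPy's max reduction has no identity, so it raises on an empty array. The following is a minimal sketch only. The helper name safe_abs_max and the choice to return 0.0 for empty input are assumptions made for illustration, not the fix used in webdnn.

```python
import numpy as np


def safe_abs_max(data: np.ndarray) -> float:
    """Return max(|data|), or 0.0 when the array is empty.

    Hypothetical helper: the 0.0 fallback is an assumption, not webdnn's behavior.
    """
    if data.size == 0:
        return 0.0
    return float(np.max(np.abs(data)))


empty = np.array([], dtype=np.float32)

# Reproduce the failure from the traceback on a zero-size array.
try:
    np.max(np.abs(empty))
except ValueError as err:
    print("reproduced:", err)

# The guarded version handles both empty and non-empty input.
print(safe_abs_max(empty))
print(safe_abs_max(np.array([-3.0, 2.0], dtype=np.float32)))
```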

WebGL without 8-bit compression can be successfully converted. However, it gives the wrong answer.

Expected:

Got:

The WebGPU model can be converted; however, it cannot be loaded by the browser.

Model loading failed for webgpu backend. Trying next backend: Range consisting of offset and length are out of bounds

Safari 11.0.3

This repo also contains a speed comparison with tensorflow.js. webdnn with WebGL is 1.5~2x faster than tfjs on my computer, except that it gives a wrong answer.

milhidaka commented 6 years ago

Sorry for the late reply. I will investigate it.

milhidaka commented 6 years ago

This bug also occurs in train_mnist_chainer.py with constant_encoder_name="eightbit". The offset of a (non-constant) variable in WebGL is "-1", and it causes an error in the encoder. Continuing to debug. https://github.com/mil-tokyo/webdnn/blob/e6ab747b13d8ed6f9da2e78385bce10812f2be28/src/graph_transpiler/webdnn/encoder/constant_encoder_eightbit.py#L66

milhidaka commented 6 years ago

It broke in commit 56113b24. From this commit on, calling generate_descriptor in train_mnist_chainer.py with constant_encoder_name="eightbit" raises an error in the WebGL backend.

milhidaka commented 6 years ago

There seem to be three different bugs! I solved one, and found a workaround for another.

Problems:

Weight packing on WebGL backend (solved)
Graph conversion error on WebGL (workaround)
Error on WebGPU (not yet)

Weight packing problem

This occurred with constant_encoder_name="eightbit". On WebGL, the size of a texture and the size of the original variable differ because a texture has to be rectangular. The texture size is calculated as height * width, and both must be integers. Therefore, the texture size is rounded up, which can make the texture size larger than the original size. However, this is not taken into account in constant_encoder_eightbit.py. Also, the classification of constants and variables was wrong.
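
The rounding described above can be illustrated with a small sketch. This is not webdnn's layout code: the function names, the row-filling layout, and the assumption of one value per texel are all hypothetical and chosen only to show why height * width can exceed the variable's size.

```python
import math
from typing import Tuple


def texture_shape(num_elements: int, max_width: int = 4096) -> Tuple[int, int]:
    """Return (height, width) of a rectangle that can hold num_elements values.

    Hypothetical layout: rows of at most max_width texels, one value per texel.
    Assumes num_elements > 0.
    """
    width = min(num_elements, max_width)
    height = math.ceil(num_elements / width)  # round up to a whole number of rows
    return height, width


def padded_size(num_elements: int, max_width: int = 4096) -> int:
    """Number of texels actually allocated, including padding in the last row."""
    height, width = texture_shape(num_elements, max_width)
    return height * width


original = 5000
print(texture_shape(original))                  # rectangle chosen for 5000 values
print(padded_size(original) - original)         # padding an encoder must account for
```

An encoder that iterates over the texture size rather than the original size would, under these assumptions, need to handle the padded tail explicitly.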

I put a temporary fix in the fix-816 branch (a686df1ec), so please try it to avoid this problem.

Graph conversion error on WebGL

There is a bug in the WebGL backend when transforming the computation graph for texture sizes 4096 and 8192. Their weight files (weight_webgl_4096.bin and weight_webgl_8192.bin) are unnaturally small.

$ ls -l models/resnet256
total 1206976
-rw-r--r--  1 hidaka  staff      37301  4 30 21:14 graph_webassembly.json
-rw-r--r--  1 hidaka  staff    3476587  4 30 21:14 graph_webgl_16384.json
-rw-r--r--  1 hidaka  staff    6124614  4 30 21:14 graph_webgl_4096.json
-rw-r--r--  1 hidaka  staff    4214513  4 30 21:14 graph_webgl_8192.json
-rw-r--r--  1 hidaka  staff     296498  4 30 21:02 graph_webgpu.json
-rw-r--r--  1 hidaka  wheel     106503  4 30 21:14 kernels_asmjs.js
-rw-r--r--  1 hidaka  staff       9748  4 30 21:14 kernels_asmjs.js.mem
-rw-r--r--  1 hidaka  staff      51407  4 30 21:14 kernels_webassembly.cpp
-rw-r--r--  1 hidaka  wheel      24125  4 30 21:14 kernels_webassembly.js
-rw-r--r--  1 hidaka  staff      56040  4 30 21:14 kernels_webassembly.wasm
-rw-r--r--  1 hidaka  staff      65574  4 30 21:02 kernels_webgpu.metal
-rw-r--r--  1 hidaka  staff  184662028  4 30 21:14 weight_webassembly.bin
-rw-r--r--  1 hidaka  staff  184662028  4 30 21:14 weight_webgl_16384.bin
-rw-r--r--  1 hidaka  staff   14792716  4 30 21:14 weight_webgl_4096.bin
-rw-r--r--  1 hidaka  staff   33667084  4 30 21:14 weight_webgl_8192.bin
-rw-r--r--  1 hidaka  staff  184662028  4 30 21:02 weight_webgpu.bin
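
A size check like the one done by eye above could be scripted. The sketch below is only an illustration: the function name and the 50% threshold are assumptions, and it presumes all backends use the same weight encoding so that their sizes are comparable. The sizes are copied from the listing above.

```python
from typing import Dict, List


def suspiciously_small(sizes: Dict[str, int], ratio: float = 0.5) -> List[str]:
    """Return names of weight files smaller than ratio * the largest file.

    Hypothetical heuristic: the 0.5 threshold is an arbitrary assumption.
    """
    largest = max(sizes.values())
    return sorted(name for name, size in sizes.items() if size < ratio * largest)


# Weight file sizes in bytes, taken from the ls -l output above.
weight_sizes = {
    "weight_webassembly.bin": 184662028,
    "weight_webgl_16384.bin": 184662028,
    "weight_webgl_4096.bin": 14792716,
    "weight_webgl_8192.bin": 33667084,
    "weight_webgpu.bin": 184662028,
}

for name in suspiciously_small(weight_sizes):
    print("unusually small:", name)
```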

I found that the graph descriptor for size 16384 works correctly. Currently, all devices load size 4096, so the workaround is:

cp weight_webgl_16384.bin weight_webgl_4096.bin
cp graph_webgl_16384.json graph_webgl_4096.json

Of course, it does not work on devices that do not support texture size 16384.

With these two workarounds, I managed to get the WebGL + 8-bit compression model to work on Chrome.

Kiikurage commented 6 years ago

I started to track these two problems in #820 and #821.

Kiikurage commented 6 years ago

@milhidaka I re-implemented your patch in e06f903, with some extra comments. Please review it.
