intel / webml-polyfill

Deprecated: the Web Neural Network Polyfill project has moved to https://github.com/webmachinelearning/webnn-polyfill
Apache License 2.0

[Windows/clDNN] The browser crashed when running Inception v4(TFlite) and Inception Resnet v2(TFlite) on Windows WebML #507

Closed: Christywl closed this issue 5 years ago.

Christywl commented 5 years ago

Test Env: Chromium Version: nightly build 70.0.3503.0 (8b58220) Platform: Windows 10 (Dell XPS 13) [CPU: Intel i5-8250U, GPU: Intel UHD Graphics 620 (driver: 25.20.100.6471), Memory: 8GB]

Expected Result: Inception v4(TFlite) and Inception Resnet v2(TFlite) work.

Actual Result: The browser crashed when running Inception v4(TFlite) and Inception Resnet v2(TFlite) on Windows WebML

cldnn

How to Reproduce:

  1. git clone https://github.com/intel/webml-polyfill
  2. npm i && npm run build
  3. Download the models
  4. npm start
  5. Visit http://127.0.0.1:8080/examples/image_classification/index.html
  6. Select Inception v4(TFlite) [or Inception Resnet v2(TFlite)] and SUSTAINED_SPEED.

Christywl commented 5 years ago

DenseNet(Onnx) has the same issue on Windows clDNN.

huningxin commented 5 years ago

@Christywl thanks for reporting this issue!

Is this a regression? According to https://github.com/intel/webml-polyfill/wiki/WebML-Examples-Results-on-Different-Backends-and-Platforms, all these models worked on 8755e6b. Is that correct?

Christywl commented 5 years ago

@huningxin, I retested and it is not a regression; it was my mistake in the previous testing. I'm now testing the examples on the newer build and code, and will update the table in the wiki later.

huningxin commented 5 years ago

That's fine. Thanks for the update. However, does it happen on Linux/clDNN?

Christywl commented 5 years ago

No, these examples work on Linux/clDNN.

huningxin commented 5 years ago

@Christywl, thanks! May I know whether the Linux and Windows machines under test have the same hardware configuration? I suspect the crash may be related to a memory limit, but I don't know whether it is Windows-specific. Could you please verify it on a Linux machine with the same amount of memory as the Windows one? Also, as far as I know, our Linux tests use the "--no-sandbox" option; could you please apply it on Windows for testing as well? Thanks!

Christywl commented 5 years ago

@huningxin, here are the Linux and Windows machine configurations:

I also tried on Windows with "--no-sandbox", and the issue still happened.

I don't have a Linux machine with 8GB of memory. However, I tried another pair of Windows and Linux machines with the same configuration [Dell Inspiron 13 7000 Series, CPU: i5-6200U, GPU: Intel HD Graphics 520, Memory: 4GB], and the crash only happened on Windows. The examples worked fine on Linux.

huningxin commented 5 years ago

That's very helpful. Thanks much @Christywl !

I plan to upgrade clDNN to the latest version (Drop 12.1), which includes memory-leak fixes that might be related to this issue. We can then verify whether this issue still happens with the latest clDNN version. I've opened https://github.com/intel/webml-polyfill/issues/527 to track it.

huningxin commented 5 years ago

The root cause might be that the long compilation time of these models triggers the GPU watchdog, which kills the GPU process.

@Christywl, could you please try again with "--disable-gpu-watchdog"? This workaround works in my environment.
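The workaround is only a command-line switch. As a minimal sketch, assuming Node.js and a local Chromium build at a hypothetical path (`CHROMIUM_PATH`), the helper below builds the argument list with `--disable-gpu-watchdog` and, optionally, the `--no-sandbox` switch tried earlier in this thread, then opens the image classification example. The helper names (`buildChromiumArgs`, `launchChromium`) are illustrative, not part of webml-polyfill; disabling the watchdog hides the crash rather than fixing it.

```typescript
import { spawn } from "node:child_process";

// Hypothetical path to a local Chromium build; adjust for your machine.
const CHROMIUM_PATH = "C:\\chromium\\chrome.exe";

// Example page served by `npm start` (step 5 of the reproduction steps).
const EXAMPLE_URL = "http://127.0.0.1:8080/examples/image_classification/index.html";

export interface LaunchOptions {
  disableGpuWatchdog: boolean; // workaround discussed in this issue
  noSandbox: boolean;          // switch used by the Linux test setup
}

// Build the Chromium command-line arguments for the given options.
export function buildChromiumArgs(url: string, opts: LaunchOptions): string[] {
  const args: string[] = [];
  if (opts.disableGpuWatchdog) args.push("--disable-gpu-watchdog");
  if (opts.noSandbox) args.push("--no-sandbox");
  args.push(url); // non-switch arguments are opened as URLs
  return args;
}

// Launch Chromium with the selected switches (not called automatically).
export function launchChromium(opts: LaunchOptions): void {
  const child = spawn(CHROMIUM_PATH, buildChromiumArgs(EXAMPLE_URL, opts), {
    stdio: "inherit",
  });
  child.on("exit", (code) => console.log(`Chromium exited with code ${code}`));
}
```

Keeping the argument builder separate from the launcher makes it easy to toggle one switch at a time, which is how the thread isolated `--no-sandbox` (no effect) from `--disable-gpu-watchdog` (crash gone).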

https://github.com/intel/webml-polyfill/issues/514 has the same root cause. Please help verify that one as well.

To fix this, we may need to implement an off-main-thread ML service, as tracked in https://github.com/intel/webml-polyfill/issues/517.
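The suspected root cause and the proposed fix can be illustrated with a toy model. This is a hypothetical sketch, not Chromium's actual GPU watchdog: it assumes the GPU main thread sends a heartbeat each time it finishes a task, and that the watchdog kills the process if a single task blocks for longer than `timeoutMs`. The function names, the 10-second timeout, and the compile duration are made up for illustration.

```typescript
// Toy model of a GPU watchdog; all numbers are hypothetical.
export interface Outcome {
  killed: boolean; // true if the watchdog killed the process
  atMs: number;    // when it was killed, or when all tasks finished
}

// Run `tasksMs` back to back on the main thread. A heartbeat is sent when
// each task finishes; if one task blocks longer than `timeoutMs`, the
// watchdog sees no heartbeat in time and kills the process mid-task.
export function simulateMainThread(tasksMs: number[], timeoutMs: number): Outcome {
  let now = 0; // time of the most recent heartbeat
  for (const duration of tasksMs) {
    if (duration > timeoutMs) {
      return { killed: true, atMs: now + timeoutMs };
    }
    now += duration; // task finished: heartbeat sent
  }
  return { killed: false, atMs: now };
}

// Compilation on the main thread: one long blocking task.
export function compileOnMainThread(compileMs: number, timeoutMs: number): Outcome {
  return simulateMainThread([compileMs], timeoutMs);
}

// Compilation off the main thread: the compile runs elsewhere, and the main
// thread only handles short tasks of `sliceMs` each until it completes.
export function compileOffMainThread(
  compileMs: number,
  timeoutMs: number,
  sliceMs: number,
): Outcome {
  if (sliceMs <= 0) throw new Error("sliceMs must be positive");
  const slices = Math.ceil(compileMs / sliceMs);
  return simulateMainThread(new Array<number>(slices).fill(sliceMs), timeoutMs);
}

const TIMEOUT_MS = 10000; // hypothetical watchdog timeout
const COMPILE_MS = 30000; // hypothetical compile time for a large model

console.log(compileOnMainThread(COMPILE_MS, TIMEOUT_MS));      // { killed: true, atMs: 10000 }
console.log(compileOffMainThread(COMPILE_MS, TIMEOUT_MS, 16)); // { killed: false, atMs: 30000 }
console.log(compileOnMainThread(COMPILE_MS, Infinity));        // { killed: false, atMs: 30000 }
```

Passing `Infinity` as the timeout mimics `--disable-gpu-watchdog`: the long compilation completes, matching the behavior reported in this thread, but the main thread is still blocked for the whole compile. Only moving the work off the main thread removes the long blocking task itself.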

Christywl commented 5 years ago

@huningxin, I tried again with "--disable-gpu-watchdog" on Windows; the examples worked and the browser didn't crash. SSDLite MobileNetV2(TFlite) in #514 also worked with this workaround.

Christywl commented 5 years ago

This issue has been fixed on the nightly build a5a8547.