tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Why does mobileNet.predict() hit a performance cliff after a certain number of runs? #145

Closed nsthorat closed 5 years ago

nsthorat commented 6 years ago

From @samwyi on April 5, 2018 22:21

TensorFlow.js version: 0.6.0

TensorFlow.js Core version: 0.6.0

Browser version: Chrome 65.0.3325.181 (Official Build) (64-bit)

Describe the problem or feature request

mobileNet.predict() hits a performance cliff after a certain number of runs. On my MacBook Pro, the average over 100 runs is ~11ms, while the average over 200 runs climbs to ~46ms. A similar issue happens on my Android device. I wonder what causes the performance drop? Any way to avoid it? Thanks.

Code to reproduce the bug / link to feature request:

Change the cat.onload() code in tfjs-converter/demo/index.js to call mobileNet.predict() multiple times in a loop:

let result;
console.time('Subsequent predictions');
for (let i = 0; i < 200; i++) {
  result = mobileNet.predict(pixels);
}
console.timeEnd('Subsequent predictions');  // label must match the console.time() call

Copied from original issue: tensorflow/tfjs-core#925

nsthorat commented 6 years ago

Try awaiting tf.nextFrame() between predictions (you may have to wrap your code in an async function):

async function run() {
  console.time('Subsequent predictions');
  let result;
  for (let i = 0; i < 200; i++) {
    result = mobileNet.predict(pixels);
    await tf.nextFrame();  // yield to the browser between predictions
  }
  console.timeEnd('Subsequent predictions');  // label must match console.time()
}
run();

samwyi commented 6 years ago

@nsthorat Adding "await tf.nextFrame()" really makes the performance numbers consistent on my laptop!

But on smartphones, the performance cliff still exists. After profiling with the Chrome performance dev tools, I noticed that after every 20-30 predictions there is a long wait of 30-60 seconds (not ms!), with the GPU shown as busy (solid green). Since I put "await tf.nextFrame()" between predictions, there should be at most one inference running on the GPU at a time, so I wonder what is causing the long wait? Any ideas?

I tested on two Android phones; both show the same issue.

nsthorat commented 6 years ago

Ah, apologies: you probably also have a memory leak and need to dispose of the result of each predict() call. Try this code:

async function run() {
  console.time('Subsequent predictions');
  let result;
  for (let i = 0; i < 200; i++) {
    result = mobileNet.predict(pixels);
    await tf.nextFrame();
    result.dispose();  // mark the result tensor's memory for reuse
  }
  console.timeEnd('Subsequent predictions');  // label must match console.time()
}
run();

Check out our section on memory in this tutorial: https://js.tensorflow.org/tutorials/core-concepts.html
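
A minimal sketch of the tf.tidy() pattern that tutorial covers, reusing the assumed mobileNet and pixels variables from the snippet above: tensors created inside the callback are disposed automatically when it returns, so no manual dispose() call is needed.

async function run() {
  console.time('Subsequent predictions');
  for (let i = 0; i < 200; i++) {
    tf.tidy(() => {
      const result = mobileNet.predict(pixels);
      // read or render result here; it is freed when tidy() returns
    });
    await tf.nextFrame();  // note: await must stay outside tf.tidy()
  }
  console.timeEnd('Subsequent predictions');
}
run();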

samwyi commented 6 years ago

Tried calling result.dispose() after each prediction. On my Android phone (Huawei BLN-L24, Android 7, GPU: Mali-T830MP2), it really eliminated the 60-second wait between groups of 20-30 predictions, BUT at the cost of a 2-second wait for each result.dispose(). I'm surprised to see that releasing GPU memory takes much longer than running the model itself ;-) Is this an issue in tensorflow.js or in the GPU driver?

nsthorat commented 6 years ago

Interesting, it really shouldn't take 2 seconds for a dispose(). Our dispose() just marks the memory for reuse; it doesn't actually free it.

One thing to test: console.log tf.memory().numTensors at each tick and make sure the number of tensors isn't increasing.
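
A sketch of that check folded into the loop above; numTensors is a property of the object tf.memory() returns, not a method:

async function checkMemory() {
  let result;
  for (let i = 0; i < 200; i++) {
    result = mobileNet.predict(pixels);
    await tf.nextFrame();
    result.dispose();
    // if this count grows across iterations, something is leaking
    console.log('tensors after run ' + i + ':', tf.memory().numTensors);
  }
}
checkMemory();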

samwyi commented 6 years ago

tfc.memory().numTensors increases by 4 after each mobileNet.predict() call. Calling dispose() doesn't seem to help :(

dsmilkov commented 6 years ago

Hi @samwyi,

Can you share some simple code to reproduce this? That way we can take a closer look, especially regarding the number of tensors going up by 4 after each mobileNet.predict() call.
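
A hypothetical minimal repro along those lines, assuming the mobileNet and pixels variables from the converter demo:

async function measureLeak() {
  const before = tf.memory().numTensors;
  const result = mobileNet.predict(pixels);
  result.dispose();
  await tf.nextFrame();
  // should print 0 if predict() cleans up all of its intermediates
  console.log('tensors leaked per predict:', tf.memory().numTensors - before);
}
measureLeak();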

RadEdje commented 6 years ago

Hello, just wanted to ask if the performance cliff on Android 5.0 has been figured out? I built a proof-of-concept web app that uses the video camera of a phone or the webcam of a desktop/laptop to detect/recognize radiographic findings.

The web app is available in two versions:

https://radhorizon.com/SITES/RadLense/ (this uses tf.js version 0.10.0)

and

https://radhorizon.com/SITES/RadLense/index3.html (this uses the latest tensorflow.js, version 0.11.7)

Both work on the latest Firefox, Opera, and Chrome on desktop, as well as the latest Chrome on Android 8.0.

The latest 0.11.7 is blazing fast compared to version 0.10.0, I must say, but both versions seem to have the same problem: they work on Android 8.0 but not on Android 5.0.

I've tried numerous ways to debug and look for the cause of the problem. No errors show up in the error log, so it doesn't look like a JavaScript issue. The webcam runs, so the phone is detecting the camera and putting its data into a video element. The AI/ML model loads properly, since I rigged the app to stop at the splash screen if it doesn't.

I had to manually add console.log("check here"); statements to see which part of the app was stalling, since no actual errors were showing up in the console. That's how I narrowed it down to model.predict(), and that's how I found this thread. I tried the various solutions above, but it still does not seem to work. Just hoping to hear whether anyone has gotten tf.js to run on Android 5.0, or whether I should just wait for everyone to end up on Android 8.0. Thanks.
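
One quick check worth running on such devices: older Android GPUs often lack the WebGL float-texture support the tfjs WebGL backend has historically relied on, which could stall predict() without a JavaScript error. A diagnostic sketch using plain browser APIs, nothing tfjs-specific:

const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl') || canvas.getContext('experimental-webgl');
console.log('WebGL available:', !!gl);
if (gl) {
  // the WebGL 1 backend has historically depended on this extension
  console.log('OES_texture_float:', !!gl.getExtension('OES_texture_float'));
}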

nsthorat commented 6 years ago

Hi, can you rerun this with the latest version? Thanks!

nsthorat commented 5 years ago

Closing this out due to inactivity.