tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Difference between First Run on Benchmark Run and In Code #4907

Closed rohanmuplara closed 2 years ago

rohanmuplara commented 3 years ago

When I run a model for the first time, it is slow. This is expected. However, the first-run time I measure differs dramatically between the benchmarking setup (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) and my own custom setup (https://github.com/rohanmuplara/tester/tree/test_tfjs/graph): the warmup inference ranges from about 200ms to 2000ms, a 10 to 20x difference. I observe this for almost all my models. I recorded a screencast of the exact steps I followed: https://share.descript.com/view/4UBmlV8jmqd. All of these experiments were on Chrome, on a Mac, with the newest tfjs build.

rohanmuplara commented 3 years ago

I have simplified it even more. I am using the TF Hub MobileNet model with a basic predict call (https://github.com/rohanmuplara/tester/tree/test_tfjs2/graph) and am still noticing a 10x difference.

I was wondering if there are any suggestions for how to reduce this warmup time. Also, when does the warmup copying to the GPU expire? From my observation, reloading the page makes things slow again. Say I have a same-domain iframe on many pages: is there a way to take advantage of pre-running it on one page, so that when a user switches to another page with the same iframe it is already warm? Is there a way to cache things on the GPU? Thanks, Rohan

wingman-jr-addon commented 3 years ago

Sometimes async is an interesting source of non-determinism. I also noticed that you are getting the Tensor but never actually getting the data out of it - which is an important step that takes time. What happens if you run the following code for benchmark.js? For me the times increased; they were already fairly smooth anyway. (FF87, Win 10)

function runModel(model, tensors, returnTensorReferences) {
    // dataSync() forces the output data to be read back from the GPU,
    // so the timing includes the download and not just kernel scheduling.
    const predictions = model.predict(tensors).dataSync();
    return predictions;
}

async function benchmarkInput(model_path, tensors, num_runs) {

  console.time("model loading time");
  let model = await tf.loadGraphModel(model_path, { fromTFHub: true });
  console.timeEnd("model loading time");

  console.time("first prediction");
  const predictions = runModel(model, tensors, false);
  console.timeEnd("first prediction");

  let subsequent_times = new Float32Array(num_runs - 1);
  for (let i = 0; i < num_runs - 1; i++) {
    let begin = window.performance.now();
    const prediction = runModel(model, tensors, true);
    let end = window.performance.now();
    subsequent_times[i] = end - begin;
  }
  console.log("subsequent predictions are in ms", subsequent_times);
  console.log("the average of the subsequent predictions is", average(subsequent_times));
}

function average(array) {
    return array.reduce((a, b) => a + b) / array.length;
}

function benchmarkInputDefinedInCode() {
    let tensor1 = tf.ones([224, 224, 3]);
    tensor1 = tensor1.expandDims(0);
    benchmarkInput("https://tfhub.dev/google/tfjs-model/imagenet/mobilenet_v2_100_224/feature_vector/2/default/1", [tensor1], 100);
}
benchmarkInputDefinedInCode();

rohanmuplara commented 3 years ago

@wingman-jr-addon Totally get your point about dataSync.

If you look at the first post above, https://github.com/tensorflow/tfjs/issues/4907#issue-851978678, there is a more complicated setup where I do use dataSync. I just left it out above to keep things simple, i.e. to rule out syncing time as the cause of the delays. I tried your setup above and it doesn't really make a difference. I agree that subsequent iterations are really fast; I am only really concerned about the first one.

The problem for me is the first iteration time: it is 10 to 20 times slower than subsequent runs. My two questions are: a. The first run in my setup is 10-20x slower than what the benchmarking tool (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) reports. Subsequent runs in my setup take roughly the same time as in the benchmarking tool, but the first runs don't agree.

and b. I was wondering if there are any suggestions for how to reduce this warmup time. Also, when does the warmup copying to the GPU expire? From my observation, reloading the page makes things slow again. Say I have a same-domain iframe on many pages: is there a way to take advantage of pre-running it on one page, so that when a user switches to another page with the same iframe it is already warm? Is there a way to cache things on the GPU?

wingman-jr-addon commented 3 years ago

@rohanmuplara Ah, well if you're wondering specifically why the first inference is so slow, that was discussed over in #1715 - basically, if you're using WebGL, the shaders are compiled the first time through. As a side note: try the WASM backend once as a comparison, but be aware that performance differences between backends vary greatly from machine to machine. As an example, WASM is much slower on my machine, but the first inference has very little penalty.
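If you want to try that comparison, here is a rough sketch (the URL and input shape are placeholders; it assumes the @tensorflow/tfjs-backend-wasm package is loaded):

    // Rough sketch: switch backends and time the first inference explicitly.
    async function compareWarmup(modelUrl) {
      await tf.setBackend('wasm');               // or 'webgl' to compare
      await tf.ready();
      const model = await tf.loadGraphModel(modelUrl, { fromTFHub: true });
      const input = tf.ones([1, 224, 224, 3]);   // placeholder input shape
      const t0 = performance.now();
      const out = model.predict(input);
      out.dataSync();                             // force the result back to the CPU
      console.log('first inference (ms):', performance.now() - t0);
      tf.dispose([input, out]);
    }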

Regarding caching on the GPU ... well, maybe if you were able to do some type of communication between web pages, where one page indicates it is the "server" and the others decide to become "clients" when they open and detect that a "server" is already present.
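For what it's worth, a very rough sketch of that coordination using a BroadcastChannel (all names are made up; this only keeps one warm page around, it does not share GPU state between pages):

    // Rough sketch: elect one same-origin page as the "server" that loads and warms
    // the model; other pages detect it and forward work to it instead of paying the
    // warmup cost themselves. Channel name and messages are hypothetical.
    const channel = new BroadcastChannel('tfjs-model-host');
    let isServer = false;
    let serverExists = false;
    channel.onmessage = (e) => {
      if (e.data === 'ping' && isServer) channel.postMessage('pong');  // server answers
      if (e.data === 'pong') serverExists = true;                      // client heard a server
    };
    channel.postMessage('ping');
    setTimeout(() => {
      if (!serverExists) {
        isServer = true;
        // This page becomes the server: load the model, run a warmup inference,
        // and answer future inference requests posted on the channel.
      }
    }, 250);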

rohanmuplara commented 3 years ago

@wingman-jr-addon My question is twofold: a. What is the difference between the benchmark tool and my setup?

The problem for me is the first iteration time: it is 10 to 20 times slower than subsequent runs. The first run in my setup is 10-20x slower than what the benchmarking tool (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) reports, while subsequent runs in my setup take roughly the same time as in the benchmarking tool.

On b: I do agree that first inference on WASM is quicker, but subsequent runs are much slower. My question is about the default behavior of tfjs with respect to reloading the page. Additionally, is there a way in tfjs to cache the shader compilation, or to let tfjs know that the compiled shaders are already present?

wingman-jr-addon commented 3 years ago

@rohanmuplara I see, sorry for being dense.

Well, I may be able to solve part of your mystery but not all of it. If I'm not mistaken, the benchmark may actually be doing a prediction inside the model load itself. From https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html :

    async function loadModelAndRecordTime() {
      updateStateFromURLState();
      const benchmark = benchmarks[state.benchmark];
      state.modelType = benchmark['type'];
      state.isModelChanged = false; // used to clean the performance history

      if (benchmark['load'] == null) {
        throw new Error(`Please provide a load method for '${state.benchmark}' model.`);
      }

      await showMsg('Loading the model');
      let start = performance.now();
      const inputSize = parseFloat(state.inputSize);
      model = await benchmark.load(inputSize, state.architecture, state.inputType);
      state.inputs = [];
      if (model.inputs) {
        // construct the input state for the model
        for (let modelInputIndex = 0; modelInputIndex < model.inputs.length; modelInputIndex++) {
          let modelInput = model.inputs[modelInputIndex];
          if (modelInput.shape == null) continue;
          let shape = modelInput.shape.map(e => e == null ? -1 : e);
          state.inputs.push({
            name: modelInput.name,
            shape: [...shape],
            dtype: modelInput.dtype,
            range: [0, 1000]
          });
        }
      }
      predict = benchmark.predictFunc(inputSize);
      const elapsed = performance.now() - start;

      await showMsg(null);
      appendRow(timeTable, 'Model load', printTime(elapsed));
    }

Now, looking at your video, I see that the model load time is still too small to account for this, and I see similar numbers for my own models (<200ms). Notably, it also does not match my experience using the model in my own context: I expected load plus first inference times of >10 seconds. I'm pretty sure this is an artifact of the benchmark tool somehow, because if I load the standard mobilenet_v2 benchmark the times are much longer than for the custom benchmark, despite that model being "smaller" than my model. Do you experience an unexpected difference in timing between your custom benchmark and mobilenet_v2? I'm not sure how big your model is supposed to be.

One more thing makes me suspicious of the benchmark. I watched your video closely, and while the reported warmup time is ~150ms, it actually looks like it was working from about 1:49 to 1:53, which would be consistent with the ~3500ms you see in your own code. So ... I think the benchmark may have a bug?

rohanmuplara commented 3 years ago

Hey man no need to apologize. I am super appreciative of all your help!

Yes, in the benchmark tool I get this discrepancy every time. I used a custom model (I have tried at least 20) and have observed this error in all of them. I think you are correct that either that line shouldn't be there, or there should be more documentation around it. I think the problem is that there is no await for that prediction. Please see this video, https://share.descript.com/view/0Fqye7z8hGW, where I try to explain the bug. To be clear, I am very unsure.

So, just to be clear: in my own JavaScript setup (not the benchmark tool), this is the expected behavior and this is how long it is supposed to take. Is there any good workaround for WebGL? Is there a way to preserve the textures on the GPU (caching of any kind) after a page reload, or for a new page with the same iframe enabled on both tabs? I'll be frank, I don't know much about this, so any help on how tfjs behaves here and how to optimize it would be appreciated.

vladmandic commented 3 years ago

@rohanmuplara

No idea why the benchmarking tool would show such a huge difference on first inference, other than these minor differences: a) the benchmark tool uses model.executeAsync() for custom graph-based models and model.predict() only for layers-based models, while it uses model.predict() for all predefined models; b) the benchmark tool uses tf.randomNormal() to set up the input tensor (vs your use of tf.ones()).
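If you want to mirror the tool more closely, a sketch of those two differences (to be run inside an async function; MODEL_URL and the input shape are placeholders):

    // Sketch of the two differences above: randomNormal input and executeAsync.
    const model = await tf.loadGraphModel(MODEL_URL, { fromTFHub: true });  // placeholder URL
    const input = tf.randomNormal([1, 224, 224, 3]);  // benchmark-tool style, vs tf.ones([...])
    // the tool uses executeAsync() for custom graph models; predict() is enough
    // when the graph has no control-flow ops
    const result = await model.executeAsync(input);
    await result.data();                               // read the output back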

On the topic of GL shader caching, this gets tricky:

One way that comes to mind would be to try to save the entire state of tf.engine().backendInstance.textureManager (a bit more complicated than it sounds, as most values are read-only and would need to be accessed at a lower level) to somewhere like localStorage in the browser and restore it upon page load - but that would get very messy very fast.

And there are still no guarantees about when the browser's garbage collection would kick in, as there would always be a delay between page load and the state being restored; so you would probably also have to download all the GL textures, save them as well, and restore them on load.
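As a side note, you can at least see how much tfjs is holding on the GPU without touching those internals - a small sketch using tf.memory() (numBytesInGPU is reported by the WebGL backend; assumes `model` and `input` already exist):

    // Sketch: compare tfjs memory bookkeeping before and after a warmup inference.
    const before = tf.memory();
    const out = model.predict(input);
    out.dataSync();
    const after = tf.memory();
    console.log('tensors:', before.numTensors, '->', after.numTensors);
    console.log('GPU bytes:', before.numBytesInGPU, '->', after.numBytesInGPU);
    out.dispose();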

rohanmuplara commented 3 years ago

@vladmandic thanks for all your points again.

Not sure if you care about the details, but I think the issue is this: https://share.descript.com/view/0Fqye7z8hGW. The model load calls predictFunction with no await, and for custom models these are async (so they return once executeAsync is called, not when it finishes), so this time doesn't get accounted for.

On the point of optimization, I get all your points about multiple tabs and not having a reference. A few follow-ups:

  1. On the textureManager suggestion of writing to disk: is the time-consuming part creating the GL textures from the model, or copying them to the GPU? (This affects whether writing to local storage would be helpful.)

  2. Could we maybe precompute these? I.e., there may be a few differences per browser, but we could precompute this beforehand, offline, and load it directly from cloud storage.

  3. Changes required to tfjs: in addition to making some of these values writable, I guess some code in tfjs would have to be changed so that it does not reallocate and recompile everything on the GPU when calling predict for the first time, and instead uses this texture manager.

vladmandic commented 3 years ago

predictFunction has a variable (conditional) definition.

Regarding "optimizations", yes, I think it would be much faster to store them to disk and reload them - but the scope is massive, it would pretty much become a new type of a precompiled model

You would need to download all precompiled shaders and textures after the first run and then restore them as part of model loading. And it's not just a GL save/restore - you also need to make sure that the TFJS state (textureManager) is valid.

Why do I think it would be much faster? a) the initial compile of shaders is time consuming; b) uploading unchanged parts of textures is very time consuming (just try setting WEBGL_DELETE_TEXTURE_THRESHOLD=0 so textures are deallocated on each frame and see the difference).
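That flag can also be set from code - a minimal sketch, assuming the WebGL backend is active:

    // Force textures to be deallocated immediately so nothing is reused between runs.
    tf.env().set('WEBGL_DELETE_TEXTURE_THRESHOLD', 0);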

Also note that the shader code has a lot of conditional statements that test GPU GL capabilities and use the appropriate functions accordingly, so such a "precompiled" model would not be a generic WebGL model and would only work on newer GPUs (and on newer browsers with modern drivers).

I love the idea, but I think the scope is just massive...

If I have to look forward to something, it would be the WebGPU backend - currently it's in early stages of development (and it only works on debug versions of browsers), but it could help significantly in the future.

rohanmuplara commented 3 years ago

Yes, I agree with everything you're saying. As you mentioned, in custom graph models predict is an async function because it calls executeAsync. Although there is an await within that function, there is no await outside it. I think the main culprit is this line: https://github.com/tensorflow/tfjs/blob/38f8462fe642011ff1b7bcbb52e018f3451be58b/e2e/benchmarks/local-benchmark/index.html#L469. So in the case of custom graph models, because it is an async function (using await internally) and there is no await outside it, it behaves unexpectedly.
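To illustrate the effect in isolation (a generic sketch, not the benchmark's actual code - predictAsync here is a hypothetical async predict function):

    // Timing an async call without await measures almost nothing, because the call
    // returns a pending Promise before the GPU work has finished.
    const t0 = performance.now();
    predictAsync();                                        // hypothetical async predict
    console.log('without await:', performance.now() - t0); // ~0 ms

    const t1 = performance.now();
    await predictAsync();                                  // now the time includes inference
    console.log('with await:', performance.now() - t1);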

Your point on WebGPU makes sense.

vladmandic commented 3 years ago

predictFunc would not return timeInfo without await - see :

timeInfo = await timeInference(() => predict(model), numRuns);

where predict is:

    const start = performance.now();
    const res = await predict();
    const elapsedTime = performance.now() - start;
    times.push(elapsedTime);

where this predict is a conditional function that uses model.predict(), model.execute(), or model.executeAsync(). Also, the initial inference time is just timeInfo.times[0]; there is no other handling for it.

rohanmuplara commented 3 years ago

Sure. I think the point is that predict is called in the load function itself, so the reported warmup time is actually the second time it is called, which is a little unintuitive. In that call there is no await, so for custom models the time of this first run isn't even counted in the load function: https://github.com/tensorflow/tfjs/blob/38f8462fe642011ff1b7bcbb52e018f3451be58b/e2e/benchmarks/local-benchmark/index.html#L469

vladmandic commented 3 years ago

Possibly - I never relied on that e2e benchmark other than when the tfjs staff asked me to run it; I prefer to run my own benchmarks.
Anyhow, the core issue here is slow initial inference when using the WebGL backend, and we've covered that.

pyu10055 commented 3 years ago

@rohanmuplara @vladmandic @wingman-jr-addon Great discussion here; it seems the inference time measurements are a bit confusing. I agree we should not pack the first inference into the model loading, and there might be a bug in the measurement as well. We will address those.

rohanmuplara commented 3 years ago

@vladmandic @pyu10055 @wingman-jr-addon thanks for all your help!

rohanmuplara commented 3 years ago

@vladmandic @pyu10055 @wingman-jr-addon https://share.descript.com/view/YT5OGQnajc3. To be clear, I only care about first-time inference speed. I noticed that if I have two models with the same architecture, the second one's first-time inference is really fast. I was wondering why that happens and how to reason about it. I believe they have some weights that are the same and some that are different. I have a pipeline of 5 models, so I was wondering, for example: if they have exactly the same architectures but different weights, does that help with first-time inference speed? https://github.com/rohanmuplara/tester/blob/tfjs_weird/graph/stitched.js

wingman-jr-addon commented 3 years ago

I was pretty sure I'd read someplace that the shader compilation happens the first time through. If so, I would expect that the weights are only data rather than a difference in the shaders, so that would align with your experience. I'm not familiar with the guts, though. As an experiment, I suspect that with the WASM backend you will likely not see as much of a difference between model runs.

rohanmuplara commented 3 years ago

@wingman-jr-addon I am not too interested in the WASM backend as it is significantly slower. So my questions on your response are: a. I had two different model references, so somehow either tfjs or the GPU has to figure out that these are the same. b. When does it do this? For example, what if the models differ by one layer? How exactly the same do the architectures have to be? I'm trying to get an intuition for this. Who would be the best person to ping about this?

vladmandic commented 3 years ago

It's not per-model or per-layer, it's per kernel op - each kernel op is compiled into a shader, so there is a lot that can be reused between models.

And regarding weights, note that although the bin files may look very different, a lot of models start in the same place (the same base weights, which are only added to during training). So if models are based on the same architecture (e.g. MobileNet v2, an extremely common starting point), they likely share >50% of their weights as well.

rohanmuplara commented 3 years ago

@vladmandic Quick follow-up: a. Who does this kernel op duplication check - is it tfjs code or the GPU? b. I totally agree that the weights are very similar; the architecture I used was 99% MobileNet, which by default loads pretrained weights. I was wondering: if the weights were different, do you still get savings? If I have a pipeline of 5 models, they will all need different weights (they do different things), but if the architectures are the same, do I get savings?

vladmandic commented 3 years ago

It's not deduplication, it's a simple compile of any used op on its first use - the op is then registered in tfjs as existing, so whoever calls it next doesn't need to compile it again. That can be the second usage of the same op from the same model or from a different model; it doesn't matter. And there is always just one implementation of any given op - just try importing multiple instances of tfjs and you'll get tons of warnings in the console about 'op already registered'.

Regarding weights, they are not a monolithic thing; they are extracted as needed from the weights file to be used by specific ops. When an op is compiled to a GL shader and it has weights as a param, those weights are uploaded to the GPU as a GL texture, and tfjs maintains a map of the uploaded textures.

So if a different model has 50% different weights but shares the other 50%, because they are inherited from a common model, you do get significant savings.

To summarize, you get savings on a) compiling ops as shaders (which is independent of the weights), and b) uploading weights as textures (which must match exactly).
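A quick way to see the first kind of saving in practice (a sketch; URL_A and URL_B are placeholders for two models that share the same architecture, run inside an async function):

    // Warm up model A, then time model B's first inference. If the two models use the
    // same ops, B's first run should be much faster because A's compiled shaders are reused.
    const modelA = await tf.loadGraphModel(URL_A, { fromTFHub: true });
    const modelB = await tf.loadGraphModel(URL_B, { fromTFHub: true });
    const input = tf.ones([1, 224, 224, 3]);   // placeholder shape

    modelA.predict(input).dataSync();           // pays the shader-compilation cost

    const t0 = performance.now();
    modelB.predict(input).dataSync();           // mostly weight upload left to do
    console.log('model B first inference (ms):', performance.now() - t0);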

disclaimer: all this is unofficial and comes from my experience with tfjs

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please @mention us if this needs more attention.