tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

tfjs webgl backend memory handling of complex models #4166

Open vladmandic opened 4 years ago

vladmandic commented 4 years ago

for more complex object detection models such as faster_rcnn_inception_resnet_v2_atrous_oidv4
from the TF Model Zoo (model stats: numBytes 255,915,844, numTensors 17,653),

i cannot get a valid run no matter what on my notebook with a 4GB GPU using the webgl backend

note that everything below works out-of-the-box in nodejs using tfjs-node or tfjs-node-gpu;
the issue is specific to tfjs with the webgl backend

and with tfjs-node, memory usage stays at 550MB during execution,
nowhere close to the 4+GB used with WebGL

why is webgl backend using so much more memory?
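one plausible contributor (my speculation, not confirmed by the tfjs team): when texture packing is disabled, the WebGL backend stores each tensor value in a full RGBA float texel, which quadruples the footprint of a float32 tensor before counting any intermediate buffers. a back-of-envelope sketch:

```javascript
// Back-of-envelope estimate of GPU texture memory for a tensor.
// ASSUMPTION: an unpacked WebGL texture uses one RGBA float32 texel
// (16 bytes) per value; a packed texture uses one texel per 4 values.
function estimateTextureBytes(numElements, packed) {
  const BYTES_PER_TEXEL = 4 * 4; // RGBA channels x 4 bytes (float32)
  const texels = packed ? Math.ceil(numElements / 4) : numElements;
  return texels * BYTES_PER_TEXEL;
}

const cpuBytes = 255_915_844;      // the model's reported numBytes
const elements = cpuBytes / 4;     // float32 elements
console.log(estimateTextureBytes(elements, false)); // 4x the CPU footprint
```

this alone does not explain an 8x blow-up, but it shows why the webgl baseline starts well above the other backends.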

anyhow...

if i don't set WEBGL_DELETE_TEXTURE_THRESHOLD, i run into a webgl out-of-memory condition
(confirmed by watching GPU memory usage in the Windows 10 task manager)
and a loss of context, resulting in the standard error:

Error: Failed to compile fragment shader.
  createFragmentShader  @   webgl_util.js:82
  createProgram @   gpgpu_context.js:199
  compileProgram    @   gpgpu_math.js:44
  (anonymous)   @   backend_webgl.js:1851
  getAndSaveBinary  @   backend_webgl.js:1879
  runWebGLProgram   @   backend_webgl.js:1850
  compileAndRun @   backend_webgl.js:1874
  conv2d    @   backend_webgl.js:1494

that's 4GB exhausted in a single inference while non-webgl backends have no issues with less than 1GB

but if i do set WEBGL_DELETE_TEXTURE_THRESHOLD to 0 or any number below the maximum available GPU memory,
i run into what looks like an access violation due to use of a deallocated shader/texture:

TypeError: Cannot read property '0' of undefined
  getPackedSampler2D    @   shader_compiler.js:615
  getPackedSamplerFromInInfo    @   shader_compiler.js:93
  getInputSamplingSnippet   @   shader_compiler.js:103
  (anonymous)   @   shader_compiler.js:36
  makeShader    @   shader_compiler.js:36
  compileProgram    @   gpgpu_math.js:43
  (anonymous)   @   backend_webgl.js:1851
  getAndSaveBinary  @   backend_webgl.js:1879
  runWebGLProgram   @   backend_webgl.js:1850
  compileAndRun @   backend_webgl.js:1874
  pad   @   backend_webgl.js:737
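for reference, the flag itself is set via `tf.env().set()` (the real tfjs API) before the first inference; the wrapper function below is my own, written to take the environment as a parameter:

```javascript
// Sketch: configure the WebGL texture-deallocation flag up front.
// tf.env().set() is the real tfjs API; the wrapper name is mine.
function configureWebglMemory(env) {
  // Textures are deleted once total GPU bytes exceed this threshold;
  // 0 means "deallocate as aggressively as possible".
  env.set('WEBGL_DELETE_TEXTURE_THRESHOLD', 0);
  return env;
}

// usage with the real library:
//   const tf = require('@tensorflow/tfjs');
//   configureWebglMemory(tf.env());
//   await tf.setBackend('webgl');
```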

it looks like WEBGL_DELETE_TEXTURE_THRESHOLD is all-or-nothing when it comes to releasing objects;
there is no concept of tracking what is referenced and what is not

and if i try to reduce memory usage by forcing f16 textures via WEBGL_FORCE_F16_TEXTURES,
i get a shape mismatch due to clipped values:

Error: Size(442368) must match the product of shape 1,65504,1,4
  inferFromImplicitShape    @   util.ts:318
  reshape   @   Reshape.ts:35
  kernelFunc    @   engine.js:431
  (anonymous)   @   engine.js:483
  scopedRun @   engine.js:324
  runKernelFunc @   engine.js:481
  reshape_  @   reshape.js:58
  reshape__op   @   operation.js:44
  executeOp$g   @   transformation_executor.js:34
  (anonymous)   @   operation_executor.js:78
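the 65504 in that shape is a tell: it is the largest finite half-precision (IEEE 754 binary16) value, so my reading (not confirmed) is that a shape-carrying value got squeezed through an f16 texture and clamped. a quick check in plain JS:

```javascript
// Largest finite float16 value: (2 - 2^-10) * 2^15 = 65504.
const F16_MAX = (2 - 2 ** -10) * 2 ** 15;
console.log(F16_MAX); // 65504

// The failing reshape wanted 442368 elements as [1, N, 1, 4],
// so N should have been 442368 / 4 = 110592 - which exceeds
// F16_MAX and would clamp to 65504 if stored as half precision
// (my interpretation of the error, not a confirmed diagnosis).
const expectedDim = 442368 / 4;
console.log(expectedDim > F16_MAX); // true
```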

a 4GB GPU should be more than enough to handle a 255MB model that executes in 550MB everywhere except WebGL
(no matter how different WebGL is, >8x memory usage is not acceptable)

And to confirm the tfjs-node results, tf.profile() reports a peak of 550MB:

{ peakBytes: 577269607 }
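for context, a sketch of how that figure is obtained: `tf.profile()` is the real tfjs API and returns an object containing `peakBytes`; the formatting helper below is mine:

```javascript
// Sketch: report peak memory from a tf.profile() result.
// tf.profile() is the real tfjs API; this helper is mine.
function formatPeakBytes(peakBytes) {
  return (peakBytes / (1024 * 1024)).toFixed(1) + ' MB';
}

// usage with the real library:
//   const info = await tf.profile(() => model.predict(input));
//   console.log(formatPeakBytes(info.peakBytes));

console.log(formatPeakBytes(577269607)); // "550.5 MB"
```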

environment: TFJS 2.7.0 on windows 10 build 19042 with chrome 86

vladmandic commented 4 years ago

we can keep it simple - a simpler model with multiple variations,
initialized with checkpoints at higher and higher resolution

the models are the efficientdet family, converted from a tfhub saved_model to a graph_model using:

tensorflowjs_converter --signature_name=serving_default --input_format=tf_hub https://tfhub.dev/tensorflow/efficientdet/d0/1 efficientdet-d0

i wanted to get more data by running tf.profile(), but that only works for the smallest d0 model and fails with out-of-memory for all others; i can't even get the d1 variation to run.

instead, here is a snapshot from task manager showing gpu memory usage during each run,
with WEBGL_DELETE_TEXTURE_THRESHOLD=0 for the most aggressive memory deallocation:

EfficientDet-D0, model size 19,378,390 bytes: success

EfficientDet-D1, model size 25,084,265 bytes: success

EfficientDet-D2, model size 30,825,917 bytes: success

EfficientDet-D3, model size 48,050,408 bytes: success

EfficientDet-D4, model size 83,665,253 bytes: error with webgl out-of-memory, happens at the end of the inference run

EfficientDet-D5, model size 122,135,609 bytes: error with webgl out-of-memory, happens almost immediately

(task manager screenshots of GPU memory usage omitted)

all models (up to d7) run correctly and without huge memory usage in nodejs
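instead of eyeballing task manager, the peak could also be tracked numerically: `tf.memory()` is the real tfjs API, and on the webgl backend its result also includes `numBytesInGPU`; the polling helper below is mine:

```javascript
// Sketch: poll a memory-reporting function and keep the observed peak.
// tf.memory() is the real tfjs API; the tracker itself is mine, and it
// takes the reporting function as a parameter so it can be tested with stubs.
function makePeakTracker(memoryFn) {
  let peak = 0;
  return {
    sample() {
      const m = memoryFn();
      // the webgl backend additionally reports numBytesInGPU
      const bytes = m.numBytesInGPU ?? m.numBytes;
      if (bytes > peak) peak = bytes;
      return peak;
    },
    get peak() { return peak; },
  };
}

// usage: const tracker = makePeakTracker(() => tf.memory());
//        const timer = setInterval(() => tracker.sample(), 100);
```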

pyu10055 commented 4 years ago

@vladmandic Thank you for the detailed report. @annxingyuan can you help take a look at what causes the GPU OOM during the run? thanks.

vladmandic commented 3 years ago

@pyu10055 @annxingyuan any updates?

annxingyuan commented 3 years ago

Hi @vladmandic - apologies for the delay. I've uploaded a test build of the WebGL backend here: https://storage.googleapis.com/learnjs-data/temp/tf-backend-webgl.es2017.memfix.js

Would you mind testing this out to see whether it fixes the getPackedSampler2D error you pasted above, when WEBGL_DELETE_TEXTURE_THRESHOLD is set to 0?

vladmandic commented 3 years ago

@annxingyuan

i cannot get your tf-backend-webgl.es2017.memfix.js to work with the generic tfjs-core from tfjs 2.7.0 - i get the error The kernel 'undefined' for backend 'webgl' is already registered, and later it fails on the first tensor operation as not implemented.

can you point me to your branch and i'll just do a full rebuild myself? it's easier than trying to mix & match.

annxingyuan commented 3 years ago

Sure - here you go: https://github.com/tensorflow/tfjs/pull/4240

vladmandic commented 3 years ago

build works, but unfortunately it doesn't improve webgl memory consumption - it's actually slightly higher in the first phase and the same in the second phase (where the peak occurs).

in both cases, deallocation triggered by WEBGL_DELETE_TEXTURE_THRESHOLD=0 works fine and webgl memory consumption returns to baseline. the excessive usage occurs only during the inference itself.

if anything, the #4240 branch improves final deallocation a bit, but it's hard to tell.

tested with efficientdet-d2 and an 800px input picture.

x-axis is ticks in seconds, y-axis is from 0 to 4GB (the high baseline on my system is due to dual 4k monitors, so the idle system consumes ~1.2GB of GPU memory).

using tfjs 2.7.0: (screenshot omitted)

using tfjs from the #4240 branch: (screenshot omitted)

vladmandic commented 3 years ago

a few more tests - it seems this proposed fix does solve one problem: if a model executes within 4GB, deallocation will now work and subsequent executions will continue to work.

but it doesn't solve the core issue - why is there such enormous gpu memory usage with webgl in the first place, compared to any other backend? i cannot get anything even remotely complex to execute within a 4gb gpu, so deallocation at the end doesn't help.
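one thing worth ruling out on the app side: `tf.tidy()` (real tfjs API) cannot track tensors created inside async calls such as `model.executeAsync()`, so output tensors have to be disposed manually with `tf.dispose()` (also real). a minimal sketch of that pattern, with the wrapper and the `tf` parameter (passed in for testability) being my own:

```javascript
// Sketch: manual cleanup around async inference, since tf.tidy()
// does not track tensors created inside async functions.
// model.executeAsync() and tf.dispose() are real tfjs APIs;
// the wrapper is mine.
async function runAndDispose(tf, model, input) {
  const output = await model.executeAsync(input);
  const tensors = Array.isArray(output) ? output : [output];
  // download results to CPU before releasing the GPU textures
  const results = await Promise.all(tensors.map(t => t.data()));
  tf.dispose(tensors); // release the textures backing the outputs
  return results;
}
```

this only bounds leakage between runs, though; it does nothing about the peak during a single inference, which is the problem here.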

vladmandic commented 3 years ago

also, not sure if it's feasible given it's lossy compression, and not sure how tfjs works with such textures, but perhaps it's worth taking a look at

vladmandic commented 3 years ago

@pyu10055 @annxingyuan @rthadur sorry to bug you, but is there any update on this issue? no progress for over two months, and it's pretty much a blocker for one of my projects, as complex models simply cannot be used at all.

vladmandic commented 3 years ago

just to confirm, the issue is pretty much the same in tfjs 3.1.0 as it was with tfjs 2.7.0 when i reported it; just the message changed from Error: Failed to compile fragment shader. (webgl_util.js:82) to Error: Failed to link vertex and fragment shaders. (webgl_util.js:117)

vladmandic commented 3 years ago

@rthadur @pyu10055

this issue (and a similar one under #4129) has been open since October and is sitting idle without assignment?

vladmandic commented 3 years ago

@rthadur @jinjingforever @annxingyuan

one more ping regarding an issue that has been open since october 2020 without any progress?

jinjingforever commented 3 years ago

Sorry... I currently don't have the bandwidth to tackle this issue. Not sure if anybody else has time to take this? @rthadur @mattsoulanille @pyu10055

gaikwadrahul8 commented 1 year ago

Hi, @vladmandic

Apologies for the delayed response. We're re-visiting our older issues and checking whether they have been resolved, so may I know whether you are still looking for a solution or whether your issue has been resolved?

If the issue still persists with the latest version of TFJS, please let us know, along with an error log and a code snippet so we can replicate the issue on our end.

Could you please confirm whether this issue is resolved for you? Please feel free to close the issue if it is resolved. Thank you!

vladmandic commented 1 year ago

any improvements in this area would be very welcome, but i guess it's one of those "it-is-what-it-is" situations...