tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

TFDF sets global state on async model execution #7571

Open mattsoulanille opened 1 year ago

mattsoulanille commented 1 year ago


Describe the current behavior

TFDF sets global state when it executes an async model. This global state is used as an input to the SimpleMLLoadModelFromPathWithHandle (which also sets global state) and SimpleMLInferenceOpWithHandle ops.

This is a problem because multiple async models can run concurrently. A model might be architected such that it sets the global state, yields to another task, and then reads the global state back. This is a race condition: if the other task also sets the global state in the meantime, the model reads a different global state and produces an incorrect result.
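To make the interleaving concrete, here is a minimal, self-contained sketch of the hazard with hypothetical names; it is not TFDF's actual code, just the shape of the race:

```ts
let globalState: string | null = null; // stands in for TFDF's global state

async function runModel(name: string): Promise<string> {
  globalState = name; // set global state before "running" the model
  // executeAsync yields to the event loop between ops; simulate that yield.
  await new Promise((resolve) => setTimeout(resolve, 0));
  // Another concurrent run may have overwritten globalState by now.
  return globalState!;
}

async function main() {
  const [a, b] = await Promise.all([runModel('modelA'), runModel('modelB')]);
  console.log(a, b); // Prints 'modelB modelB': modelA read modelB's state.
}

main();
```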

Describe the expected behavior

TFDF models should not rely on mutable global state.

One solution is to register a custom op for each TFDF model that refers to its own state instead of the global state (see the sketch below). This could still have issues if model.executeAsync is called a second time before the first run finishes, but since we only store the TFDF model in this state so that we can run it during inference, and inference does not actually change the model, that case would be fine.

This approach would also let us remove the separate TFDFModel class, which wraps the TFJS GraphModel and just sets the global state before calling it: with each model storing its state in its own custom op, there is no longer any global state to set.
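As a rough sketch of what this could look like, assuming the converter's tf.registerOp API and a hypothetical ModelRunner type standing in for TFDF's assets and model runner (the real wiring would differ):

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical stand-in for TFDF's per-model assets and model runner.
interface ModelRunner {
  infer(inputs: tf.Tensor[]): tf.Tensor;
}

// Register a uniquely named inference op for one loaded model. The closure
// captures that model's runner, so execution never reads shared global state.
function registerPerModelInferenceOp(modelId: string, runner: ModelRunner) {
  tf.registerOp(`SimpleMLInferenceOpWithHandle_${modelId}`, (node) =>
    runner.infer(node.inputs)
  );
}
```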

Another approach is to store the assets and model runner in a constant node or a resource and add it as an input to the SimpleMLLoadModelFromPathWithHandle and SimpleMLInferenceOpWithHandle ops. This is probably better than the first approach since it doesn't require registering custom ops.
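One way to picture that: keep a table mapping an integer handle to each model's runner, embed the handle in the graph as a constant, and have the ops resolve their state through the handle input. A hedged sketch with invented names, reusing the ModelRunner interface from the sketch above:

```ts
import * as tf from '@tensorflow/tfjs';

interface ModelRunner { infer(inputs: tf.Tensor[]): tf.Tensor; } // as above

// Hypothetical handle table; invented names, not TFDF's actual code.
const runners = new Map<number, ModelRunner>();
let nextHandle = 0;

// Store a runner and return a scalar tensor holding its handle. The graph can
// then feed this tensor as an extra input to the SimpleML* ops.
function createRunnerHandle(runner: ModelRunner): tf.Tensor {
  const handle = nextHandle++;
  runners.set(handle, runner);
  return tf.scalar(handle, 'int32');
}

// Inside an op: read the handle input and look up that model's runner,
// instead of reading a single shared global.
function lookupRunner(handleTensor: tf.Tensor): ModelRunner {
  const handle = handleTensor.dataSync()[0];
  const runner = runners.get(handle);
  if (runner == null) throw new Error(`No runner for handle ${handle}`);
  return runner;
}
```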

I'd also like to remove loadTFDFModel and just rely on loadGraphModel if possible, but that's probably out of scope here.


pyu10055 commented 1 year ago

Looks like the assets and modelRunner should be in a lookup table.
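If it goes that way, the table would also need a cleanup path so that disposing a model frees its entry; a small hedged sketch building on the handle table above (invented API):

```ts
// Hypothetical cleanup: remove the model's entry when it is disposed so the
// lookup table does not leak runners for models no longer in use.
function disposeRunnerHandle(handleTensor: tf.Tensor): void {
  const handle = handleTensor.dataSync()[0];
  runners.delete(handle);
  handleTensor.dispose();
}
```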