tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

no data found for this tensor #141

Closed nsthorat closed 5 years ago

nsthorat commented 6 years ago

From @riatzukiza on March 18, 2018 18:52

Error

I was struggling with a memory leak. After I fixed the leak, I started getting this error.

bundle.js:14641 Uncaught (in promise) Error: WebGL backend: No data found for this tensor. Did you change your backend in the middle of the program? New backends can't use Tensors created with previous backends
    at MathBackendWebGL.throwIfNoData (bundle.js:14641)
    at MathBackendWebGL.readSync (bundle.js:14048)
    at MathBackendWebGL.<anonymous> (bundle.js:14093)
    at step (bundle.js:13922)
    at Object.next (bundle.js:13903)
    at fulfilled (bundle.js:13894)
    at <anonymous>

Code

I am rendering data to a canvas, so I have to call data() every frame.

deeplearn logic

Logic for the implementation of Conway's Game of Life.

    var kernel = dl.reshape(dl.tensor2d([
        [1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]
    ]), [3, 3, 1, 1]);
    var state0Tensor = dl.randomUniform([H, W]).greater(dl.scalar(0.5, "float32"));
    var state = dl.variable(dl.cast(dl.reshape(state0Tensor, [1, H, W, 1]), "float32"));
    var step = (function step$() {

        var newState = dl.tidy((() => {

            var neighbors = dl.conv2d(state, kernel, [1, 1, 1, 1], "same");
            var survive = dl.logicalAnd(dl.equal(state, dl.scalar(1, "float32")), dl.equal(neighbors, dl.scalar(2, "float32"))),
                born = dl.equal(neighbors, dl.scalar(3, "float32"));
            return dl.cast(dl.logicalOr(survive, born), "float32");

        }));
        state.assign(newState);
        newState.dispose();
        return state;
    });

rendering

Then I iterate over every element of tensor.data(). The error points at:

return state.data().then(((d) => {
                            ^
    render(canvas = this.canvas, state = this.state, shape = this.shape, imageData = this.imageData, ctx = this.ctx) {

        if (!(running__QUERY)) {
            return false;
        };
        var height = shape[0],
            width = shape[1];
        return state.data().then(((d) => {

            var j = 0,
                k = 0;
            for (var i = 0; i < (width * height); ++(i)) {
                j = (i * 4);;
                this._renderCell(d[i], j, imageData)
            };
            return ctx.putImageData(imageData, 0, 0);

        }));

    }

Copied from original issue: tensorflow/tfjs-core#865

nsthorat commented 6 years ago

From @riatzukiza on March 18, 2018 19:1

The program still renders the simulation; one would not know anything was wrong without opening the debug tools.

nsthorat commented 6 years ago

Hi @riatzukiza, moving this issue, but is your "render" call coming from inside a tidy()? What's likely happening is the tensor is disposed by the time the data() promise resolves.
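
The race described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyTensor, racyRead, and safeRead are hypothetical names, and the class merely imitates an asynchronous read that fails if the tensor was disposed before the read completed. It is not the TensorFlow.js implementation.

```javascript
// Toy stand-in for a tensor whose values live in a backend store.
// NOT the TensorFlow.js implementation; all names are hypothetical.
class ToyTensor {
  constructor(values) {
    this.values = Float32Array.from(values);
    this.disposed = false;
  }

  // Asynchronous read: the lookup happens on a later tick, roughly the
  // way a GPU readback completes after the call has already returned.
  data() {
    return new Promise((resolve, reject) => {
      setTimeout(() => {
        if (this.disposed) {
          reject(new Error("No data found for this tensor."));
        } else {
          resolve(this.values.slice());
        }
      }, 0);
    });
  }

  dispose() {
    this.disposed = true;
    this.values = null;
  }
}

// Racy: the read is started, then the tensor is disposed before it resolves.
function racyRead() {
  const t = new ToyTensor([1, 0, 1]);
  const pending = t.data();
  t.dispose();
  return pending;
}

// Safe: wait for the values first, dispose afterwards.
async function safeRead() {
  const t = new ToyTensor([1, 0, 1]);
  const values = await t.data();
  t.dispose();
  return values;
}
```

In this toy, racyRead() rejects while safeRead() resolves with the values. If the same ordering problem exists in the real program, keeping the tensor alive until the data() promise has resolved would be the thing to check.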

grimmer0125 commented 6 years ago

Hi, I am new to TensorFlow and neural networks, and I get a similar error when I try to train my two-output network with a call of the form await this.nnet.model.fit(xTrain, [yTrain1, yTrain2], ...). I do not know whether this is due to a mistake in my setup or a bug in the library.

Error:

Uncaught Error: WebGL backend: No data found for this tensor. Did you change your backend in the middle of the program? New backends can't use Tensors created with previous backends
    at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.throwIfNoData (backend_webgl.js:805)
    at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.uploadToGPU (backend_webgl.js:811)
    at backend_webgl.js:756
    at Array.map (<anonymous>)
    at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.compileAndRun (backend_webgl.js:755)
    at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.conv2d (backend_webgl.js:681)
    at environment_1.ENV.engine.runKernel.x (conv.js:81)
    at Engine../node_modules/@tensorflow/tfjs-core/dist/engine.js.Engine.runKernel (engine.js:72)
    at ./node_modules/@tensorflow/tfjs-core/dist/ops/conv.js.ConvOps.conv2d (conv.js:81)
    at operation.js:11
    at Object../node_modules/@tensorflow/tfjs-core/dist/tracking.js.Tracking.tidy (tracking.js:36)
    at Object.descriptor.value [as conv2d] (operation.js:11)
    at Object.conv2dWithBias (tfjs_backend.js:812)
    at Conv2D../node_modules/@tensorflow/tfjs-layers/dist/layers/convolutional.js.Conv.call (convolutional.js:100)
    at topology.js:382

nsthorat commented 6 years ago

Could you share your full code with us?

grimmer0125 commented 6 years ago

Sure. Here is the explanation. I'm sorry that the code has so many comments that it is hard to read.

set up my model: https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/TicTacToeNNet.js#L39

train https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/NNet.js#L30

predict: https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/NNet.js#L94

demo site (predicting with the pretrained model works, but self-training throws the error): https://grimmer.io/alphago-zero-tictactoe-js/ The pretrained Keras model's input is a little different from the one I set up. Its input shape is [3, 3], followed by a reshape and then the CNN. Mine is [3, 3, 1], followed directly by the CNN.

I ported it from another project, and its training flow is:

  1. Run some simulations and collect output from model.predict. Prediction works here too, although at this point it is using an untrained model.
  2. After collecting enough data, start training. This is where the above error appears.

nsthorat commented 6 years ago

When I press the button "Start self-Train" I get no errors - is that the right button that throws the error?

grimmer0125 commented 6 years ago

Yes, it is the right button.

I just updated the site. The original step 1 simulation ran 25x3 episodes and could take 1-2 minutes. Now it runs 4x3 episodes and takes about 15 seconds. After that, it starts training. Please try again.

update: I have deleted some comments to make the code a little easier to read. I also removed the try-catch around that error, so you should be able to see it. If it does not appear, you may need to clear the cache or use another browser (the React framework I use tries to keep assets in the cache).

grimmer0125 commented 6 years ago

I noticed that backend_webgl.js.MathBackendWebGL.throwIfNoData indicates that some dataId is missing, so I searched for dataId in the library code and set some breakpoints.

When my model is setup, for example, in https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/TicTacToeNNet.js#L40

    const input = tf.input({ shape: [this.board_x, this.board_y, 1] });
    const h_conv1 = normalize1().apply(normalize1().apply(conv2d_padding().apply(input)));

Some operations then try to allocate tensor-related objects, but no dataId seems to be passed along:

// tensor.js:
TensorBuffer.prototype.toTensor = function () {
    return Tensor.make(this.shape, { values: this.values }, this.dtype); // dataId is not passed
};  

->

// tensor.js:
    Tensor.make = function (shape, data, dtype) {
        return new Tensor_1(shape, dtype, data.values, data.dataId); // no dataID
    };

->

// backend_webgl.js:
    MathBackendWebGL.prototype.register = function (dataId, shape, dtype) {
        if (this.texData.has(dataId)) {
            throw new Error('Data buffer is already registered');
        }
        this.texData.set(dataId, { // no dataID
            shape: shape,
            dtype: dtype,
            values: null,
            texture: null,
            texShape: null,
            texType: tex_util_1.TextureType.FLOAT
        });
    };   

This may be a wrong guess.

nsthorat commented 6 years ago

Apologies for the bad stack trace, I cloned your repo and am having trouble finding the TensorFlow.js entry point where this is throwing as well.

Can you reproduce this issue without the overhead of the rest of the app? It's a little large, so it will take me a bit to wrap my head around. Could you reproduce it with the same model topology and some dummy data? If you can do that, I can look deeper; React is making the stack traces a little difficult to read.

grimmer0125 commented 6 years ago

Yes, these two are good suggestions.

While simplifying the code and switching to dummy data, I noticed a key point:

  1. Training works if I do not call model.predict several times first.
  2. Training throws an exception if I call model.predict several times first and then train with dummy data.
  3. Training works if I just call model.predict first without the simulation.

Here "works" means that model.fit completes successfully; I only tested calling it once.

The algorithm I use requires me to call model.predict first to get some initially random game data, and this step seems to affect the training that follows.

So I can still try to simplify the React part, but the simulation plus the repeated model.predict calls may need to be kept in order to reproduce this issue.

grimmer0125 commented 6 years ago

I have created another branch that excludes React and adds a button to predict many times and then train once. https://github.com/grimmer0125/alphago-zero-tictactoe-js/tree/simplifiedToTest

update: It is weird. There are two buttons in this version. Btn-A does the same thing as the previous React version; Btn-B skips the simulation process. Both run prediction many times and then start training, but only Btn-A throws the exception.

grimmer0125 commented 6 years ago

After comparing the code paths behind these two buttons, I found the key difference. Thank you again for your suggestions.

// Coach.js 
// this.pnet is the instance of NNetWrapper
this.pnet = deepcopy(this.nnet);  // !!!!!! <-key point !!!!!!!!!!!!
await this.nnet.train(flattenExamples); // start train 

// NNet.js
export class NNetWrapper extends NeuralNet {
  constructor(game) {
    // this.nnet is the instance of TicTacToeNNet, the same property name
    this.nnet = new TicTacToeNNet(game, args);
  }
}  

// TicTacToeNNet.js
export default class TicTacToeNNet {
  constructor(game, args) {
    this.model = tf.model({ inputs: input, outputs: [output1, output2] });
  } 
} 

If I remove the line this.pnet = deepcopy(this.nnet);, training does not throw any exception (at least in my single training test). This means that if an object nested inside another object has a tf.model as a property, deep copying the outer object affects some internal state of TensorFlow.js/WebGL, and the result can be an exception.

The reason I deep copy this object is to be able to restore the related tf.model to a saved state: if the newly trained model is not good, it needs to go back to its state before training (AlphaGo Zero algorithm). Using deepcopy was my attempted workaround; the original Python code uses tf.train.Saver().save/restore to achieve this (saving and restoring the training weights).
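
The save/restore idea mentioned here can be sketched without deep copying the model wrapper, by snapshotting only the numeric weight values. This is a minimal sketch under loud assumptions: the model below is a hypothetical plain object whose weights are typed arrays, and saveWeights, restoreWeights, and fakeTrain are illustrative names, not TensorFlow.js functions. How this maps onto a real tf.model is left open.

```javascript
// Hypothetical model: weights are plain typed arrays, not tensors.
function makeToyModel() {
  return { weights: [new Float32Array([0.1, 0.2]), new Float32Array([0.3])] };
}

// Snapshot: copy the numeric values only, nothing else from the model.
function saveWeights(model) {
  return model.weights.map((w) => Float32Array.from(w));
}

// Restore: write the saved values back into the existing arrays.
function restoreWeights(model, snapshot) {
  snapshot.forEach((saved, i) => model.weights[i].set(saved));
}

// Stand-in for a training step that changes every weight.
function fakeTrain(model) {
  model.weights.forEach((w) => {
    for (let i = 0; i < w.length; i++) {
      w[i] += 1;
    }
  });
}

const model = makeToyModel();
const snapshot = saveWeights(model);
fakeTrain(model);
const afterTrain = Array.from(model.weights[1]);
restoreWeights(model, snapshot);
const afterRestore = Array.from(model.weights[1]);
```

The point of the design is that the snapshot holds values rather than references into library-managed state, so rolling back does not depend on anything the library tracks internally.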

nsthorat commented 6 years ago

Ahhh yes! That makes a lot of sense. Basically, when you deep copy the tensor, the data ID also gets copied, but TensorFlow.js doesn't know about the copy.

So the first Tensor gets cleaned up, destroying that data bucket (keyed by data ID). The next time you access the data through the second Tensor, we no longer know about it.

Just FYI, you can use ".clone()" to clone a Tensor. It returns a new Tensor, but clone() is extremely cheap: under the covers we create another "shell" Tensor pointing to the same data ID.

Nice job finding that!
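
The explanation above can be modelled with a toy backend. This is a sketch under stated assumptions, not the TensorFlow.js source: buckets, makeTensor, cloneTensor, deepCopy, dispose, and read are hypothetical names, and the reference counting is only one plausible way to picture a tracked clone versus an untracked deep copy.

```javascript
// Toy backend: one "bucket" of values per numeric data ID, with a
// reference count. Hypothetical names; NOT the TensorFlow.js source.
const buckets = new Map(); // dataId -> { values, refCount }
let nextId = 0;

function makeTensor(values) {
  const dataId = nextId++;
  buckets.set(dataId, { values: values.slice(), refCount: 1 });
  return { dataId };
}

// Tracked clone: a new shell on the same bucket, and the backend is told.
function cloneTensor(t) {
  buckets.get(t.dataId).refCount += 1;
  return { dataId: t.dataId };
}

// Untracked deep copy: the data ID is copied, but the backend is not told.
function deepCopy(t) {
  return JSON.parse(JSON.stringify(t));
}

function dispose(t) {
  const bucket = buckets.get(t.dataId);
  if (!bucket) return;
  bucket.refCount -= 1;
  if (bucket.refCount === 0) buckets.delete(t.dataId);
}

function read(t) {
  const bucket = buckets.get(t.dataId);
  if (!bucket) throw new Error("No data found for this tensor.");
  return bucket.values;
}

// Deep copy, then dispose the original: the copy's bucket is gone.
const a = makeTensor([1, 2, 3]);
const aCopy = deepCopy(a);
dispose(a);
let deepCopyError = null;
try {
  read(aCopy);
} catch (e) {
  deepCopyError = e.message;
}

// Tracked clone, then dispose the original: the bucket survives.
const b = makeTensor([4, 5, 6]);
const bClone = cloneTensor(b);
dispose(b);
const cloneValues = read(bClone);
```

In this toy, reading through the deep copy fails once the original is disposed, while reading through the tracked clone still succeeds.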


dsmilkov commented 5 years ago

Since this issue was filed, we've added global tracking of tensors, as well as support for transferring tensors between backends, so this error is likely outdated.