Closed nsthorat closed 5 years ago
From @riatzukiza on March 18, 2018 19:1
The program still renders the sim; one would not know anything was wrong unless they opened the debug tools.
Hi @riatzukiza, moving this issue, but is your "render" call coming from inside a tidy()? What's likely happening is the tensor is disposed by the time the data() promise resolves.
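If that is the cause, one way to avoid it is to return the tensor from tidy() so that it survives the cleanup, and to await data() outside of tidy(). A minimal sketch, assuming a hypothetical step function that computes the next simulation state and a hypothetical draw callback that paints the canvas; the tf namespace is passed in as a parameter, and none of these helper names come from the original code:

```javascript
// Hypothetical helper: `step` and `draw` stand in for the app's own
// simulation and canvas code; `tf` is the TensorFlow.js namespace.
async function stepAndRender(tf, state, step, draw) {
  // Intermediates created inside tidy() are disposed when the callback
  // returns, but the tensor the callback returns is kept alive.
  const next = tf.tidy(() => step(state));

  // data() is asynchronous, so it is awaited outside tidy(), while `next`
  // has not been disposed.
  const values = await next.data();
  draw(values);

  // The caller owns `next` and is responsible for disposing the old state.
  return next;
}
```

The key point is only the ordering: nothing that is still waiting on a data() promise should be left for tidy() to clean up.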
Hi, I am new to TensorFlow and neural networks, and I get a similar error when I try to train my two-output network with await this.nnet.model.fit(xTrain, [yTrain1, yTrain2]. I do not know whether this is due to a faulty setting on my side or a bug in the library.
Error:
Uncaught Error: WebGL backend: No data found for this tensor. Did you change your backend in the middle of the program? New backends can't use Tensors created with previous backends
at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.throwIfNoData (backend_webgl.js:805)
at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.uploadToGPU (backend_webgl.js:811)
at backend_webgl.js:756
at Array.map (<anonymous>)
at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.compileAndRun (backend_webgl.js:755)
at MathBackendWebGL../node_modules/@tensorflow/tfjs-core/dist/kernels/backend_webgl.js.MathBackendWebGL.conv2d (backend_webgl.js:681)
at environment_1.ENV.engine.runKernel.x (conv.js:81)
at Engine../node_modules/@tensorflow/tfjs-core/dist/engine.js.Engine.runKernel (engine.js:72)
at ./node_modules/@tensorflow/tfjs-core/dist/ops/conv.js.ConvOps.conv2d (conv.js:81)
at operation.js:11
at Object../node_modules/@tensorflow/tfjs-core/dist/tracking.js.Tracking.tidy (tracking.js:36)
at Object.descriptor.value [as conv2d] (operation.js:11)
at Object.conv2dWithBias (tfjs_backend.js:812)
at Conv2D../node_modules/@tensorflow/tfjs-layers/dist/layers/convolutional.js.Conv.call (convolutional.js:100)
at topology.js:382
Could you share your full code with us?
Sure. Here is the explanation, and I'm sorry that the many comments make the code harder to read.
set up my model: https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/TicTacToeNNet.js#L39
demo site (predicting with the pretrained model works, but self-training throws the error): https://grimmer.io/alphago-zero-tictactoe-js/ The pretrained Keras model's input is a little different from the one I set up. Its input shape is [3x3], followed by a reshape and then the CNN. Mine is [3, 3, 1], followed by the CNN.
I ported it from another project, and its training flow is model.prediction. Prediction works here, too, although it uses an untrained model to predict first.
When I press the button "Start self-Train" I get no errors - is that the right button that throws the error?
Yes, it is the right button.
I just updated the site. The original step-1 simulation ran 25x3 episodes and could take 1 to 2 minutes. Now it runs 4x3 episodes and takes about 15 seconds. After that, it starts to train. Please try again.
update: I have deleted some comments to make the code a little easier to read. I also removed the try-catch around that error, so you should be able to see it. If it does not appear, you may need to clear the cache or use another browser (the React framework I use tries to save assets in the cache).
I noticed that backend_webgl.js.MathBackendWebGL.throwIfNoData indicates that some dataID is missing, so I searched for dataID in the library code and also set some breakpoints.
When my model is set up, for example in https://github.com/grimmer0125/alphago-zero-tictactoe-js/blob/master/src/tictactoe/tensorflow/TicTacToeNNet.js#L40
const input = tf.input({ shape: [this.board_x, this.board_y, 1] });
const h_conv1 = normalize1().apply(normalize1().apply(conv2d_padding().apply(input)));
Some actions will try to allocate tensor related objects, but
// tensor.js:
TensorBuffer.prototype.toTensor = function () {
  return Tensor.make(this.shape, { values: this.values }, this.dtype); // no pass dataID
};
->
// tensor.js:
Tensor.make = function (shape, data, dtype) {
  return new Tensor_1(shape, dtype, data.values, data.dataId); // no dataID
};
->
// backend_webgl.js:
MathBackendWebGL.prototype.register = function (dataId, shape, dtype) {
  if (this.texData.has(dataId)) {
    throw new Error('Data buffer is already registered');
  }
  this.texData.set(dataId, { // no dataID
    shape: shape,
    dtype: dtype,
    values: null,
    texture: null,
    texShape: null,
    texType: tex_util_1.TextureType.FLOAT
  });
};
This may be a wrong guess.
Apologies for the bad stack trace, I cloned your repo and am having trouble finding the TensorFlow.js entry point where this is throwing as well.
Can you reproduce this issue without the overhead of the rest of the app? (Since it's a little large, it will take me a bit to wrap my head around.) Could you reproduce it with the same model topology and some dummy data? If you can do that I can look deeper; React is making the stack traces a little difficult.
Yes, these two are good suggestions.
When I tried to simplify and use dummy data, I noticed a key point:
Training works if I do not call model.predict several times first.
Training throws an exception if I call model.predict several times first and then train with dummy data.
Training works if I just call model.predict first without the simulation.
Here "works" means that calling model.fit succeeds; I only tested calling it once.
The algorithm I use requires me to call model.predict first to get some initially random game data, and this step seems to affect the training that follows.
So I can still try to simplify the React part, but the part that runs the simulation and calls model.predict several times first may need to be kept to reproduce this issue.
I have created another branch excluding react and add a button to predict many times + train once. https://github.com/grimmer0125/alphago-zero-tictactoe-js/tree/simplifiedToTest
update: It is weird. There are two buttons in this version: btn-A does the same as the previous React version, while btn-B skips the simulation process. Both run prediction many times and then start training, but only btn-A throws the exception.
After comparing the code behind these two buttons, I found the key difference. Thank you again for your suggestions.
// Coach.js
// this.pnet is the instance of NNetWrapper
this.pnet = deepcopy(this.nnet); // !!!!!! <-key point !!!!!!!!!!!!
await this.nnet.train(flattenExamples); // start train
// NNet.js
export class NNetWrapper extends NeuralNet {
constructor(game) {
// this.nnet is the instance of TicTacToeNNet, the same property name
this.nnet = new TicTacToeNNet(game, args);
}
}
// TicTacToeNNet.js
export default class TicTacToeNNet {
constructor(game, args) {
this.model = tf.model({ inputs: input, outputs: [output1, output2] });
}
}
If I remove the line this.pnet = deepcopy(this.nnet); then training does not throw any exception (at least in my one-time training test). This means that if an object nested inside another object has a tf.model property, calling deepcopy on the outer object affects some internal state of TensorFlow.js/WebGL, and the result can be an exception.
The reason I use deepcopy on this object is to recover the related tf.model to a saved state: if the newly trained model is not good, it needs to go back to its state before training (AlphaGo Zero algorithm). Using deepcopy is my proposed workaround; the original Python code uses tf.train.Saver().save/restore to achieve this (saving and restoring the training weights).
Ahhh yes! That makes a lot of sense. Basically what happens is you deep copy the tensor, the data ID also gets copied, but TensorFlow.js doesn't know about it.
So the first Tensor gets cleaned up, destroying that data bucket (keyed by data ID). The next time you access the second data bucket we don't know about it.
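The sequence described above can be illustrated with a toy model. This is only a sketch, not TensorFlow.js source code: ToyBackend, its methods, and the numeric data IDs are all hypothetical, and the reference counting is an assumption made to keep the example small. It shows a deep-copied handle losing its data once the original is disposed, while a clone that the backend knows about keeps working:

```javascript
// Toy model only: every name here is hypothetical and none of this is
// TensorFlow.js source code.
class ToyBackend {
  constructor() {
    this.buckets = new Map(); // data ID -> { values, refCount }
    this.nextId = 0;
  }
  // Stores values in a new bucket and returns a tensor-like handle.
  create(values) {
    const dataId = this.nextId++;
    this.buckets.set(dataId, { values, refCount: 1 });
    return { dataId };
  }
  // A new handle for the same bucket; the backend records the extra owner.
  clone(tensor) {
    this.buckets.get(tensor.dataId).refCount += 1;
    return { dataId: tensor.dataId };
  }
  // Drops one owner; the bucket is destroyed when no owners remain.
  dispose(tensor) {
    const bucket = this.buckets.get(tensor.dataId);
    bucket.refCount -= 1;
    if (bucket.refCount === 0) this.buckets.delete(tensor.dataId);
  }
  read(tensor) {
    const bucket = this.buckets.get(tensor.dataId);
    if (!bucket) throw new Error('No data found for this tensor.');
    return bucket.values;
  }
}

// A generic deep copy duplicates the handle, data ID included, without
// telling the backend that a second owner exists.
function deepCopy(obj) {
  return JSON.parse(JSON.stringify(obj));
}

const backend = new ToyBackend();

const a = backend.create([1, 2, 3]);
const aCopy = deepCopy(a); // the backend still counts one owner
backend.dispose(a);        // the bucket is destroyed
try {
  backend.read(aCopy);
} catch (err) {
  console.log(err.message); // the copy points at a missing bucket
}

const b = backend.create([4, 5, 6]);
const bClone = backend.clone(b); // the backend counts two owners
backend.dispose(b);              // the bucket survives
console.log(backend.read(bClone));
```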
Just FYI, you can use ".clone()" to clone a Tensor. It will return a new Tensor, however clone() is extremely cheap. Under the covers we create another "shell" Tensor pointing to the same data ID.
Nice job finding that!
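For the rollback use case discussed above (returning to the weights from before training), one alternative to deep-copying the wrapper is to snapshot the model's weights and write them back later. A minimal sketch, assuming the model exposes getWeights() and setWeights() as TensorFlow.js layers models do; the three helper names are hypothetical, and this has not been checked against the library version used in this thread:

```javascript
// Hypothetical helpers; `model` is assumed to expose getWeights() and
// setWeights(), and each weight tensor to expose clone() and dispose().

// Clones the current weights so that later training cannot change the
// snapshot.
function snapshotWeights(model) {
  return model.getWeights().map((w) => w.clone());
}

// Writes a snapshot back into the model, undoing any training done since
// the snapshot was taken.
function restoreWeights(model, snapshot) {
  model.setWeights(snapshot);
}

// Frees the snapshot tensors once they are no longer needed.
function disposeSnapshot(snapshot) {
  snapshot.forEach((w) => w.dispose());
}

// Possible use in the flow from this thread (`newModelIsWorse` is a
// placeholder for the app's own comparison):
//   const saved = snapshotWeights(this.nnet.nnet.model);
//   await this.nnet.train(flattenExamples);
//   if (newModelIsWorse) restoreWeights(this.nnet.nnet.model, saved);
//   disposeSnapshot(saved);
```

This only covers the layer weights; any optimizer state is not captured by the snapshot.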
Since this issue, we've done global tracking of tensors, as well as transferring tensors between backends, so this error is likely outdated.
From @riatzukiza on March 18, 2018 18:52
Error
I was struggling with a memory leak; after I fixed the leak, I started getting this error.
Code
I am rendering data to a canvas, so I have to call data every frame.
deeplearn logic
Logic for the implementation of Conway's Game of Life.
rendering
Then I iterate over every element of tensor.data(); the error points at
Copied from original issue: tensorflow/tfjs-core#865