tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Model.fit doesn't train properly likely due to GL_INVALID_OPERATION error #1381

Closed Chrisvin closed 5 years ago

Chrisvin commented 5 years ago

I followed the steps from "https://codelabs.developers.google.com/codelabs/tfjs-training-classfication/index.html?index=..%2F..index#0" to make a CNN that can be used for digit recognition.

The model used to work, but it no longer does. Something goes wrong during training: the loss and accuracy values (shown using tensorflow/tfjs-vis) drop drastically, and "[.WebGL-0000016FAEA3AAA0] GL_INVALID_OPERATION: Object cannot be used because it has not been generated." is logged multiple times in the browser console.

Once training completes and evaluation of the model begins (using the tfjs-vis per-class accuracy and confusion matrix), the screen blacks out completely for a split second and the expected output is not generated.

I have not changed the code since I last checked it a few weeks ago (when it worked perfectly), and the same issue arises in another project I am working on. Any help or advice as to what's going wrong would be much appreciated.
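
To pin down when training goes wrong, rather than reading it off the tfjs-vis chart, the per-batch logs can be recorded from a `model.fit` callback. The following is a minimal sketch, not part of the codelab: `makeTrainingWatcher` and the `maxAccDrop` threshold are hypothetical names, and it assumes the model was compiled with an accuracy metric that appears in the logs as `acc`.

```javascript
// Hypothetical helper: builds an onBatchEnd callback for model.fit() that
// records batches where the loss is not finite or accuracy falls sharply.
function makeTrainingWatcher(maxAccDrop = 0.3) {
  const suspicious = [];
  let prevAcc = null;
  return {
    suspicious,
    callbacks: {
      onBatchEnd: (batch, logs) => {
        const { loss, acc } = logs; // `acc` is an assumed metric key
        if (!Number.isFinite(loss)) {
          suspicious.push({ batch, reason: 'non-finite loss', loss, acc });
        } else if (
          prevAcc !== null &&
          Number.isFinite(acc) &&
          prevAcc - acc > maxAccDrop
        ) {
          suspicious.push({ batch, reason: 'accuracy drop', loss, acc });
        }
        // Remember the last usable accuracy for the next comparison.
        if (Number.isFinite(acc)) prevAcc = acc;
      },
    },
  };
}

// Usage (assumes `model`, `xs` and `ys` already exist, as in the codelab):
// const watcher = makeTrainingWatcher();
// await model.fit(xs, ys, { epochs: 10, callbacks: watcher.callbacks });
// console.table(watcher.suspicious);
```

Comparing the recorded batch numbers with the timing of the GL_INVALID_OPERATION messages in the console may show whether the two are related.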

caisq commented 5 years ago

@Chrisvin Thanks for reporting this issue.

What device are you using? What are your operating system and Chrome version?

Chrisvin commented 5 years ago

Device - Dell Latitude 7490 (I have also checked on a Lenovo Yoga 720 which also had the same issues)
OS - Windows 10
Chrome - Version 72.0.3626.121 (Official Build) (64-bit)

caisq commented 5 years ago

@Chrisvin Just to cover all the bases, have you tried restarting Chrome and/or rebooting the system to see whether that resolves the problem?

Chrisvin commented 5 years ago

Yes, I've tried restarting Chrome, clearing its cache and rebooting the system. I tried a different laptop to figure out whether the problem was specific to my system.

Chrisvin commented 5 years ago

@caisq Also, I load TensorFlow.js and tfjs-vis with the following script tags:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-vis@1.0.2/dist/tfjs-vis.umd.min.js"></script>

I have also tried older versions, as follows, but they result in the same issues:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.14.2/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-vis@0.4.2/dist/tfjs-vis.umd.min.js"></script>

Chrisvin commented 5 years ago

@caisq Just to ensure that I'm not doing something dumb with my code, I've checked some of the TensorFlow.js examples, and some of them don't work properly either.

For example, the MNIST example doesn't train properly and exhibits unexpected behavior: 'random' drops in accuracy and loss values, split-second screen blackening, and multiple graphs being created. Similar issues can be seen in the MNIST CNN Transfer Learning example.

This leads me to believe that the issue is on the TensorFlow.js side. Any updates on this would be appreciated.

caisq commented 5 years ago

cc @dsmilkov @nsthorat for thoughts on the lower-level WebGL issue.

bileschi commented 5 years ago

@dsmilkov: can you triage and assign this to the right person / priority? Thanks.

nsthorat commented 5 years ago

Thanks for reporting this issue -- we're working on getting our Windows machine up and running and will get back to you ASAP.

caisq commented 5 years ago

@Chrisvin Just a thought: what graphics card(s) do you have on your machine? Is it possible that there are two and that switching to the other one in your Windows settings might help?

@nsthorat @dsmilkov

Chrisvin commented 5 years ago

@caisq The Dell Latitude 7490 has an Intel(R) UHD Graphics 620. The Lenovo Yoga 720 has an Intel(R) HD Graphics 620. So, no, there aren't two graphics cards. Besides, all of these examples and models used to work perfectly well, so I assumed that these graphics cards were fully capable of handling these tasks, or am I misunderstanding something?
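
The GPU that the browser's WebGL context actually runs on can also be read from the page itself, which is a useful cross-check against the hardware listing. The sketch below uses the standard `WEBGL_debug_renderer_info` extension; `describeWebGLRenderer` is a hypothetical helper name, and some browsers may mask or withhold these strings.

```javascript
// Hypothetical helper: reads the GPU vendor/renderer strings from a WebGL
// context, falling back to the masked strings if the debug extension is
// unavailable.
function describeWebGLRenderer(gl) {
  if (!gl) {
    return { supported: false, vendor: null, renderer: null };
  }
  const ext = gl.getExtension('WEBGL_debug_renderer_info');
  if (!ext) {
    return {
      supported: true,
      vendor: gl.getParameter(gl.VENDOR),
      renderer: gl.getParameter(gl.RENDERER),
    };
  }
  return {
    supported: true,
    vendor: gl.getParameter(ext.UNMASKED_VENDOR_WEBGL),
    renderer: gl.getParameter(ext.UNMASKED_RENDERER_WEBGL),
  };
}

// In a browser, create a throwaway canvas and print what WebGL reports.
if (typeof document !== 'undefined') {
  const gl = document.createElement('canvas').getContext('webgl');
  console.log(describeWebGLRenderer(gl));
}
```

Pasting the printed object into a report like this one states exactly which renderer Chrome picked.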

dsmilkov commented 5 years ago

Hi @Chrisvin, we are waiting for a Windows machine to arrive so we can reproduce this problem.

In the meantime, since you mentioned that this used to work, if you can test and find the latest version of tf.js that makes your demo work (0.15.3, 0.15.2, 0.15.1, 0.15.0, 0.14.2,...?), that would be very helpful to us. Thank you!
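
A small sketch of how the versions above could be walked through: it generates the CDN URL for each candidate release, following the jsDelivr pattern of the script tags posted earlier in this thread. `tfjsCdnUrl` and `CANDIDATE_VERSIONS` are hypothetical names, and whether every listed version is published at that path is an assumption.

```javascript
// Versions suggested above, newest first.
const CANDIDATE_VERSIONS = ['0.15.3', '0.15.2', '0.15.1', '0.15.0', '0.14.2'];

// Hypothetical helper: builds the jsDelivr URL for a given tfjs release,
// mirroring the script tags quoted earlier in the thread.
function tfjsCdnUrl(version) {
  return `https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@${version}/dist/tf.min.js`;
}

// Print one URL per candidate; put each in the page's script tag in turn,
// reload, and note the newest version for which training behaves normally.
for (const version of CANDIDATE_VERSIONS) {
  console.log(version, tfjsCdnUrl(version));
}
```

After each reload, checking `tf.version` in the browser console should confirm which release was actually loaded rather than a cached copy.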

Chrisvin commented 5 years ago

@caisq @nsthorat @dsmilkov The TensorFlow.js examples and my projects have started working properly again, so I guess this issue can be closed now. Thanks for the quick replies and support, guys.

dsmilkov commented 5 years ago

Thanks Chrisvin, I'll close the issue - feel free to reopen if you see the problem again. If you find out what changed this time vs last time, please share it with us so we can improve our library and make sure this doesn't happen again.