tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

tf.conv2d produces different results for browser and node #4843

Closed: vladmandic closed this issue 3 years ago

vladmandic commented 3 years ago

i've been chasing down why the same image model (object detection) gives slightly different results in browser and node environments, and it comes down to the results of tf.conv2d being slightly different for exactly the same inputs.

in the browser, the cpu, webgl and wasm backends produce identical results (and WEBGL_CONV_IM2COL has no effect), but tfjs-node using the tensorflow backend produces a different result.

example code:

console.log('Input:', x.shape, x.size, 'sum:', x.reshape([786432]).sum().dataSync()[0]); // input does not change (checked values)
console.log('Filter:', params.filters.shape, params.filters.size, 'sum:', params.filters.reshape([864]).sum().dataSync()[0]); // params do not change (checked values)
console.log('Strides', strides);

let out = tf.conv2d(x, params.filters, strides, 'same');

console.log('Conv2d 1st 5 values:', out.shape, out.size, out.dataSync().slice(0, 5)); // output has different values!
console.log('Conv2D sum of all values:', tf.reshape(out, [2097152]).sum().dataSync()[0]); // silly sum just to see how much results diverged

browser output:

Input: [ 1, 512, 512, 3 ] 786432 sum: -631754.625
Filter: [ 3, 3, 3, 32 ] 864 sum: 0.07897007465362549
Strides [ 2, 2 ]
Conv2d 1st 5 values: [ 1, 256, 256, 32 ] 2097152 Float32Array(5) [ 0.02585916966199875, 0, 0, 0, 0 ]
Conv2D sum of all values: -23342.779296875

node output:

Input: [ 1, 512, 512, 3 ] 786432 sum: -631754.625
Filter: [ 3, 3, 3, 32 ] 864 sum: 0.07897007465362549
Strides [ 2, 2 ]
Conv2d 1st 5 values: [ 1, 256, 256, 32 ] 2097152 Float32Array(5) [ 0.026323730126023293, 0, 0, 0, 0 ]
Conv2D sum of all values: -24542.615234375

you can see that the very first value already differs (by ~1.8%) and that a simple checksum is off by ~5%
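To put a number on the gap, the logged values can be plugged into a relative-difference formula. This is only a small sketch: `relDiffPercent` is a hypothetical helper, and the constants are copied from the browser and node logs above.

```javascript
// Relative difference between two scalars, as a percentage of the first one.
function relDiffPercent(a, b) {
  return (Math.abs(a - b) / Math.abs(a)) * 100;
}

// Values copied from the browser and node logs above.
const firstBrowser = 0.02585916966199875;
const firstNode = 0.026323730126023293;
const sumBrowser = -23342.779296875;
const sumNode = -24542.615234375;

const firstDiff = relDiffPercent(firstBrowser, firstNode);
const sumDiff = relDiffPercent(sumBrowser, sumNode);

console.log('first value differs by', firstDiff.toFixed(2), '%'); // 1.80 %
console.log('checksum differs by', sumDiff.toFixed(2), '%');      // 5.14 %
```

The checksum gap is larger than the gap on the first element, which is consistent with the ~2-5% range reported later in the thread.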

environment: tfjs 3.3.0 on chrome 89 and ubuntu 20.10

pyu10055 commented 3 years ago

@vladmandic I think this might be related to the input data. Can you verify that the inputs are the same for node and browser? I suspect fromPixels and decodeJpeg might produce different pixel values.

vladmandic commented 3 years ago

@pyu10055 that's the first thing i thought of as well :)

and yes, decodeJpeg and fromPixels do produce different results - specifically, RGB values in fromPixels are offset by +1
i've also double-checked the behavior of alignCorners and similar options when performing resizeBilinear

but i've handled that, and that's why i'm now printing the checksum of the input (after normalization) - to confirm the input is 100% identical
(if there were any differences, i'd have implemented something like canvas.js decoding, which is uniform on both platforms)
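A sum is a weak checksum, since errors of opposite sign can cancel out. A stricter check is an element-wise comparison of the values on both platforms. The sketch below is plain JavaScript; `maxAbsDiff` is a hypothetical helper, and it assumes the data has already been pulled out of the tensors (for example with `dataSync()`, as in the original snippet).

```javascript
// Largest element-wise absolute difference between two equally sized arrays.
// Returns Infinity when the lengths differ, so a size mismatch never passes.
function maxAbsDiff(a, b) {
  if (a.length !== b.length) return Infinity;
  let worst = 0;
  for (let i = 0; i < a.length; i++) {
    const d = Math.abs(a[i] - b[i]);
    if (d > worst) worst = d;
  }
  return worst;
}

// Two made-up arrays that differ in two places but still have the same sum.
const a = Float32Array.from([0.5, -0.25, 0.75, -1.0]);
const b = Float32Array.from([0.25, 0.0, 0.75, -1.0]);
const total = (arr) => arr.reduce((acc, v) => acc + v, 0);

console.log('sums equal:', total(a) === total(b)); // true
console.log('max abs diff:', maxAbsDiff(a, b));    // 0.25
```

Here the sums match even though two elements differ, which is exactly the case a sum-based check cannot catch.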

pyu10055 commented 3 years ago

@vladmandic WebGL has precision loss when values are stored in textures; it is usually rather small. The input sum being negative seems weird; is it overflowing already?

vladmandic commented 3 years ago

@pyu10055

> WebGL has precision loss when values are stored in textures; it is usually rather small

The thing is, WebGL and WASM produce results that are identical up to the 5th decimal place (after that it's down to WebGL precision loss)
But tfjs-node produces results which are ~2-5% different from either WebGL or WASM, which is not small
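When two backends disagree by this much, a slow reference convolution in plain JavaScript can help decide which one is off. The sketch below is not the tfjs or TensorFlow implementation; `conv2dSame` is a hypothetical helper that handles a single image in HWC layout with the filter laid out as [kH, kW, inC, outC], and it assumes the TensorFlow convention for 'same' padding (output size is ceil(in / stride), with any odd padding going to the bottom and right). For the shapes in this issue (512 input, 3x3 filter, stride 2) that convention gives a 256 output with one row and one column of padding at the bottom and right only.

```javascript
// Reference conv2d for one image with TensorFlow-style 'same' padding.
// input:  flat array, shape [height, width, inC]
// filter: flat array, shape [kH, kW, inC, outC]
function conv2dSame(input, [height, width, inC], filter, [kH, kW, outC], [sH, sW]) {
  const outH = Math.ceil(height / sH);
  const outW = Math.ceil(width / sW);
  // Total padding needed; the smaller half goes to the top/left.
  const padH = Math.max((outH - 1) * sH + kH - height, 0);
  const padW = Math.max((outW - 1) * sW + kW - width, 0);
  const padTop = Math.floor(padH / 2);
  const padLeft = Math.floor(padW / 2);
  const out = new Float64Array(outH * outW * outC); // accumulate in float64
  for (let oy = 0; oy < outH; oy++) {
    for (let ox = 0; ox < outW; ox++) {
      for (let ky = 0; ky < kH; ky++) {
        const iy = oy * sH + ky - padTop;
        if (iy < 0 || iy >= height) continue; // zero padding
        for (let kx = 0; kx < kW; kx++) {
          const ix = ox * sW + kx - padLeft;
          if (ix < 0 || ix >= width) continue; // zero padding
          for (let c = 0; c < inC; c++) {
            const v = input[(iy * width + ix) * inC + c];
            for (let o = 0; o < outC; o++) {
              const w = filter[((ky * kW + kx) * inC + c) * outC + o];
              out[(oy * outW + ox) * outC + o] += v * w;
            }
          }
        }
      }
    }
  }
  return { out, outH, outW };
}

// Tiny check: 4x4 image of ones, 3x3 filter of ones, stride 2.
const ones = (n) => new Float32Array(n).fill(1);
const ref = conv2dSame(ones(16), [4, 4, 1], ones(9), [3, 3, 1], [2, 2]);
console.log(ref.outH, ref.outW, Array.from(ref.out)); // 2 2 [ 9, 6, 6, 4 ]
```

Because it accumulates in float64, this version is only meant as an arbiter for small inputs or a handful of output positions, not as a replacement for any backend.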

> The input sum being negative seems weird; is it overflowing already?

Input here is just an image resized to 1x512x512x3 and normalized
Sum is just a (very) cheap way to do a hash to make sure the inputs are the same, but given the size of the array it's no wonder it's overflowing.
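Whether the sum actually overflows can be checked from the logged numbers alone. The sketch below uses only the input sum and size printed above; the float32 maximum is the standard IEEE 754 value, and the normalization range is not stated in the thread, so any reading of the mean is an assumption.

```javascript
const inputSum = -631754.625; // from the logs above
const inputSize = 786432;     // 1 * 512 * 512 * 3
const float32Max = 3.4028234663852886e38; // largest finite float32

// Average value per element after normalization.
const mean = inputSum / inputSize;
console.log('mean input value:', mean.toFixed(4)); // -0.8033

// The sum is far below the float32 limit...
console.log('within float32 range:', Math.abs(inputSum) < float32Max); // true
// ...and is exactly representable as a float32.
console.log('exact in float32:', Math.fround(inputSum) === inputSum); // true
```

With these numbers the sum sits well inside the float32 range, so a negative sum more likely reflects a normalization that maps pixels to negative values than an overflow.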

But...

I've just tried with TFJS 3.5.0, where tfjs-node ships with TF2, and the difference is almost gone
(sum of conv2d values now shows divergence of ~0.15% - that is at least 25x improvement)
(and more importantly, model predictions actually match)

So I guess the bug was in the TF1 implementation of conv2d, and updating TFJS to use TF2 finally resolved this issue as well

Feel free to close the issue

pyu10055 commented 3 years ago

that is great to know, thanks!
