microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

[BUG] Incorrect output for resnet50-float16 #431

Open · LeiWang1999 opened this issue 2 years ago

LeiWang1999 commented 2 years ago

🐛 Incorrect output for resnet50-float16

I generated CUDA code from a resnet50-float16.onnx model description, but running main_test gives incorrect output compared with onnxruntime. From my examination, the cause is the Constant_xx.bin files: the values they store are all zero.

Take Constant_0_0.bin (the first parameter loaded by the CUDA code) as an example.

For the fp32 CUDA code gen, the content of Constant_0_0.bin is:

-3.080576e-01 7.984307e-02 -1.190038e+00 -1.483669e+00 -5.135900e-01 3.682718e-01 -2.163917e+00 -8.705013e-01 -1.881243e+00 -1.607672e-01

For the fp16 CUDA code gen, the content of Constant_0_0.bin is:

-0.000000e+00 4.591775e-41 0.000000e+00 4.591775e-41 0.000000e+00 -0.000000e+00 4.591775e-41 -0.000000e+00 4.591775e-41 0.000000e+00
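
For anyone who wants to reproduce the comparison, here is a minimal standalone reader sketch. The filename and the count of 10 elements are just taken from the dumps above, and it relies on the header-only half library the project already uses; this is not NNFusion code:

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include <half.hpp> // half_float::half (the header-only library NNFusion uses)

int main()
{
    std::ifstream in("Constant_0_0.bin", std::ios::binary);
    std::vector<char> raw((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

    // Interpretation 1: the buffer holds fp32 elements.
    const float* f32 = reinterpret_cast<const float*>(raw.data());
    for (int i = 0; i < 10; ++i)
        std::printf("%e ", f32[i]);
    std::printf("\n");

    // Interpretation 2: the buffer holds fp16 elements.
    const half_float::half* f16 =
        reinterpret_cast<const half_float::half*>(raw.data());
    for (int i = 0; i < 10; ++i)
        std::printf("%e ", static_cast<float>(f16[i]));
    std::printf("\n");
    return 0;
}
```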

I traced the program, but it was hard for me to find the code that generates the values in constant.bin. (I did find the code that writes the file, in the CUDA code-gen pass, but not the code that produces the constant data itself.)

Any suggestions? How can I locate the constant-data generation step? I guess it may live in one of the graph passes?

mzmssg commented 2 years ago

Basically, the constants are generated from two sources:

  1. Converted from ONNX constants/initializers; the code is located around https://github.com/microsoft/nnfusion/blob/master/src/nnfusion/frontend/onnx_import/util/graph_convert.cpp#L372-L379 (a rough sketch of this path follows below)
  2. Additional constants added by NNFusion, mostly introduced by the consumer op of the constant

I think you can debug in these two directions.
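
For direction 1, the general shape is roughly the following. This is an illustrative sketch with made-up names, not NNFusion's actual API; the point is that if the element type is resolved incorrectly at this step, the dumped .bin is garbage:

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

enum class ElemType { F32, F16 };

struct ConstantBuffer
{
    ElemType type;
    std::vector<char> bytes; // written verbatim to Constant_x_x.bin at code gen
};

// Copy an ONNX initializer's raw bytes into a constant buffer. A wrong
// ElemType here silently corrupts every downstream consumer of the data.
ConstantBuffer from_initializer(ElemType t, const void* raw, std::size_t byte_len)
{
    const std::size_t elem_size = (t == ElemType::F16) ? 2 : 4;
    if (byte_len % elem_size != 0)
        throw std::runtime_error("initializer size does not match dtype");
    ConstantBuffer c{t, std::vector<char>(byte_len)};
    std::memcpy(c.bytes.data(), raw, byte_len);
    return c;
}
```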

LeiWang1999 commented 2 years ago

Thanks @mzmssg, I found the problem. https://github.com/microsoft/nnfusion/blob/c3cd7155303e3af2f56c9f9bf0778656c372bc14/src/nnfusion/frontend/onnx_import/util/util.cpp#L98

It treated fp16 data as fp32; this is now fixed by PR #443.
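
For context, the zeros and tiny subnormals (4.591775e-41) in the broken dump are the classic signature of fp16 bytes being reinterpreted as fp32. A minimal repro of the failure mode (illustrative, not the actual util.cpp code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // Two adjacent fp16 bit patterns: -0.0 and +0.0.
    const uint16_t halves[2] = {0x8000, 0x0000};
    float wrong;
    // Read the four fp16 bytes as a single fp32, as the buggy path did.
    std::memcpy(&wrong, halves, sizeof(wrong));
    // On a little-endian machine this prints 4.591775e-41, the exact
    // subnormal value seen in the broken Constant_0_0.bin dump above.
    std::printf("%e\n", wrong);
    return 0;
}
```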

resnet50-fp16.onnx now passes and its output is correct. But in my view there is a better improvement: define a dedicated set of NNFusion data types, for example:

::nnfusion::datatype::FLOAT_64
::nnfusion::datatype::FLOAT_32
::nnfusion::datatype::FLOAT_16

Currently they map directly to double, float, and half_float::half, which I don't think is programming-friendly.
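
Something along these lines, as a hypothetical sketch (not existing NNFusion code): one enum tags the element type, and a traits template maps each tag back to the concrete C++ storage type used today.

```cpp
#include <cstddef>
#include <half.hpp>

namespace nnfusion { namespace datatype {

enum class Tag { FLOAT_64, FLOAT_32, FLOAT_16 };

// Map each tag to its concrete storage type.
template <Tag> struct storage;
template <> struct storage<Tag::FLOAT_64> { using type = double; };
template <> struct storage<Tag::FLOAT_32> { using type = float; };
template <> struct storage<Tag::FLOAT_16> { using type = half_float::half; };

// Element size in bytes, usable wherever raw buffers are sized.
constexpr std::size_t size_of(Tag t)
{
    return t == Tag::FLOAT_64 ? 8 : t == Tag::FLOAT_32 ? 4 : 2;
}

}} // namespace nnfusion::datatype
```

With a single storage/size_of mapping, a dtype mismatch like the fp16-as-fp32 one would have to pass through one audited choke point instead of being repeated at every conversion site.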