Investigating ShuffleNet accuracy issue with MLIR

gyulaz-htec commented 5 months ago

The problem is described in this issue: https://github.com/ROCm/AMDMIGraphX/issues/2692

For the investigation, use the following branch: https://github.com/gyulaz-htec/models/tree/shufflenet_bisect

Steps to produce verify logs for the first failing instruction in the model graph:

# 1. Pull the model:
git lfs pull --include="vision/classification/shufflenet/model/shufflenet-v2-12.onnx" --exclude=""

# 2. Run the bisecting script:
python3 model_bisect.py --good 720 --bad 638
# This will output the failing line and the command to get the verify log with the first failing instruction

# 3. Run the command printed by the previous step and save its output to a file.
MIGRAPHX_TRACE_EVAL=2 /code/AMDMIGraphX/build/bin/migraphx-driver verify vision/classification/shufflenet/model/shufflenet-v2-12.onnx --trim 689 > failing.log

# 4. Run the same command for the previous instruction (`--trim 690`) and save its output as well:
MIGRAPHX_TRACE_EVAL=2 /code/AMDMIGraphX/build/bin/migraphx-driver verify vision/classification/shufflenet/model/shufflenet-v2-12.onnx --trim 690 > passing.log

# 5. Compare the two logs manually
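
As a starting point for the manual comparison in step 5, a unified diff of the two logs can narrow down where they diverge. The sketch below is not part of the bisect branch; it assumes the file names `passing.log` and `failing.log` from steps 3 and 4, and the helper names `diff_logs` and `diff_log_files` are hypothetical. Because the two runs trim the graph at different points, some differences are expected, so the diff only shows where to look.

```python
import difflib
from pathlib import Path


def diff_logs(passing_text, failing_text, context=3):
    """Return the unified-diff lines between two log texts."""
    return list(
        difflib.unified_diff(
            passing_text.splitlines(),
            failing_text.splitlines(),
            fromfile="passing.log",
            tofile="failing.log",
            n=context,      # number of unchanged context lines around each change
            lineterm="",    # input lines carry no newline, so headers should not either
        )
    )


def diff_log_files(passing_path, failing_path, context=3):
    """Read two log files (paths are assumptions) and print their unified diff."""
    passing = Path(passing_path).read_text(errors="replace")
    failing = Path(failing_path).read_text(errors="replace")
    for line in diff_logs(passing, failing, context):
        print(line)
```

Calling `diff_log_files("passing.log", "failing.log")` from the directory holding the logs prints lines prefixed with `-` (only in the passing log) and `+` (only in the failing log).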

attila-dusnoki-htec commented 5 months ago

When running migraphx-driver verify --reduce shufflenet-v2-12.onnx, the last failure occurs at the 716th iteration.

Log: shufflenet_verify_716.log

Not exactly where the above comment pinpointed, but close.

attila-dusnoki-htec commented 5 months ago

Updated the original issue with the details.

gyulaz-htec commented 4 months ago

Fixed by https://github.com/ROCm/rocMLIR/pull/1403
