pku-liang / FlexTensor

Automatic Schedule Exploration and Optimization Framework for Tensor Computations
MIT License

Some problems while running on GPU #17

Open · onlyoh opened 4 years ago

onlyoh commented 4 years ago

I want to test the performance of yolo layer C9 after FlexTensor's optimization, but there seem to be some problems when running optimize_conv2d.py on GPU:

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --target cuda
......
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394505.223908] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394508.009939] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394510.781969] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
Fail to find valid schedule, too many errors
warm up [1599394513.576313] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394516.424372] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
......

I have seen a previous issue, and the current code already uses 'spawn' for multiprocessing. It seems the run never stops because it can't find a suitable schedule.

KnowingNothing commented 4 years ago

Please check your nvcc by typing nvcc --version in your terminal. If nvcc is not available, TVM's codegen will fail.
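
TVM's CUDA backend shells out to nvcc to compile the generated kernels, so every build fails when it is missing. A quick way to verify this from the same Python environment is the stdlib-only sketch below (illustrative, not part of FlexTensor):

import shutil
import subprocess

# nvcc must be on the PATH that this Python process sees
nvcc = shutil.which("nvcc")
if nvcc is None:
    print("nvcc not found on PATH; TVM's CUDA codegen will fail")
else:
    # print the same banner as `nvcc --version`
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)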

onlyoh commented 4 years ago

The result of nvcc --version is:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

And executing print(torch.version.cuda) in the Python interpreter outputs:

10.0.130

KnowingNothing commented 4 years ago

How about setting a larger timeout? Just try adding --timeout 20 to your command.
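
For example, keeping the same flags as before:

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --timeout 20 --target cuda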

onlyoh commented 4 years ago

This method does not seem to work...

KnowingNothing commented 4 years ago

Then I'd suggest uncommenting the two #print(msg) lines in scheduler.py and telling me the error message, if any.
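
The build workers swallow exceptions and only record the message in msg, so the commented-out print is the one place where the real error surfaces. The pattern is roughly the sketch below (illustrative names, not FlexTensor's exact source):

import tvm

def try_build(sched, bufs, target):
    try:
        func = tvm.build(sched, bufs, target)  # raises if codegen fails
        return func, None
    except Exception as e:
        msg = "op build fail:" + str(e)        # prefix as printed by the scheduler
        # print(msg)  # <- uncomment to surface the hidden error
        return None, msg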

onlyoh commented 4 years ago

It outputs these messages:

Optimize yolo convolution layer 9 shape (1, 512, 28, 28, 512, 512, 1, 1, 1, 1, 0, 1, 1)
graph space size 2
op 0 space size: 25344000
[Warning] Directory lib is not empty, but reusing it
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
warm up [1599727687.361282] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
......

KnowingNothing commented 4 years ago

I see. TVM is under rapid development and the API keeps changing; the tvm.tir.ir_pass namespace that FlexTensor calls was removed in newer releases. To use FlexTensor, you can try TVM at commit 89da63e228eae2b0b4fe39770031a042858c52a7.
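
For example, assuming you build TVM from source (the checkout path is illustrative):

$ cd tvm    # your local TVM source tree
$ git checkout 89da63e228eae2b0b4fe39770031a042858c52a7
$ git submodule update --init --recursive

then rebuild TVM and reinstall its Python package as usual.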

onlyoh commented 4 years ago

Thanks, I will try it!

hecmay commented 4 years ago

A follow-up to this issue: I got the following errors when running the same example. I am using TVM v0.7 (not exactly the commit you recommended). What could be the reason for these empty error messages?

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --target cuda
Optimize yolo convolution layer 9 shape (1, 512, 28, 28, 512, 512, 1, 1, 1, 1, 0, 1, 1)
graph space size 2
op 0 space size: 25344000
[Warning] Directory lib is not empty, but reusing it
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
warm up [1600224234.227920] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]

KnowingNothing commented 4 years ago

Did you check your nvcc?