Cuda 9.2 PTX issue - Githubissues

aryamazaheri commented 5 years ago

I tried to run boda using the CUDA backend and apparently something has broken, maybe after migrating to the latest version of CUDA(?). Here is the error message:

boda cnn_op_info --op-tune=(use_be=nvrtc,k1conv=1,tconv=1) --rtc=(be=nvrtc,gen_src=1)
error: cuModuleLoadDataEx() failed with ret=CUDA_ERROR_INVALID_PTX (a PTX JIT compilation failed)

I am wondering why cuda is generating an invalid ptx output that cannot be run using boda. Do you know the reason?

aryamazaheri commented 5 years ago

I also manually compiled the generated CUCL code to PTX using the nvcc --ptx command, and it is surprising to me that the PTX file generated by boda is different from the one generated by nvcc.
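When two PTX listings differ, the header directives are a cheap first thing to compare: every PTX module starts with a `.version` line (the PTX ISA version) and a `.target` line (the architecture it was generated for). The following is a minimal sketch of pulling those two values out of a PTX string so the nvcc output and the nvrtc output can be compared side by side. The names `ptx_header_t` and `read_ptx_header` are hypothetical and not part of boda.

```cpp
#include <sstream>
#include <string>

// Header directives pulled from a PTX listing. An empty field means "not found".
struct ptx_header_t {
  std::string version; // e.g. "6.2" from ".version 6.2"
  std::string target;  // e.g. "sm_60" from ".target sm_60"
};

// Scan PTX text line by line for the first .version and .target directives.
// Hypothetical helper for illustration only; not part of boda.
inline ptx_header_t read_ptx_header( std::string const & ptx ) {
  ptx_header_t ret;
  std::istringstream lines( ptx );
  std::string line;
  while( std::getline( lines, line ) ) {
    std::istringstream toks( line );
    std::string key, val;
    if( !( toks >> key >> val ) ) { continue; } // blank or single-token line
    // .target may carry extra comma-separated options (e.g. "sm_60, debug");
    // keep only the part before the first comma.
    val = val.substr( 0, val.find( ',' ) );
    if( key == ".version" && ret.version.empty() ) { ret.version = val; }
    else if( key == ".target" && ret.target.empty() ) { ret.target = val; }
  }
  return ret;
}
```

If the two files report different `.target` or `.version` values, that points at the compile options or the toolkit version rather than at the kernel source itself.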

moskewcz commented 5 years ago

hmm, i'm not sure offhand. certainly, there can be issues with making sure you're using valid compilation/arch settings for whatever card you are using, and some of that might be hard-coded in a way that could break on an upgrade. but, more to the point, when errors like this happen, i think you'll generally need to investigate the actual CUDA and/or PTX code and see what's up. also, it'd be interesting to know if all cases fail (i.e. the simplest unit tests that exercise be=nvrtc, like test_rtc_nvrtc), or only complex ones, or what.

to say more i guess i'll need a test case i can replicate and/or at least i'll need to see more details for a failing example.

FWIW, i'm using cuda 9.1, and at least basic RTC stuff still seems to work fine on boda HEAD, although i'm not currently using/testing it day-to-day. i might be able to install/test cuda 9.2 soon. but, it might be more a hardware-related issue, so i'd be curious about the compute capabilities/etc of the card you're using and/or if that changed recently.

after that, inspecting the compile options and such in nvrtc_util.cc might be a good place to look -- maybe the hard-codes just aren't good for cuda 9.2 + your-card-arch.

anyway, as usual, testing with a clean slate is good: clear your compute/NV cache, make sure you're not mixing parts of different CUDA versions, etc ...

moskewcz@mazda5:~/git-work/boda/run/tr1$ time boda test_cmds --verbose=999 --filt='.*rtc.*' ; date
(test_name=test_rtc_nvrtc,needs=nvrtc,command=(mode=rtc_test,boda_output_dir=%(test_name),prog_fn=%(boda_test_dir)/nvrtc_test_dot.cu),cli_str=boda rtc_test --rtc='(be=nvrtc)' --prog-fn='%(boda_test_dir)/nvrtc_test_dot.cu)
(test_name=test_rtc_ocl,command=(mode=rtc_test,boda_output_dir=%(test_name),prog_fn=%(boda_test_dir)/ocl_test_dot.cl,rtc=(be=ocl)),cli_str=boda rtc_test --rtc='(be=ocl)' --prog-fn='%(boda_test_dir)/ocl_test_dot.cl')
(test_name=test_rtc_cucl_nvrtc,needs=nvrtc,command=(mode=rtc_test,boda_output_dir=%(test_name)),cli_str=boda rtc_test --rtc='(be=nvrtc)' )
(test_name=test_rtc_cucl_ocl,command=(mode=rtc_test,boda_output_dir=%(test_name),rtc=(be=ocl)),cli_str=boda rtc_test --rtc='(be=ocl)' )
(test_name=test_rtc_cucl_ocl_struct,command=(mode=rtc_test,boda_output_dir=%(test_name),rtc=(be=ocl),func_name=my_dot_struct),cli_str=boda rtc_test --rtc='(be=ocl)' --func-name=my_dot_struct )
(test_name=test_rtc_cucl_ipc,command=(mode=rtc_test,boda_output_dir=%(test_name),rtc=(be=ipc)),cli_str=boda rtc_test --rtc='(be=ipc)' )
(test_name=test_rtc_cucl_ipc_tcp,command=(mode=rtc_test,boda_output_dir=%(test_name),rtc=(be=ipc,boda_parent_addr=tcp:127.0.0.1:12791)),cli_str=boda rtc_test --rtc='(be=ipc,boda_parent_addr=tcp:127.0.0.1:12791)' )
(test_name=test_dense_boda_rtc_1,command=(mode=test_dense,boda_output_dir=%(test_name),imgs=(mode=test_dense,boda_output_dir=%(test_name)),run_cnet=(mode=test_dense,boda_output_dir=%(test_name),in_dims=(img=1)),run_cnet_dense=(mode=test_dense,boda_output_dir=%(test_name),in_dims=(img=1)),wins_per_image=10000,mrd_toler=5e-05),cli_str=boda test_dense --model-name=nin_imagenet_nopad --wins_per_image=10000 --in_dims='(img=1)' --conv_fwd='(mode=rtc)' --run_cnet='()' --run_cnet_dense='()')
(test_name=test_dense_boda_rtc_2,command=(mode=test_dense,boda_output_dir=%(test_name),imgs=(mode=test_dense,boda_output_dir=%(test_name)),run_cnet=(mode=test_dense,boda_output_dir=%(test_name),in_dims=(img=1,x=227,y=227),out_node_name=cccp8),run_cnet_dense=(mode=test_dense,boda_output_dir=%(test_name),in_dims=(img=1,x=227,y=227),out_node_name=cccp8),wins_per_image=10000,mrd_toler=5e-05),cli_str=boda test_dense --model-name=nin_imagenet --wins_per_image=10000 --in_dims='(img=1,y=227,x=227)' --out_node_name=cccp8 --conv_fwd='(mode=rtc)' --run_cnet='()' --run_cnet_dense='()')
(test_name=test_upsamp_1_nvrtc,command=(mode=test_upsamp,boda_output_dir=%(test_name),imgs=(mode=test_upsamp,boda_output_dir=%(test_name)),run_cnet=(mode=test_upsamp,boda_output_dir=%(test_name),in_dims=(img=1,x=516,y=516),out_node_name=cccp8,enable_upsamp_net=1,conv_fwd_upsamp=(mode=rtc,op_tune=(tconv=1))),wins_per_image=3,mrd_toler=0.0002),cli_str=boda test_upsamp --model-name nin_imagenet_nopad --wins-per-image=3 --run-cnet='(in_dims=(img=1,y=516,x=516),enable_upsamp_net=1,out_node_name=cccp8,conv_fwd=(mode=rtc),conv_fwd_upsamp=(mode=rtc,op_tune=(tconv=1)))')
WARNING: skipped some tests due to missing features: num_skipped=8 missing_needed_features=octave
TIMERS:  CNT     TOT_DUR      AVG_DUR    TAG  
          10      24.499s       2.449s    test_cmds_cmd
           8       8.588s       1.073s    nvrtc_compile
       18864    253.126ms      0.013ms    cu_launch_and_sync
          10     54.664ms      5.466ms    diff_command
           3     71.288ms     23.762ms    ocl_compile
           3     28.756ms      9.585ms    read_pascal_image_list_file
         163     43.270ms      0.265ms    caffe_copy_layer_blob_data
         668     58.473ms      0.087ms    img_copy_to
         668       1.055s      1.579ms    subtract_mean_and_copy_img_to_batch
          20       2.742s    137.111ms    dense_cnn
         668    172.312ms      0.257ms    conv_pipe_fwd_t::set_vars
         668      13.885s     20.787ms    conv_pipe_fwd_t::run_fwd
         668    115.286ms      0.172ms    conv_pipe_fwd_t::get_vars
         588       7.039s     11.971ms    sparse_cnn
          30       2.837s     94.570ms    net_upsamp_cnn
          30    195.558ms      6.518ms    upsample_2x
          30       1.661s     55.391ms    img_upsamp_cnn

real    0m24.815s
user    0m39.660s
sys     0m3.856s
Thu Jul  5 10:21:35 PDT 2018
moskewcz@mazda5:~/git-work/boda/run/tr1$

moskewcz commented 5 years ago

in particular, you might play with the setting(s) for cc_opts_arch and/or the other options set where cc_opts_arch is used. it's been a bit brittle in the past to find a good default for this, and i'm semi-convinced that nvrtc doesn't really handle all the arch options right -- or at least it didn't! maybe these need to be cuda-version dependent now if nvidia 'fixed' something in the nvrtc options handling ...

string cc_opts_arch; //NESI(default="--gpu-architecture=compute_60",help="this entire string will be passed (unchanged) to the nvrtc compiler phase as an option")

moskewcz commented 5 years ago

oh, and as per the comment, although this is an 'option', there's currently no good way to globally set/configure it ... if need be some system for that (env vars? boda config file entry? etc?) could be introduced.
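One of the options floated here, an environment variable, could look roughly like the sketch below. The variable name `BODA_CC_OPTS_ARCH` and the function `get_cc_opts_arch` are assumptions made up for illustration; boda has no such mechanism according to this thread. The fallback value is the default from the `cc_opts_arch` declaration quoted above.

```cpp
#include <cstdlib>
#include <string>

// Default copied from the cc_opts_arch declaration quoted in this thread.
static char const * const default_cc_opts_arch = "--gpu-architecture=compute_60";

// Return the arch option string to hand to nvrtc: the value of the
// (hypothetical) BODA_CC_OPTS_ARCH environment variable if it is set and
// non-empty, otherwise the compiled-in default.
inline std::string get_cc_opts_arch( void ) {
  char const * const env = std::getenv( "BODA_CC_OPTS_ARCH" );
  if( env && env[0] ) { return std::string( env ); }
  return std::string( default_cc_opts_arch );
}
```

With something like this in place, a run could be pointed at a different architecture with `BODA_CC_OPTS_ARCH=--gpu-architecture=compute_70 boda ...` without rebuilding. A config-file entry would work along the same lines, with the lookup swapped out.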

aryamazaheri commented 5 years ago

You are right. I had to change the arch parameter to get it working. I also realized that maaya's configuration/GPUs have been changed. It would be nice to be able to select the arch parameter (cc_opts_arch) based on the given GPU, in the future.

moskewcz commented 5 years ago

yeah, either a better way to set the arch and/or an 'auto' setting would be good. i guess it could be as simple as using some cuda driver APIs to get the compute capability for the current device -- you're welcome to open an issue to track this!
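A rough sketch of what such an 'auto' setting might look like follows. The driver calls used (`cuInit`, `cuDeviceGet`, `cuDeviceGetAttribute` with `CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR` / `_MINOR`) are real CUDA driver API functions; everything else (the function names and the `BODA_SKETCH_USE_CUDA` macro that guards the driver-dependent part) is hypothetical and only there so the string-building half compiles without the CUDA headers.

```cpp
#include <string>

// Build the nvrtc arch option for a given compute capability,
// e.g. (6, 1) -> "--gpu-architecture=compute_61".
inline std::string arch_opt_from_cc( int const major, int const minor ) {
  return "--gpu-architecture=compute_" + std::to_string( major ) + std::to_string( minor );
}

// Hypothetical macro: define it only when building against the CUDA driver API
// (and linking with -lcuda).
#ifdef BODA_SKETCH_USE_CUDA
#include <cuda.h>
#include <stdexcept>

// Query the compute capability of device `ordinal` through the driver API
// and turn it into an arch option string.
inline std::string auto_cc_opts_arch( int const ordinal ) {
  CUdevice dev;
  int major = 0, minor = 0;
  if( cuInit( 0 ) != CUDA_SUCCESS ||
      cuDeviceGet( &dev, ordinal ) != CUDA_SUCCESS ||
      cuDeviceGetAttribute( &major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev ) != CUDA_SUCCESS ||
      cuDeviceGetAttribute( &minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev ) != CUDA_SUCCESS ) {
    throw std::runtime_error( "could not query compute capability" );
  }
  return arch_opt_from_cc( major, minor );
}
#endif
```

One caveat with this approach: the installed nvrtc has to recognize the resulting `compute_XY` value, so a card newer than the toolkit may still need a manual override.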

moskewcz / boda

Cuda 9.2 PTX issue #29