jiazhihao / TASO

The Tensor Algebra SuperOptimizer for Deep Learning
Apache License 2.0
687 stars, 90 forks

Some problem happened on different experiment platforms #33

Open HanskAlan opened 4 years ago

HanskAlan commented 4 years ago

Thanks for the great work. Recently I've been trying to run the experiments described in your docs. Everything works fine except the experiments that measure inference latency for MetaFlow and TASO. They fail when I run them on our 2080 Ti platform, with output like this:

python3 nasrnn.py

Cuda failure: 2
/home/edge/hanskalan/sosp19ae/src/cudnn/ops_cudnn.cu:51
Aborting...

But when I run the same experiment on our Tesla P100 platform with the same configuration (at least, I have exported the same environment variables in ~/.bashrc and /etc/profile), I can successfully execute `python3 examples/model.py` and get the expected result. However, when I enter the examples directory and execute `python3 model.py`, I get the following failure message:

python: /home/user/hanskalan/sosp19ae/src/core/substitution.cc:312: static void XFlow::GraphXfer::load_graph_xfer_from_pb_file(XFlow::Model*, std::vector<XFlow::GraphXfer*>&, std::__cxx11::string): Assertion `collection.ParseFromIstream(&input)' failed.
Aborted (core dumped)

By the way, I've also tried executing `python3 examples/model.py` on the 2080 Ti platform, but it fails with the same error message as before.

Is there anything wrong with how I am reproducing the experiment? Thank you.

jiazhihao commented 4 years ago

The `Cuda failure: 2` error is an out-of-memory error. Currently, TASO reserves around 6 GB of GPU memory for the cost model, which may cause OOM failures on some GPUs. You can reduce the reserved memory by changing L53 and L56 at https://github.com/jiazhihao/TASO/blob/master/include/taso/ops.h#L53-L56.
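Since the reservation has to fit into whatever memory is free when the script starts, it can help to check free rather than total memory on each GPU. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are hypothetical, and the 6 GB threshold is only the approximate figure quoted above:

```python
import subprocess

# Assumption: roughly 6 GB, the approximate reserve mentioned in this thread.
RESERVED_MIB = 6 * 1024


def parse_free_memory(csv_text):
    """Parse 'total, free' rows (MiB, one row per GPU) into (total, free) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, free = (int(field.strip()) for field in line.split(","))
        rows.append((total, free))
    return rows


def query_gpus():
    """Ask nvidia-smi for total and free memory of every GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_free_memory(result.stdout)


def fits_reserve(free_mib, reserved_mib=RESERVED_MIB):
    """True if the currently free memory covers the assumed reserve."""
    return free_mib >= reserved_mib
```

Calling `query_gpus()` on each machine and passing the free value to `fits_reserve` shows whether another process is already holding memory on a card whose total looks large enough.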

For the assertion failure, can you please check that you have set the TASO_HOME environment variable and that you can find graph_subst.pb under TASO_HOME?
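This check can be scripted so that it runs in the same process environment as the failing command. The sketch below is only an illustration: the function name is hypothetical, and because the exact location of graph_subst.pb under TASO_HOME is not stated here, it searches the whole tree.

```python
import os


def find_graph_subst(env=None, filename="graph_subst.pb"):
    """Return (path, problem); exactly one of the two is None."""
    env = os.environ if env is None else env
    home = env.get("TASO_HOME")
    if not home:
        return None, "TASO_HOME is not set in this process"
    if not os.path.isdir(home):
        return None, f"TASO_HOME={home!r} is not a directory"
    # Walk the tree instead of assuming a fixed subdirectory.
    for root, _dirs, files in os.walk(home):
        if filename in files:
            path = os.path.join(root, filename)
            if os.path.getsize(path) == 0:
                return None, f"{path} is empty"
            return path, None
    return None, f"{filename} not found under {home!r}"
```

Running `print(find_graph_subst())` once from the repository root and once from inside the examples directory shows whether both invocations see the same TASO_HOME and the same file.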

HanskAlan commented 4 years ago

Thank you for your patient answer, but I still have some questions. When I followed the instructions described here [image], after building the XFlow runtime I found that there is no directory named python under xflow. I did find an `__init__.py` in the xflow directory, so I set the parent directory of xflow as `$PYTHONPATH`. I think this should be correct, since I can reproduce the experiment on our Tesla P100 platform, but on our 2080 Ti platform it failed. Both platforms have over 10 GB of GPU memory, so I don't think the memory size is the problem. The only difference I can think of is that I use ~/.zsh instead of ~/.bashrc on the 2080 Ti platform (the one that fails), but I don't think that is the cause of my problem.

jiazhihao commented 4 years ago

It seems you are using the old instructions for installing XFlow. You can find the up-to-date install instructions at https://github.com/jiazhihao/TASO/blob/master/INSTALL.md. Note that you no longer need to export PYTHONPATH in the new instructions.

Let me know if you still see the same errors after installing TASO w/ the new instructions.
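To see which installation Python actually picks up after reinstalling, one option is to ask the import system where a module would be loaded from. This is a rough sketch with a hypothetical helper name; it assumes the installed Python module is called `taso`, which is inferred from the C++ namespace in the logs rather than stated in this thread.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

For example, `print(module_origin("taso"))` returning a path inside an old xflow checkout would suggest that a leftover PYTHONPATH entry is still shadowing the new install.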

HanskAlan commented 4 years ago

Thank you for your help. I'm now able to run the experiment I mentioned before, but I have several new problems:

1. When I try to measure the performance of nasrnn on our P100 platform, it fails in the cost-based search phase with a message like this:

python3: /home/user/hanskalan/taso/src/core/ops.cc:516: taso::Graph* taso::Graph::preprocess_weights(): Assertion `it->srcOp.ptr->outputs[it->srcIdx].data_ptr != NULL' failed.
Aborted (core dumped)

It works fine on our 2080 Ti platform, which gives 4692 candidates, but it fails on our P100 platform at about 8000 candidates. This is really confusing to me.

2. I tried to measure the performance of TensorFlow and TensorFlow-XLA on our 2080 Ti and P100 platforms, and I succeeded when using the old version of TASO, which is XFlow. But when using TASO, for some reason a lot of the demos fail. Here are the results:

image

Are there any differences between the code in TASO and XFlow?