Open HanskAlan opened 4 years ago
The CUDA error is an out-of-memory error. Currently, TASO reserves around 6 GB of GPU memory for the cost model, which may cause OOM failures on some GPUs. You can reduce the reserved memory by changing L53 and L56 at https://github.com/jiazhihao/TASO/blob/master/include/taso/ops.h#L53-L56.
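To see whether a GPU actually has that much memory free before TASO starts, something like the following may help. It is only a rough sketch: it assumes nvidia-smi is on the PATH, the 6 GB threshold is simply the approximate figure mentioned above (not a value read from TASO), and the function names are made up for illustration.

```python
import subprocess

# Roughly the 6 GB mentioned above, in MiB. An assumption, not read from TASO.
RESERVED_MIB = 6 * 1024


def parse_memory_csv(text):
    """Parse 'total, free' rows (MiB) printed by nvidia-smi in csv,noheader,nounits mode."""
    rows = []
    for line in text.strip().splitlines():
        total, free = (int(field) for field in line.split(","))
        rows.append((total, free))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for total and free memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.total,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def report():
    """Print one line per GPU, flagging GPUs with less free memory than the threshold."""
    for index, (total, free) in enumerate(query_gpu_memory()):
        status = "ok" if free >= RESERVED_MIB else "may be too little free memory"
        print(f"GPU {index}: {free} MiB free of {total} MiB ({status})")
```

Calling report() on each machine shows the free memory rather than the total, which is what matters if other processes are already using the GPU.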
For the assertion failure, can you please check that you have successfully set the TASO_HOME environment variable, and that you can find graph_subst.pb under TASO_HOME?
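The check described above can be scripted. This is a minimal sketch, assuming only what is stated here (a TASO_HOME environment variable and a graph_subst.pb file directly under it); the helper name check_taso_home is hypothetical and not part of TASO.

```python
import os


def check_taso_home(env=None):
    """Return a short message describing whether TASO_HOME and graph_subst.pb look usable."""
    # Use the real environment unless a mapping is passed in (handy for testing).
    env = os.environ if env is None else env
    home = env.get("TASO_HOME")
    if not home:
        return "TASO_HOME is not set"
    path = os.path.join(home, "graph_subst.pb")
    if not os.path.isfile(path):
        return f"graph_subst.pb not found under {home}"
    return f"found {path}"
```

Running print(check_taso_home()) from the same shell that launches python3 also shows whether the variable is actually exported in that shell, which is worth checking when the shell startup files differ between machines.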
Thank you for your patient answer, but I still have some questions.
When I followed the instructions described here, after building the XFlow runtime I found that there is no directory named python under xflow. I did find an __init__.py in the xflow directory, so I set the parent directory of xflow as $PYTHONPATH. I think this should be correct, since I can reproduce the experiment on our Tesla P100 platform, but it fails on our 2080 Ti platform. Both platforms have more than 10 GB of GPU memory, so I don't think the memory size is the problem. The only difference I can think of is that I use ~/.zsh instead of ~/.bashrc on the 2080 Ti platform (the one that fails), but I don't think that is the cause of my problem.
It seems you are using the old instructions for installing XFlow. You can find the up-to-date install instructions at https://github.com/jiazhihao/TASO/blob/master/INSTALL.md. Note that you no longer need to export PYTHONPATH in the new instructions.
Let me know if you still see the same errors after installing TASO with the new instructions.
Thank you for your help. I am now able to run the experiment I mentioned before, but I have run into several new problems:
1. When I try to measure the performance of nasrnn on our P100 platform, it fails during the cost-based search phase with the following message:
python3: /home/user/hanskalan/taso/src/core/ops.cc:516: taso::Graph* taso::Graph::preprocess_weights(): Assertion `it->srcOp.ptr->outputs[it->srcIdx].data_ptr != NULL' failed.
Aborted (core dumped)
However, it works fine on our 2080 Ti platform, which generates 4692 candidates, while on our P100 platform it fails after generating about 8000 candidates. This is really confusing to me.
I also wonder whether there are any differences between the code in TASO and XFlow?
Thanks for the great work. Recently I have been trying to run the experiments described in your docs. Everything works fine except the experiment comparing inference latency between MetaFlow and TASO. It fails on our 2080 Ti platform with output like this:

When I run the same experiment on our Tesla P100 platform with the same configuration (at least, I have exported the same environment variables in ~/.bashrc and /etc/profile), I can successfully execute the command python3 examples/model.py and get the expected result. However, when I enter the examples directory and execute python3 model.py, I get the following failure message:

By the way, I have also tried executing python3 examples/model.py on the 2080 Ti platform, but it fails with the same error message as before. Is there anything wrong with how I am reproducing the experiment? Thank you.