tensor-compiler / array-programming-benchmarks


Double-check function invocation overhead for NumPy and PyData/Sparse #19

Open · stephenchouca opened this issue 3 years ago

stephenchouca commented 3 years ago

At some point (if feasible), it might be a good idea to measure how much overhead NumPy and PyData/Sparse incur when functions are invoked. That way, we can verify that any observed difference between NumPy and TACO is not just due to the overhead of interpreting Python functions or something trivial like that. This might require modifying NumPy's and PyData/Sparse's source code to insert some timers inside the functions we are benchmarking.
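
One way to approximate this without modifying either library's source is to time a ufunc call on a tiny input, where the fixed cost of argument handling and dispatch should dominate over the compiled loop. A minimal sketch (the function and sizes here are just illustrative):

```python
import timeit
import numpy as np

# Rough external estimate: for a 1-element input, nearly all of the
# call time is fixed overhead (argument parsing, dispatch) rather
# than the compiled loop itself.
tiny = np.zeros(1)
big = np.zeros(10_000_000)

n = 100_000
overhead = timeit.timeit(lambda: np.add(tiny, tiny), number=n) / n
bulk = timeit.timeit(lambda: np.add(big, big), number=10) / 10
print(f"approx fixed overhead per call: {overhead:.3e} s")
print(f"time for a 1e7-element add:     {bulk:.3e} s")
```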

rohany commented 3 years ago

As per the NumPy documentation:

In NumPy, universal functions are instances of the numpy.ufunc class. Many of the built-in functions are implemented in compiled C code. The basic ufuncs operate on scalars, but there is also a generalized kind for which the basic elements are sub-arrays (vectors, matrices, etc.), and broadcasting is done over other dimensions. One can also produce custom ufunc instances using the frompyfunc factory function.
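
The frompyfunc path is the one that does pay per-element interpretation cost, so it makes a useful baseline for what "interpreted" overhead looks like. A quick sketch contrasting the two (the size here is just illustrative):

```python
import timeit
import numpy as np

# Built-in np.add runs a compiled C loop; a frompyfunc ufunc calls
# back into the interpreter for every element (and returns object dtype).
py_add = np.frompyfunc(lambda a, b: a + b, 2, 1)
x = np.arange(100_000)

t_c = timeit.timeit(lambda: np.add(x, x), number=100) / 100
t_py = timeit.timeit(lambda: py_add(x, x), number=100) / 100
print(f"compiled ufunc: {t_c:.3e} s, frompyfunc ufunc: {t_py:.3e} s")
```
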
stephenchouca commented 3 years ago

When we benchmarked sparse tensor addition in TensorFlow (which also implements its kernels in compiled code) for a previous paper, we found that only about 10% of the runtime was actually spent on the addition; the rest was split between validating the inputs and dispatching the compiled code. NumPy and PyData/Sparse might not have similar issues, but that's something we should double-check at some point.
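
One way to estimate that breakdown externally is to time the operation over a range of input sizes and fit time ≈ fixed + slope · nnz; the intercept approximates the per-call validation/dispatch cost. A sketch for PyData/Sparse (sizes and density are just illustrative):

```python
import timeit
import numpy as np
import sparse  # PyData/Sparse

# Fit time(nnz) ~= fixed + slope * nnz; the intercept estimates the
# per-call cost that does not scale with the input.
sizes = [10_000, 100_000, 1_000_000]
times, nnzs = [], []
for n in sizes:
    x = sparse.random((n,), density=0.1, random_state=0)
    y = sparse.random((n,), density=0.1, random_state=1)
    x + y  # warm up any lazily compiled kernels (e.g. numba)
    t = timeit.timeit(lambda: x + y, number=20) / 20
    times.append(t)
    nnzs.append(x.nnz + y.nnz)

slope, fixed = np.polyfit(nnzs, times, 1)
print(f"estimated fixed cost per call: {fixed:.3e} s")
```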

rohany commented 3 years ago

I see. We should definitely check then.

weiya711 commented 3 years ago

I was talking to Fred about this on Thursday, and we both agreed that we definitely need to check this. It will be useful for answering questions like "Why is TACO faster than NumPy/PyData?" and "What causes the speedup, and how does each factor contribute to the speedup/slowdown of the systems?" We believe that since PyData/Sparse still relies on dense NumPy calls, it spends most of its time breaking the tensor down into dense vectors for each iteration space. That code is (most likely) implemented in Python, so we need to pinpoint exactly what PyData/Sparse is doing: how much of the time is Python interpretation, how much is the splitting/re-merging algorithm, and how much is the NumPy function call itself.
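
A first cut at that breakdown could come from a deterministic profiler before modifying any source; cumulative times should show how much lands in PyData/Sparse's Python machinery versus calls into NumPy/numba. A minimal sketch (shapes and density are illustrative):

```python
import cProfile
import pstats
import sparse  # PyData/Sparse

x = sparse.random((10_000, 10_000), density=0.001, random_state=0)
y = sparse.random((10_000, 10_000), density=0.001, random_state=1)
x + y  # warm up so one-time numba compilation doesn't dominate the profile

# Profile a single elementwise op and list the hottest frames by
# cumulative time.
with cProfile.Profile() as prof:
    x + y
pstats.Stats(prof).sort_stats("cumulative").print_stats(15)
```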

rohany commented 3 years ago

I looked into this a little bit. Here is the flamegraph for running logical_xor (rename the file to .svg and then view it in your browser):

ufunc_stats_fg.txt

It doesn't look like there is any significant interpretation overhead. pydata/sparse spends most of its time in a list comprehension, followed by JIT compilation and calls into generated code via numba. I'm not very familiar with how pydata/sparse actually implements its sparse ufunc operations, so I can't map these frames onto the algorithm.
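
One caveat from that flamegraph: some of the time is one-shot numba JIT compilation, so any wall-clock comparison against TACO should warm the operation up first. A sketch of what I mean (sizes and density are illustrative):

```python
import timeit
import numpy as np
import sparse  # PyData/Sparse

# A warm-up run triggers numba compilation once, so it doesn't get
# counted against pydata/sparse in the measured timings.
x = sparse.random((1_000_000,), density=0.01, random_state=0) > 0.5
y = sparse.random((1_000_000,), density=0.01, random_state=1) > 0.5
np.logical_xor(x, y)  # warm-up: JIT-compiles the kernel

t = timeit.timeit(lambda: np.logical_xor(x, y), number=20) / 20
print(f"logical_xor on warm kernels: {t:.3e} s")
```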