nortikin / sverchok

Sverchok
http://nortikin.github.io/sverchok/
GNU General Public License v3.0

Performance considerations (discussion) #1990

Closed portnov closed 5 years ago

portnov commented 6 years ago

While trying to improve the performance of the "bend along surface" node, I did some research on whether it is worth moving some parts of the code to C++ (you can see the code at https://github.com/portnov/python_cpp_excersises). Here are the results:

So my conclusions for now are:

Will write some other thoughts later...

zeffii commented 6 years ago

The downside to moving the essential nodes to a compiled language is that it reduces the number of people who can easily modify them (for their own use, or common use).

zeffii commented 6 years ago

we have a few script nodes that do cool stuff but are not fast, like 'random points on polygons' / 'best_candidate_rect_random_points', 'topo sort for profile', '3d voronoi' (some SNmk1, some SNLite)

zeffii commented 6 years ago

i think topo sort would be an interesting example to write / call in C++.

portnov commented 6 years ago

Some ideas:

  1. Have a "Draft" mode of the node tree:
    • Enabled/disabled by a toggle button near "Update tree".
    • Input nodes, such as "A Number", have (and store) a separate value for Draft mode (which is usually smaller, but it depends). It is displayed and used instead of the main value when the tree is in Draft mode.
    • Some nodes have (and store) separate values of their settings for Draft mode.
    • All nodes have programmatic access to this mode (something like if self.tree.draft:). In Draft mode, complex nodes execute a simplified version of their algorithm.
  2. Do fewer tree updates:
    • Have a setting in the addon preferences for how often to recalculate the tree: On any update / On update of node settings / Only manually.
    • Have a setting: when recalculating on a scene change event, use Draft mode.
    • Nodes should detect more precisely whether they need processing. For that:
    • Introduce a new method in the base node class, need_process(self). The default implementation returns any(socket.is_required for socket in self.outputs).
    • Introduce a new property in the base Socket class, is_required, which returns self.is_linked and self.other.need_process().
    • Viewer nodes return the state of their UPD button from the need_process() method.
    • So, if one disables the viewer at the end of a pipeline, the whole pipeline will see that it is disabled and not execute process().
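
The need_process()/is_required idea from point 2 could be sketched roughly like this (Node, Socket and ViewerNode here are illustrative toy classes, not Sverchok's real base classes):

```python
class Socket:
    """Toy stand-in for a node socket, to illustrate the is_required idea."""
    def __init__(self):
        self.is_linked = False
        self.other = None  # the node on the other end of the link

    @property
    def is_required(self):
        # a socket's data is required only if it is linked AND the
        # node downstream of it actually needs to process
        return self.is_linked and self.other.need_process()


class Node:
    def __init__(self, outputs=()):
        self.outputs = list(outputs)

    def need_process(self):
        # default: process only if at least one output is required downstream
        return any(socket.is_required for socket in self.outputs)


class ViewerNode(Node):
    def __init__(self, enabled=True):
        super().__init__()
        self.enabled = enabled

    def need_process(self):
        # a viewer decides by the state of its UPD toggle
        return self.enabled
```

With this wiring, disabling a viewer at the end of a chain makes every upstream node's need_process() report False, so the whole pipeline can be skipped.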
portnov commented 6 years ago

@zeffii I did some investigations about topo_sort:

I implemented Tarjan's algorithm for topological sort (see https://github.com/portnov/python_cpp_excersises/blob/master/benchmark.py#L31) for comparison. This implementation is limited by Python's maximum recursion depth, but it is O(N).

There is the boost::graph C++ library, and it has topological_sort (https://github.com/portnov/python_cpp_excersises/blob/master/greet_binding.cpp#L251). This method sorts 100000 vertices in about 0.15 seconds on my machine; I did not wait long enough for our old algorithm to finish on that amount of data. On 1000 vertices, Tarjan's algorithm in Python is about 3 times slower than the C++/boost version.
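
For reference, the recursion-depth limit can also be avoided in pure Python by using an iterative algorithm. A sketch using Kahn's algorithm (not the linked Tarjan implementation), which is O(V + E) and recursion-free:

```python
from collections import deque

def topo_sort(n_verts, edges):
    """Kahn's algorithm: iterative topological sort, O(V + E).

    `edges` is a list of (u, v) pairs meaning u must come before v.
    Being loop-based rather than recursive, it has no recursion-depth limit.
    """
    adj = [[] for _ in range(n_verts)]
    in_deg = [0] * n_verts
    for u, v in edges:
        adj[u].append(v)
        in_deg[v] += 1

    # start from all vertices that have no prerequisites
    queue = deque(i for i in range(n_verts) if in_deg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            in_deg[v] -= 1
            if in_deg[v] == 0:
                queue.append(v)

    if len(order) != n_verts:
        raise ValueError("graph has a cycle")
    return order
```

This stays in Python (so anyone can modify it) while scaling linearly; the C++/boost version would still win on constant factors.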

portnov commented 6 years ago

"mesh separate loose" node appears to be written relatively good. On the same graph of 20000 vertices it's python code takes 1.8 seconds, while C++ code from boost::graph takes 1.1 seconds.

DolphinDream commented 6 years ago

Btw, I don't think nodes cache their calculations, do they? I.e. if the inputs and the parameters of a node do not change, should they recompute the outputs constantly during animation? Maybe there is a tradeoff between figuring out whether the inputs are the same vs recomputing the outputs, but for some nodes caching may have a performance impact.

zeffii commented 6 years ago

oh. portnov. i would love a C++ (or much faster python) implementation of Intersect Edges :) k thx. haha

portnov commented 6 years ago

@DolphinDream in most cases no, they don't. The main problem with adding "universal caching" now is that most nodes work on the whole set of data, though internally many of them process data item by item. For example, consider the Math node with the Add operation, where some number source is connected to the X input and Y is set to just 1. Imagine we have this on the X input:

[1, 2, 3, 4]

Then the node calculates result = [2, 3, 4, 5]. At first glance it looks like we could have a cache = dict() and do cache[(1, 2, 3, 4)] = [2, 3, 4, 5] (using a tuple key, since lists are not hashable). That would even work. Sometimes. But it would have the following problems:

We could instead put caching logic into the main loop of each node and implement correct caching with respect to the specifics of that node's algorithm. But that would be a lot of code in each node, which someone has to 1) write and 2) maintain. I do not see us having that amount of resources currently.

In our Holy Grail, SverchokRedux, all processing loops are factored out of the nodes into the general update mechanism, so each node processes only one item at a time. There we could implement a generic caching mechanism without each node being aware of caching.
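
If nodes really processed one item at a time, the generic mechanism could be as simple as memoizing the per-item function. A sketch of the idea (add_one and process are hypothetical stand-ins, not SverchokRedux code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def add_one(x):
    # imagine this is the per-item body of a Math node with Y = 1;
    # the update mechanism caches results without the node knowing
    return x + 1

def process(xs):
    # the generic update loop maps the per-item function over the data;
    # repeated items hit the cache instead of being recomputed
    return [add_one(x) for x in xs]
```

The node author writes only the scalar function; caching, vectorization over lists, and invalidation all live in the update machinery.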

DolphinDream commented 6 years ago

@portnov Actually I was not thinking of having the data itself used as the dictionary keys :)

Let me clarify a little bit. A node has a set of input sockets, properties, and output sockets. The output data is generally computed from the input sockets plus the property values. A node could cache its output data once computed and invalidate this cache any time an input changes from one update call to the next.

The cache could be stored as inputCache["socketA"] = socketAData, and with every update call the node itself (or perhaps functionality implemented in a base class) would check whether the current socketA data is the same as the cached value for that socket. If all input socket data matches the previously cached values (and the cached node properties), then the node simply returns the cached output values to the output sockets, outputCache["socketX"]; otherwise, if any input socket or property changed, the output cache is invalidated and the node recomputes the output data and stores it in the output cache for the next update cycle.

note: The keys for the cached values can be the socket names (assumed unique) or property names.

One key aspect to consider is a quick way to determine whether an input changed (perhaps via hashing, etc.). Alternatively, a mechanism that avoids comparing the input data itself is to assign a unique ID to the cached output data within a node. The receiving node simply asks the connected transmitting node for the ID of the data it is sending via the link and caches that ID; if the ID changes, the receiving node knows it needs to invalidate its output cache, since the input data changed.
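
The ID-based variant could be sketched like this (OutputSocket and CachingNode are hypothetical illustrations of the scheme, not Sverchok classes):

```python
import itertools

_ids = itertools.count()  # global generation counter for data IDs

class OutputSocket:
    """Transmitting side: stores data together with a unique ID
    that changes whenever new data is written."""
    def __init__(self):
        self.data = None
        self.data_id = None

    def set_data(self, data):
        self.data = data
        self.data_id = next(_ids)  # new ID => downstream caches invalidate


class CachingNode:
    """Receiving side: recomputes only when the upstream data ID changed."""
    def __init__(self, source, func):
        self.source = source          # upstream OutputSocket
        self.func = func              # the node's actual computation
        self._seen_id = None
        self._cached_output = None
        self.recompute_count = 0      # exposed just for demonstration

    def output(self):
        if self.source.data_id != self._seen_id:
            self._cached_output = self.func(self.source.data)
            self._seen_id = self.source.data_id
            self.recompute_count += 1
        return self._cached_output
```

Comparing two integers per update is far cheaper than deep-comparing or hashing large mesh data; the cost is that rewriting identical data still gets a fresh ID and forces a recompute.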

Example: sv-cachingmockup-demo1

Constant IO nodes:

Variable IO nodes:

Other notes:

portnov commented 6 years ago

Well, this approach can work in general. My only concern is the estimated amount of labor we would need to invest versus the estimated speedup. I'm not ready to say which is bigger right now; I need to think. @zeffii any thoughts about the approach suggested by @DolphinDream?

DolphinDream commented 6 years ago

@portnov I agree. Most nodes would not get a significant boost from this kind of optimization. Some nodes with heavier computation may, depending on the input size etc.

zeffii commented 6 years ago

rather than making caching the focus, I'd prefer to just eliminate any redundant tree updates. The end result is practically the same, with much less memory consumption. Some nodes could cache partially: for example, if verts+edges are unchanged and get converted into a bm, where some operation is applied to each face, the bm can be cached while another socket still passes changed data (like extruding each face with a unique value).

but it's really down to individual nodes.
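
That partial-cache idea amounts to keying the expensive intermediate (the bm) on a fingerprint of the topology inputs, while letting per-face data flow through freely. A Blender-free sketch (FaceOpNode and the dict standing in for a bmesh are illustrative only):

```python
def make_key(verts, edges):
    # cheap fingerprint of the topology inputs; for a draft cache the
    # collision risk is negligible (a full comparison could back it up)
    return hash((tuple(map(tuple, verts)), tuple(map(tuple, edges))))


class FaceOpNode:
    """Rebuilds its expensive intermediate only when verts+edges change."""
    def __init__(self):
        self._key = None
        self._bm = None       # stand-in for the cached bmesh
        self.rebuilds = 0     # exposed just for demonstration

    def _build_bm(self, verts, edges):
        self.rebuilds += 1
        # in Blender this would be the costly bmesh construction
        return {"verts": verts, "edges": edges}

    def process(self, verts, edges, offsets):
        key = make_key(verts, edges)
        if key != self._key:
            self._bm = self._build_bm(verts, edges)
            self._key = key
        # per-face data (offsets) may change freely without a rebuild,
        # e.g. extruding each face by a unique value
        return [(i, off) for i, off in enumerate(offsets)]
```

Animating only the offsets socket then skips the rebuild entirely, which is exactly the "partial" part of the cache.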

portnov commented 6 years ago

@zeffii smth like the need_process() method (https://github.com/nortikin/sverchok/issues/1990#issuecomment-354586202, p. 2)?

DolphinDream commented 6 years ago

Btw, I know that Python creates compiled versions of the py scripts as pyc files. Are these as fast as Cython? I know AN (Animation Nodes) has created a Cython version of their addon to speed it up. Is that possible to do with SV?

zeffii commented 6 years ago

yeah, @portnov exactly that.

portnov commented 6 years ago

@DolphinDream 1) No, pyc files are not actually optimized; they exist only because loading the compiled bytecode is faster than parsing the Python source itself. 2) It is possible, of course; the question is only the required amount of effort.

portnov commented 6 years ago

Hmm. It looks like we already have something like what @DolphinDream suggested; not sure if it works. @zeffii, @DolphinDream please look at core/socket_data.py.

zeffii commented 6 years ago

The term "cache" there is perhaps a misnomer, depending on the perspective. We also have a .has_changed which already suppresses many recalculations, but it applies only to trees, not nodes. A .has_changed for individual nodes, and especially for "terminal" nodes, could still be interesting to explore.

zeffii commented 6 years ago

the socket.sv_get(..., deepcopy=False) switch for sockets deserves a few paragraphs / code examples to show how it can speed up a node's processing duties, but also to explain the implications of letting the data into the node under the explicit agreement that the node won't modify that data; instead it should (lazily) read from it / iterate over it to generate new data.
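
That contract could be illustrated like this (FakeSocket is a toy stand-in just to demonstrate the deepcopy semantics; sv_get's deepcopy parameter itself is the real Sverchok socket API mentioned above):

```python
import copy

class FakeSocket:
    """Toy stand-in for a Sverchok socket, to show the deepcopy contract."""
    def __init__(self, data):
        self._data = data

    def sv_get(self, deepcopy=True):
        # deepcopy=True (the safe default): the node gets its own copy.
        # deepcopy=False: the node gets a shared reference, in exchange
        # for the promise that it will never mutate the data.
        return copy.deepcopy(self._data) if deepcopy else self._data


# good citizen: read-only access, builds NEW data from the input
def scale_verts(socket, factor):
    verts = socket.sv_get(deepcopy=False)  # fast: no copy of the input
    return [[x * factor, y * factor, z * factor] for x, y, z in verts]


# bad citizen: mutates the shared data in place, corrupting it for
# every other node reading from the same socket
def scale_verts_in_place(socket, factor):
    verts = socket.sv_get(deepcopy=False)
    for v in verts:
        v[0] *= factor; v[1] *= factor; v[2] *= factor
    return verts
```

Skipping the deep copy is where the speedup comes from on large vertex lists; the in-place variant shows exactly the failure mode the agreement exists to prevent.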

portnov commented 6 years ago

There is also a fancy "partial update" thing, and it even seems to work somehow. If I change a property in the middle of the tree, only that node and the subsequent ones get .process() called.
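
That behavior amounts to re-processing only the changed node's downstream set. A sketch of computing that set with a breadth-first walk (illustrative, not the actual update code):

```python
from collections import deque

def downstream(changed, links):
    """Return the changed node plus every node reachable from it.

    `links` is a list of (src, dst) pairs describing the node-tree links;
    only the returned nodes would need .process() after the change.
    """
    adj = {}
    for src, dst in links:
        adj.setdefault(src, []).append(dst)

    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Everything upstream of (or parallel to) the changed node is left alone, which is why touching a property in the middle of the tree skips the first half of the pipeline.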

zeffii commented 6 years ago

yes. I think the double update you experienced before digging deeper into this was caused by the update that an object-viewer node triggers after updating the Scene; scene updates also trigger a global update of the node tree.

portnov commented 6 years ago

Regarding https://github.com/nortikin/sverchok/issues/1990#issuecomment-355419718. I had a quick look at that node, and at first glance it appears to use not-very-good algorithms and math. On my examples I can reduce execution time from 12 seconds to 3 seconds with a relatively small amount of code changes. Will do a PR shortly.

zeffii commented 6 years ago

@portnov cool. As long as the Intersect Edges algorithm stays fully 3D, it cannot assume it may ignore the third component (z).

portnov commented 5 years ago

About the topo_sort node and its algorithm... https://habr.com/ru/post/451208/ (in Russian)

nortikin commented 5 years ago

@portnov from your link: https://github.com/slonopotamus/stable-topo-sort/blob/master/test/StableTopoSort.java#L16 https://e-maxx.ru/algo/strong_connected_components Well done, Asakhabiev