ssm-lang / Scoria

This is an embedding of the Sparse Synchronous Model, in Haskell!
BSD 3-Clause "New" or "Revised" License
4 stars 0 forks source link

Efficiency of function pointers #22

Open Rewbert opened 3 years ago

Rewbert commented 3 years ago

This is something @koengit & I spoke about that we want to discuss in the next meeting, so @sedwards-lab & @j-hui can help us understand the situation.

Function pointers are expensive to use since we need to load the unknown code before we can execute the function. Normal calls are easy since the code we are jumping to is already known, so it can be loaded in advance. The type of sv_t right now is this struct:

typedef struct {
  /* Generic SV fields */
  void (*update)(sv_t *);
  struct trigger *triggers;
  peng_time_t last_updated;
  peng_time_t event_time;
  void (*to_string)(sv_t *, char *, size_t);
} sv_t;

The function update is here as a function pointer as that is the only function the runtime system will need to call. All other functions are called by the generated programs (assign_type, later_type etc). When we spoke about these things, it was stated that dereferencing such a function pointer is an expensive operation, and an alternative is to instead insert a tag in the struct that can be used with a switch statement to determine which function to actually call. We will always have a statically known, finite number of these different update functions. E.g update_bool and update_in32 is essentially the same function (as they both copy a 32bit value from one field to another). I think we can get away with a very small number of different functions. This also brings up the question of possibly change the representation of values in sv_ts from a field of a specific type to a field and a size?

The difficulty with generating code for polymorphic functions is that we need to know the type of the thing we are working with in order to generate the proper code (E.g later_int instead of later_bool, when called with an Int). A quick fix is to just put all the type specific functions in the struct as function pointers (instead of just update). Then the type doesn't matter, I guess? However, this incurs additional costs every time we dereference these function pointers. Could we use tags for all of these functions?

How can we prototype this and measure?

Rewbert commented 3 years ago

Information from @svenssonjoel (STM32): Memory access: 1 cycle Flash memory access (where functions would live): Up to 8 cycles

Joel also added: "A tidbit of info, those 8 cycles depend on how fast you clock the MCU. When the STM32 starts up it runs at 16mhz. In this state, flash access is also one cycle"

If flash access is 8 cycles, even a small switch case would be enough to perform worse. We need to try this out and evaluate it :)

j-hui commented 3 years ago

A quick fix is to just put all the type specific functions in the struct as function pointers (instead of just update). Then the type doesn't matter, I guess? However, this incurs additional costs every time we dereference these function pointers. Could we use tags for all of these functions?

My sched fork on @sedwards-lab 's upstream runtime actually pushes the assign and later methods into function pointers as well, originally to support to separately scheduling assignments to different components in an aggregate data structure (see the sched-aggregate branch).

Stephen pointed out that this is inefficient, and I agree, but I'm also curious to know: how inefficient this is in practice, how aggressively we are able to optimize this away, and how far inlining type-specialized code will get us. However, we had to punt this issue, because without an easily extensible suite of example programs (related to #12), exhibiting interesting language patterns (related to #19), we don't have anything to benchmark.

I suspect that if we use tags, we'll end up with something even slower, since that'll involve a function pointer dereference and a switch statement. But what we'll get in return is much lower memory usage (since we only need, say, 8 bits to support 256 distinct update types). Personally, I'm more in favor of representing types as storage classes and only storing their size, but it's hard to argue for that scientifically without the ability to benchmark things.

j-hui commented 3 years ago

If flash access is 8 cycles, even a small switch case would be enough to perform worse.

Is this to say that a switch statement would perform worse than a function pointer dereference (i.e., a jump)?

CC: @svenssonjoel

Rewbert commented 3 years ago

That's how I interpret it, but I am not an expert in this stuff.

j-hui commented 3 years ago

This also brings up the question of possibly change the representation of values in sv_ts from a field of a specific type to a field and a size?

If I understand what you're saying, you're pointing out that for operations like assign and update, we only need a size to implement the update function essentially as a memcpy operation.

In the sched-aggregate branch, I basically did something like this, where types basically represented storage classes (i.e., indicates a size), so that I could do things like generically update data within some offset of a struct. This seemed to work well with C's notion of structs simply representing some memory layout. But I ran into some snags with considering disjoint union types, e.g., something like Either Int32 UInt64. In my opinion, whatever we propose must also efficiently support that (but again, unfortunately our language does not support that kind of type yet).

Rewbert commented 3 years ago

Sounds like you already thought a lot about these things :p

Perhaps for measuring, we can generate some 100 tests and save them so that we can rerun them? We should get a good spread of programs that try out a lot of things.

sedwards-lab commented 3 years ago

Function pointers are expensive compared to inline code. For complex functions, the cost of calling a function pointer is not important, but for something like "update," which is just a data copy, the cost of following a function pointer dwarfs the cost of doing the transfer itself. This is mostly what I meant.

In general, two things slow down modern processors: unpredicted branch destinations and cache misses. Modern branch predictors might be able to handle function pointers, but I'm not sure. It will vary by processor. Caches, also, are highly dependent on the particular processor in question. Fancier systems will be more resistant to these things, but for the sort of resource-limited processors we're targeting, the caches may be small or absent, and there may be little or no branch prediction.

A switch statement probably doesn't help much since you still need to read the data for the predicate (i.e., where to go) and the underlying generated code is fairly similar to calling a function pointer (look up an address in a table and jump to the contents of a register).

The main point is to look at the size of the function/code you're dispatching: if it's really tiny, like a word copy, the cost of calling it can swamp it.

j-hui commented 3 years ago

The main point is to look at the size of the function/code you're dispatching: if it's really tiny, like a word copy, the cost of calling it can swamp it.

For the update function specifically, what if instead of calling a function pointer, we instead stored the size and performed a memcpy in the scheduler? Since memcpy is type-agnostic, it can be inlined, and saves us a function pointer jump in exchange for some logic around the size of the data being copied.

Rewbert commented 3 years ago

That's one alternative, but then we can't e.g implement an update function with side effects (turning on a LED). We'd have to come up with some other way of doing that.

svenssonjoel commented 3 years ago

Yeah, inline code code would be best. As I understand it there is a kind of I-cache on the STM32 (probably NRF52 as well) to deal with the slower access to flash. Following a function-pointer would likely be an i-cache miss. One trick I heard of as a "common speed hack" is to relocate critical code into sram. Ram is uncached, I think, in the F4 but access is fast and uniformly so.

But all that said, if possible it would of course be most efficient with inline code.

Disclaimer: not an expert

j-hui commented 3 years ago

we can't implement e.g. implement an update function with side effects (turning on an LED).

I think we have other options for side-effectful updates that are more efficient. @koengit mentioned a listener pattern where we have special designated processes listening for updates on each output ref, and forwarding updates to the external environment whenever those variables are updated. An optimization on that is to introduce an "output trigger" that is responsible for the same thing. Another completely different approach would be to use type classes to figure out which call sites need to additionally call some output handler (though this might not be possible in an EDSL).

sedwards-lab commented 3 years ago

Postponing side-effects for updates is almost certainly slower than dispatching something special when the update is known. What we want is use function pointers when we have to (e.g., when we know that something might have a side-effect) and avoid them when we can. One way to do this would be to use typeclasses but insist that they are inlined where possible.

Typeclasses are exactly function-pointer-like things and something I definitely want in the language (basically, type-safe virtual methods), so perhaps we should be thinking along those lines. I like the Haskell approach of making everything potentially virtual then resolving and inlining nearly everything at compile-time.

Rewbert commented 3 years ago

So you mean implementing type classes in C? I've never seen that, do you have any recommended reading? :)

sedwards-lab commented 3 years ago

Type classes are just like virtual methods: you pass a pointer to a collection of function pointers and invoke the appropriate method/function as needed. It's like doing virtual methods in C: every object has a pointer to a virtual table object, which is just a struct full of function pointers. At a call site, you follow the pointer and invoke the function through the pointer.

j-hui commented 3 years ago

It doesn't have to be all function pointers though, right? If you insist on monomorphizing all functions at compile time (which I believe is what Rust does by default) you can recover the concrete type, which allows you to inline the type-specialized function.

sedwards-lab commented 3 years ago

If you know the concrete type at compile-time, it becomes essentially constant propagation and you don't need a virtual table or a function pointer dispatch at all.

Rewbert commented 3 years ago

Monomorphization would be a lot of work but would solve this issue (if we pursue something like this, definitely after the paper). It would produce much bigger code though, as we would duplicate code only to please the type system. So there's a trade of between efficiency and memory usage between the two approaches we discussed so far.