Some quick benchmarks show that an FPU store+restore takes about 32ns on my system (i7 4th gen 3.6GHz).
A single FPU store requires about 16ns.
A ping-pong IPC round trip requires two context switches, which means 64ns of overhead for FPU state.
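To make the arithmetic explicit, here is a small sketch using the figures above. The function names are made up for illustration. The store-only case assumes the restore accounts for the remaining 16ns of the 32ns store+restore, which is my reading of the numbers and not something measured separately.

```c
#include <assert.h>

/* Figures quoted in the text (i7 4th gen, 3.6GHz). */
enum { FPU_STORE_NS = 16, FPU_STORE_RESTORE_NS = 32 };

/* Eager scheme: every context switch pays a full store + restore. */
static int eager_fpu_overhead_ns(int context_switches)
{
    assert(context_switches >= 0);
    return context_switches * FPU_STORE_RESTORE_NS;
}

/* Best case for a lazy store: every store is skipped, the restore is not.
 * ASSUMPTION: restore cost = store+restore cost minus store cost. */
static int lazy_store_best_case_ns(int context_switches)
{
    assert(context_switches >= 0);
    return context_switches * (FPU_STORE_RESTORE_NS - FPU_STORE_NS);
}
```

With two context switches per ping-pong round trip, `eager_fpu_overhead_ns(2)` gives the 64ns mentioned above, and the best case for skipping stores would be half of that under the stated assumption.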
Doing a lazy FPU store + restore the classic way requires disabling the FPU and relying on an ISR to re-enable and restore.
However, in my measurements this approach costs more than the 32ns it is supposed to save.
Besides the overhead issue, there's also the lazy FPU restore vulnerability. Because of it, if a lazy mechanism is implemented at all, it can only be applied to storing the FPU state: the current task must always have its own FPU state loaded.
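For reference, the classic lazy scheme described above can be modelled roughly like this. It is a user-space simulation, not kernel code: the `ts` field stands in for the CR0.TS bit, `nm_handler` for the device-not-available ISR, and `memcpy` for the real save/restore instructions. All struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FPU_STATE_BYTES 512  /* buffer size for the simulated FPU state */

struct task {
    unsigned char fpu_state[FPU_STATE_BYTES];  /* saved FPU context */
};

/* Simulated CPU: stands in for the real FPU registers and CR0.TS. */
struct cpu {
    unsigned char fpu_regs[FPU_STATE_BYTES];
    bool ts;                 /* while set, FPU use traps */
    struct task *current;
    struct task *fpu_owner;  /* task whose state is live in fpu_regs */
    unsigned traps;          /* number of simulated traps taken */
};

/* Context switch: no FPU work at all, just arm the trap. */
static void switch_to(struct cpu *c, struct task *next)
{
    c->current = next;
    c->ts = true;
}

/* Models the ISR: re-enable the FPU and swap state if needed. */
static void nm_handler(struct cpu *c)
{
    c->traps++;
    c->ts = false;
    if (c->fpu_owner == c->current)
        return;  /* registers already belong to this task */
    if (c->fpu_owner)
        memcpy(c->fpu_owner->fpu_state, c->fpu_regs, FPU_STATE_BYTES);
    memcpy(c->fpu_regs, c->current->fpu_state, FPU_STATE_BYTES);
    c->fpu_owner = c->current;
}

/* Models the current task executing an FPU instruction. */
static void fpu_use(struct cpu *c, unsigned char value)
{
    if (c->ts)
        nm_handler(c);
    assert(c->fpu_owner == c->current);
    c->fpu_regs[0] = value;
}
```

Note that between `switch_to` and the first trap, `fpu_regs` still holds the previous task's values. As far as I understand it, that window is what the vulnerability is about, and it is why the restore has to stay eager.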
An alternative method to implement a lazy FPU store is relying on the compiler.
If we could detect the use of FPU instructions in a basic block, we could insert an instruction in that basic block which sets a flag, something like this:
movb $1, %fs:0 // %fs is the TLS segment register; the flag is assumed to live at offset 0
This would mean that the hardware overhead for lazy FPU store is replaced by a single move instruction, which should be pretty cheap.
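A rough sketch of how the context switch could consume that flag, again as a user-space simulation with made-up names. The `fpu_used` field stands in for the byte at `%fs:0`, `memcpy` stands in for the real store/restore, and the write in `fpu_use` is what the compiler-inserted `movb` would do. The restore stays eager, so the current task always has its own state loaded.

```c
#include <assert.h>
#include <string.h>

#define FPU_STATE_BYTES 512  /* buffer size for the simulated FPU state */

struct task {
    unsigned char fpu_state[FPU_STATE_BYTES];  /* saved FPU context */
    volatile unsigned char fpu_used;           /* the flag at %fs:0 in the text */
};

struct cpu {
    unsigned char fpu_regs[FPU_STATE_BYTES];   /* simulated FPU registers */
    struct task *current;
    unsigned stores;                           /* how many stores were actually done */
};

/* A basic block containing FPU instructions, as the compiler would emit it. */
static void fpu_use(struct cpu *c, unsigned char value)
{
    c->current->fpu_used = 1;  /* stands in for: movb $1, %fs:0 */
    c->fpu_regs[0] = value;    /* stands in for the actual FPU work */
}

static void switch_to(struct cpu *c, struct task *next)
{
    struct task *prev = c->current;

    assert(next);
    /* Lazy store: skip it if prev never touched the FPU in this time slice.
     * Its saved copy is still valid because the restore below is eager. */
    if (prev && prev->fpu_used) {
        memcpy(prev->fpu_state, c->fpu_regs, FPU_STATE_BYTES);
        prev->fpu_used = 0;
        c->stores++;
    }
    /* Eager restore: next always runs with its own state loaded. */
    memcpy(c->fpu_regs, next->fpu_state, FPU_STATE_BYTES);
    c->current = next;
}
```

The flag has to be cleared on every store, otherwise a task that used the FPU once would pay for the store on every later switch.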