Hi folks,
This library is accelerated using assembly. Have you made any performance comparison of using cgo for corresponding acceleration? Although cgo itself has a remarkable overhead around 100ns for each call, it might be negligible given larger batch inputs. On the other hand, assembly itself also has the shortcomings such that it could not be inlined, so it would be interesting if performance comparison is available. Thank you~
Hi folks, This library is accelerated using assembly. Have you made any performance comparison of using cgo for corresponding acceleration? Although cgo itself has a remarkable overhead around 100ns for each call, it might be negligible given larger batch inputs. On the other hand, assembly itself also has the shortcomings such that it could not be inlined, so it would be interesting if performance comparison is available. Thank you~