RFC: strided array function interfaces

I’ve got a proof-of-concept strided array interface supporting multiple dispatch (i.e., an interface which delegates to a type-aware implementation) for computing the absolute value, with N-API support. See https://github.com/stdlib-js/stdlib/tree/62cb2f0ea67ad09deb990d6e86e236785c848856/lib/node_modules/%40stdlib/strided/math/special/abs.

What does this mean? This is part of an effort to bring something comparable to NumPy’s ufuncs ("universal functions"; a.k.a., element-wise functions).

What would this allow? This would allow us to provided strided interfaces to all of our base mathematical functions (all ~160 of them) and any other applicable numerical API (including distributions). And would allow us to be another step closer to something resembling NumPy capabilities.

What is needed? Would be great to get initial design feedback from @Planeshifter and @rreusser (and anyone else for that matter) so we can figure out the best approach for ensuring that strided interfaces can be added to stdlib with as little work as possible and for ensuring that we can minimize the possibility of future refactoring. Once we begin rolling out strided array interfaces, it may become significantly harder to make changes due to the need to update many packages. So would be good to have confidence that we’ve nailed down the design now, rather having to do a large refactor later.

I am particularly interested in feedback on the JavaScript side of things: https://github.com/stdlib-js/stdlib/tree/62cb2f0ea67ad09deb990d6e86e236785c848856/lib/node_modules/%40stdlib/strided/math/special/abs/lib

…and on the C side of things: https://github.com/stdlib-js/stdlib/tree/62cb2f0ea67ad09deb990d6e86e236785c848856/lib/node_modules/%40stdlib/strided/math/special/abs/src/addon.cpp

On the C side of things, one thing I’d like to achieve is that a strided interface need only create the StridedFunctionObject. The add-on namespace could be hidden away in a macro/interface which an author would include at the very end, similar to how N-API works for performing module initialization.

Update: I've implemented a macro which allows add-on authors to omit any N-API logic.

On the JavaScript side of things, I am not sure how we get around needing several files with source documentation. All the files are fairly short. The only "real" code is that which wraps the native add-on interfaces, which is necessary for calling into C when we’ve been provided generic arrays. Source code documentation is useful for JSDoc documentation. Having the interface types in types.json is convenient for dynamically generating tests and benchmarks (i.e., having the ability to require the types.json file and then dynamically creating arrays satisfying the type requirements).

We may want some way for a user to query what types are supported by a given strided API. What would this look like? In NumPy, you can do np.sin.types which returns a list of supported input types.

Update: I've added support for querying strided array function meta data (e.g., nargs, nin, nout, and types), similar to NumPy's ufuncs.

One thing to note is that I’ve not included support for automatic casting and/or detection of overlapping buffers. My sense is that these behaviors, if desired, can be supported by wrapping these "lower-level" interfaces and performing the casting and/or copies there.

Update: decided to punt these concerns to userland. A user can avoid overlapping buffer issues by manipulating the array strides (e.g., reversing them) depending on the direction of overlap.

Thoughts on all the above?

I'm adding some benchmarks here to help provide additional context and implications of the current work.

Here are the benchmarks for the strided array interface, as currently implemented:

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/abs/benchmark/benchmark.ndarray.native.js
TAP version 13
# @stdlib/strided/math/special/abs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.542590328
  rate: 1843011.1050560416
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.546170667
  rate: 1830929.5251844784
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.54911345
  rate: 1821117.29370315
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.825757
  rate: 1211010.0186858845
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.820846319
  rate: 1218254.838759897
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.822068907
  rate: 1216443.0396100604
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.346216651
  rate: 288836.4834884848
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.339341522
  rate: 294688.37002505106
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.343990798
  rate: 290705.4507894133
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.30209105
  rate: 33102.602675584065
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.301467197
  rate: 33171.104848266456
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.302045479
  rate: 33107.597018527136
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.293969017
  rate: 3401.7190321795033
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.287274552
  rate: 3480.9905473283966
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.299388098
  rate: 3340.1461403452317
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.306394113
  rate: 326.377028007715
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.307819325
  rate: 324.86589332882204
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/abs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.303846911
  rate: 329.1131039341058
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/abs/benchmark/benchmark.native.js
TAP version 13
# @stdlib/strided/math/special/abs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.405647482
  rate: 2465194.644053035
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/abs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.412639994
  rate: 2423419.9654432912
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/abs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.408355417
  rate: 2448847.1521855677
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.691814215
  rate: 1445474.7796704352
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.679503589
  rate: 1471662.5727785523
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.733919673
  rate: 1362546.9336614988
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.36152586
  rate: 276605.3858498532
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.326910484
  rate: 305894.1358393388
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.313741831
  rate: 318733.3983526092
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/abs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.291659503
  rate: 34286.55640272417
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/abs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.302755194
  rate: 33029.98659702598
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/abs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.298205107
  rate: 33533.96627107395
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.27920722
  rate: 3581.5692731728072
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.288315077
  rate: 3468.427702100366
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/abs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.297698562
  rate: 3359.102554213883
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.304977388
  rate: 327.8931617054835
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.298957976
  rate: 334.4951733283075
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/abs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.304128663
  rate: 328.8082057559961
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/abs/benchmark/benchmark.js
TAP version 13
# @stdlib/strided/math/special/abs:len=10
  ---
  iterations: 1000000
  elapsed: 0.338511632
  rate: 2954108.2357843467
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/abs:len=10
  ---
  iterations: 1000000
  elapsed: 0.327715766
  rate: 3051424.7520212377
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/abs:len=10
  ---
  iterations: 1000000
  elapsed: 0.330496007
  rate: 3025755.1644186736
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/abs:len=100
  ---
  iterations: 1000000
  elapsed: 0.691415536
  rate: 1446308.2588297522
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/abs:len=100
  ---
  iterations: 1000000
  elapsed: 0.678599439
  rate: 1473623.381524782
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/abs:len=100
  ---
  iterations: 1000000
  elapsed: 0.699995137
  rate: 1428581.3531301718
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/abs:len=1000
  ---
  iterations: 100000
  elapsed: 0.445000866
  rate: 224718.66380592616
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/abs:len=1000
  ---
  iterations: 100000
  elapsed: 0.443071861
  rate: 225697.02299374863
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/abs:len=1000
  ---
  iterations: 100000
  elapsed: 0.444303802
  rate: 225071.22277562684
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/abs:len=10000
  ---
  iterations: 10000
  elapsed: 0.772345405
  rate: 12947.574926013835
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/abs:len=10000
  ---
  iterations: 10000
  elapsed: 0.77681634
  rate: 12873.05568263407
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/abs:len=10000
  ---
  iterations: 10000
  elapsed: 0.77653087
  rate: 12877.788103903713
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/abs:len=100000
  ---
  iterations: 1000
  elapsed: 0.83744988
  rate: 1194.1013114719176
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/abs:len=100000
  ---
  iterations: 1000
  elapsed: 0.851412887
  rate: 1174.5182804591493
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/abs:len=100000
  ---
  iterations: 1000
  elapsed: 0.839253739
  rate: 1191.5347570468198
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/abs:len=1000000
  ---
  iterations: 100
  elapsed: 0.870049886
  rate: 114.93593828250901
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/abs:len=1000000
  ---
  iterations: 100
  elapsed: 0.857157195
  rate: 116.66471515764387
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/abs:len=1000000
  ---
  iterations: 100
  elapsed: 0.851565792
  rate: 117.4307386927069
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/abs/benchmark/benchmark.ndarray.js
TAP version 13
# @stdlib/strided/math/special/abs:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.393669522
  rate: 2540201.7278848425
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.38124097
  rate: 2623012.9463787694
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.374995224
  rate: 2666700.6297658873
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.659446001
  rate: 1516424.3902966667
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.660899761
  rate: 1513088.760218208
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.656215304
  rate: 1523890.0920238213
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.380436619
  rate: 262855.87402930844
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.36811588
  rate: 271653.5890817859
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.372928833
  rate: 268147.6763154969
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.753502108
  rate: 13271.363004600911
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.736914559
  rate: 13570.094223094322
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.740829738
  rate: 13498.37821980089
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.792041357
  rate: 1262.5603337023726
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.803587885
  rate: 1244.418959850297
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.806175121
  rate: 1240.425279757548
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.825726727
  rate: 121.10544170383865
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.822104266
  rate: 121.63907199576556
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/abs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.822530554
  rate: 121.57603084006531
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

and here are the benchmarks for dabs, a strided array interface only supporting double-precision floating-point numbers.

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/dabs/benchmark/benchmark.ndarray.native.js
TAP version 13
# @stdlib/strided/math/special/dabs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.435212237
  rate: 2297729.5098437225
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.437873682
  rate: 2283763.6540119806
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=10
  ---
  iterations: 1000000
  elapsed: 0.441242788
  rate: 2266325.993751993
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.688217091
  rate: 1453029.8841416016
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.698768618
  rate: 1431088.8815559258
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.675800943
  rate: 1479725.6653132548
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.311408527
  rate: 321121.5857297318
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.310411062
  rate: 322153.4675848633
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.308695609
  rate: 323943.7072783242
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.269504939
  rate: 37105.07138423908
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.278106815
  rate: 35957.40722858589
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.268065463
  rate: 37304.32069870933
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.265874271
  rate: 3761.17627417961
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.264879564
  rate: 3775.300687221004
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.262809734
  rate: 3805.0341012102695
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.282986746
  rate: 353.3734403235973
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.278570983
  rate: 358.9749331501623
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/dabs::native:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.284860537
  rate: 351.0489766436128
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/dabs/benchmark/benchmark.native.js
TAP version 13
# @stdlib/strided/math/special/dabs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.303859275
  rate: 3290997.1235862393
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.306635692
  rate: 3261198.960491527
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=10
  ---
  iterations: 1000000
  elapsed: 0.303345013
  rate: 3296576.364022836
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.551985124
  rate: 1811643.0253652995
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.55578898
  rate: 1799244.0224345578
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100
  ---
  iterations: 1000000
  elapsed: 0.61533631
  rate: 1625127.5664197356
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.288136087
  rate: 347058.2287736836
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.293719636
  rate: 340460.7242533829
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000
  ---
  iterations: 100000
  elapsed: 0.310028647
  rate: 322550.8383423678
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.291104736
  rate: 34351.8973184964
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.270915162
  rate: 36911.92447914746
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=10000
  ---
  iterations: 10000
  elapsed: 0.26407803
  rate: 37867.59542245903
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.26354886
  rate: 3794.3628365533436
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.260865537
  rate: 3833.392526664034
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=100000
  ---
  iterations: 1000
  elapsed: 0.265353175
  rate: 3768.562407440574
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.279323951
  rate: 358.0072515872439
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.289404916
  rate: 345.5366321420746
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/dabs::native:len=1000000
  ---
  iterations: 100
  elapsed: 0.281906876
  rate: 354.72706951638884
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/dabs/benchmark/benchmark.js
TAP version 13
# @stdlib/strided/math/special/dabs:len=10
  ---
  iterations: 10000000
  elapsed: 0.302783518
  rate: 33026896.794296447
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/dabs:len=10
  ---
  iterations: 10000000
  elapsed: 0.304417186
  rate: 32849656.52366289
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/dabs:len=10
  ---
  iterations: 10000000
  elapsed: 0.297130217
  rate: 33655277.813767426
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/dabs:len=100
  ---
  iterations: 1000000
  elapsed: 0.189698545
  rate: 5271521.718840806
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/dabs:len=100
  ---
  iterations: 1000000
  elapsed: 0.190088704
  rate: 5260701.866850542
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/dabs:len=100
  ---
  iterations: 1000000
  elapsed: 0.193113801
  rate: 5178293.80821933
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000
  ---
  iterations: 100000
  elapsed: 0.191335675
  rate: 522641.6871814417
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000
  ---
  iterations: 100000
  elapsed: 0.176457849
  rate: 566707.5767199225
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000
  ---
  iterations: 100000
  elapsed: 0.174965862
  rate: 571540.064198352
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/dabs:len=10000
  ---
  iterations: 10000
  elapsed: 0.489760559
  rate: 20418.1406939304
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/dabs:len=10000
  ---
  iterations: 10000
  elapsed: 0.503766985
  rate: 19850.44732536413
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/dabs:len=10000
  ---
  iterations: 10000
  elapsed: 0.580820838
  rate: 17217.01314028957
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/dabs:len=100000
  ---
  iterations: 1000
  elapsed: 0.677884666
  rate: 1475.1771948179753
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/dabs:len=100000
  ---
  iterations: 1000
  elapsed: 0.628820705
  rate: 1590.2784244357858
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/dabs:len=100000
  ---
  iterations: 1000
  elapsed: 0.649244846
  rate: 1540.2509641177805
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000000
  ---
  iterations: 100
  elapsed: 0.650647435
  rate: 153.69306727536704
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000000
  ---
  iterations: 100
  elapsed: 0.641909981
  rate: 155.78508351625086
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/dabs:len=1000000
  ---
  iterations: 100
  elapsed: 0.676401649
  rate: 147.8411534741838
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

Running benchmark: /Users/kgryte/Coding/projects/personal/stdlib/stdlib/lib/node_modules/@stdlib/strided/math/special/dabs/benchmark/benchmark.ndarray.js
TAP version 13
# @stdlib/strided/math/special/dabs:ndarray:len=10
  ---
  iterations: 10000000
  elapsed: 0.366156019
  rate: 27310762.30102884
  ...
ok 1 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=10
  ---
  iterations: 10000000
  elapsed: 0.363239379
  rate: 27530054.77415487
  ...
ok 2 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=10
  ---
  iterations: 10000000
  elapsed: 0.364019379
  rate: 27471064.940199245
  ...
ok 3 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.256248388
  rate: 3902463.573741584
  ...
ok 4 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.248516694
  rate: 4023874.549047397
  ...
ok 5 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100
  ---
  iterations: 1000000
  elapsed: 0.254779693
  rate: 3924959.592442872
  ...
ok 6 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.254221689
  rate: 393357.46841017966
  ...
ok 7 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.248581277
  rate: 402282.91207949666
  ...
ok 8 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000
  ---
  iterations: 100000
  elapsed: 0.245398908
  rate: 407499.7758343733
  ...
ok 9 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.450701244
  rate: 22187.646768509916
  ...
ok 10 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.433642286
  rate: 23060.481698502994
  ...
ok 11 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=10000
  ---
  iterations: 10000
  elapsed: 0.451250753
  rate: 22160.62784054789
  ...
ok 12 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.689415555
  rate: 1450.5039707147312
  ...
ok 13 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.75929212
  rate: 1317.0161702718578
  ...
ok 14 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=100000
  ---
  iterations: 1000
  elapsed: 0.749427405
  rate: 1334.3520577553472
  ...
ok 15 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.7488877
  rate: 133.5313692560313
  ...
ok 16 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.735611737
  rate: 135.9412784899597
  ...
ok 17 benchmark finished
# @stdlib/strided/math/special/dabs:ndarray:len=1000000
  ---
  iterations: 100
  elapsed: 0.729081344
  rate: 137.1589066473219
  ...
ok 18 benchmark finished
#
1..18
# total 18
# pass  18
#
# ok

A couple of observations:

The strided array interface performs comparably to the type specific interface, especially for larger datasets. This is expected, as most of the time for larger arrays should be spent in the loops, which are essentially the same.
For shorter arrays, we're better off sticking with JavaScript (N <= 100), as the relative overhead of calling into C is far greater.
The performance discrepancy can be largely linked to more extensive argument munging with the strided array interface. This is expected, as we need to figure out input array types and then dispatch to the relevant implementation.
Overall, I think the current implementation performs quite well.

The first thing that strikes me is that data.js seems to be the central feature of this implementation, but it's not readily apparent to me what it actually means (the grouping, in particular), despite having worked through it just a couple days ago.

https://github.com/stdlib-js/stdlib/blob/7fb13b1cdcb55a60bfa1a343932793439c716a85/lib/node_modules/%40stdlib/strided/math/special/abs/lib/data.js#L33-L39

It would be nice if the meaning could be communicated somehow so that newcomers can understand the motivation, maintenance, and extension more easily.

Ah, I see now that I failed to view it together with https://github.com/stdlib-js/stdlib/blob/7fb13b1cdcb55a60bfa1a343932793439c716a85/lib/node_modules/%40stdlib/strided/math/special/abs/lib/types.json, which seems to contain the information necessary to decode it.

@rreusser data.js mirrors the C API, which can be found here. In short, "data" refers to whatever data should be provided to an "apply" function during invocation. Here, the data is comprised of callbacks. However, the general concept is more broadly applicable to any sort of function data, such as meta data, hashes, additional parameters, callbacks, etc. In C, data is provided as an opaque pointer.

And yeah, the "grouping" was my way of grouping related functionality (e.g., these callbacks are for the int32 interfaces, these are for the uint16 interfaces, etc).

I should state that the strided array function data (i.e., types.json, data.js, abs.functions.js, etc) do not need to be split over multiple files; they can be included in the same file, as in C. I chose to split into separate files partly for convenience and partly to allow for reuse of such data in other contexts (e.g., data.js to be reused for both the main interfaces and the ndarray interfaces, types.json to allow for independent consumption when generating tests and benchmarks, etc).

The second thing of note is that you're right—the number of files does make for a fair bit of maintenance. Correct me if I'm wrong though, but the unit cost of importing one additional strided function interface is quite small, right? All of the js files I see lean on common implementation and, using a bundler with hoisting and whatnot, would compile down to perhaps 1kb on the high side.

Yes, that should be true. The goal would be to abstract away the underlying implementation (i.e., the loops), so bundles should be relatively small. This is unlike in C where we have to rely on macros which then expand into multiple loop implementations. In JS, there would only be one.

The splitting makes sense and I understand the motivation for testing and repurposing. The downside, of course, is that it requires more knowledge, context, and navigation in order to comprehend and work with, e.g. matching up lines across multiple files. If you delete a line from data.json, you have to track down every other parallel list and ensure they are all synchronized.

Yes, that is correct. abs is a bit of an extreme case, given the number of accepted input and output types. For sin, for example, the number of accepted type combinations would be significantly smaller, as the only accepted output types would be float32 and float64. This is true of other transcendental functions. Part of my motivation for using abs was to see how imposing the maintenance burden would be for a more "extreme" example.

...but you're right that there is fairly significant room for developer error when writing these interfaces (e.g., matching elements across the various meta data arrays).

Yeah, it's not a strong feeling. I always consider the elm lang thing, "Make impossible states impossible." That is, if inconsistent input (parallel calling signature lists of different lengths) is possible, then maybe it's not a matter of validation but a sign the input format is suboptimal. Following this to the logical end, input would look like (very hand-wavily speaking):

module.exports = [
  { args: [ 'uint8' ], fcn: absl },
  { args: [ 'float32' ], fcn: absf },
  ...
]

Which probably is too far. Or just

module.exports = [
  [ [ 'uint8' ], absl ],
  [ [ 'float32' ], absf ],
]

or even just collocating with

module.exports = [
  [ 'uint8', 'float32' ],
  [ absl, absf ]
]

or some other 'compressed' representation. Whether this is actually any better or more preferable is definitely not something I'm willing to commit to. I'd only offer that it might be possible to structure the input so that it's not just clearer but so that it's possible to make inconsistent input impossible without having to validate anything.

Finally, random example: cpow( complexArray, complexPower ). I assume but don't fully understand the implementation well enough to see, that there's some allowance for a second argument which could be a constant for which "Number" is not sufficient? Or more likely, would this be a case for broadcasting and the complex power would be implicitly converted into an element-wise operation?

Re: compressed form. I suppose one design goal toward which I biased the implementation was to make the JS implementation mirror the C implementation. In C, we need separate lists due to types. In JS (to state the obvious), we are not bound by such restrictions.

This stated, the versions you provided above would also be reasonable. And for that matter, could easily be supported via utility functions which take one of the above forms and then map to the current form. But this would mean that there'd be more than one way to do the same thing which may or may not be desirable.

Re: broadcasting. This is not explicitly supported in the sense that you cannot simply specify number. The reasons have to do with how this would work in the C layer where it is not clear how a JS number should be cast. NumPy has logic for trying to determine the minimum type which can hold a scalar constant and then types that value accordingly. From there, the scalar constant is allowed to participate in type promotion.

Here, I've chosen not to allow scalar constants to be provided in the current implementation to avoid the added complexity and ambiguous design decisions. Instead, in order to broadcast, a user simply provides a strided array having a "stride" of 0. This ensures that, during the loop, the same array element is always accessed during each loop iteration. This also means that a user is responsible for explicitly typing a value to broadcast (e.g., new Int32Array( [ 3 ] ) or new Float64Array( [3.0 ] ), etc).

Yeah, honestly, I think the way you designed it is the best option. It's straightforward, probably not that common to work with, and can at least be managed with some simple validations. 😄

At the JS layer, I did include validation. See here. But this does not provide the sort of locality which would make identifying bugs easier.

Re: how common. Well, if we decide to go with the current implementation, it will be rolled out to at least 200 packages in stdlib. 😅

We may want some way for a user to query what types are supported by a given strided API. What would this look like? In NumPy, you can do np.sin.types which returns a list of supported input types.

Introspection is always nice. The design suggests this would be a pretty straightforward addition, no?

One thing to note is that I’ve not included support for automatic casting and/or detection of overlapping buffers. My sense is that these behaviors, if desired, can be supported by wrapping these "lower-level" interfaces and performing the casting and/or copies there.

Re: overlapping buffers, I can imagine strange use cases with interleaved real/complex, for example, where the input and output may legitimately overlap, at least when treated as a large block, though the individual components may never intersect. Is that indeed the case? It seems better to push that behavior elsewhere. Same for casting. Would something else need to receive the operator's type signature and cast accordingly before passing it off to this interface? Does that relate to the introspection in the first part of this question?

Re: introspection. Yes, this would be a straightforward addition. Can hang a method off the dispatch function (see here) quite easily based on the provided types array. So this would not have to be anything a developer would have to do. And does not affect how things are structured in terms of files, so long as the types are provided to the dispatch factory as a list.

Re: overlapping buffers. The more pressing case is, say, sin( 10, x, 1, y, -1 ), where x and y point to the same memory (i.e., we're doing an in-place operation; however, we're doing this by putting the first result in the last element). In this case, once we get to x[5], we've already overwritten memory, thus producing unexpected results. In NumPy world, we'd detect this condition and perform a copy of x before proceeding with computation.

My thinking is that this should be pushed elsewhere, perhaps to the ndarray abstraction layer, where we're willing to take a greater perf hit for the sake of convenience and where we already have the concern of "views". Here, we assume more ideal conditions.

Re: casting. Agreed. My take is that casting can be something which is added as an abstraction layer above. For the current implementation, we simply assume that, if you don't have acceptable input types, you'll have casted them accordingly before invoking the strided array function. In short, a user is forced to be explicit. Things like casting can also be punted to, e.g., the ndarray abstraction layer.

Re overlapping: sorry, in particular I was trying to invent a case where trivially "overlapping" use of buffers may not lead to conflicting memory access, so that an incomplete, overzealous check would be problematic. I'm not sure how complex it is to check two strided array patterns for overlapping access. I think there's a chance it might actually be somewhat complicated (lots of LCD/GCD and maybe some lattice sort of thing?) or at least rather expensive to fully and correctly check for intersections.

You are correct! Not straightforward at all. NumPy provides two APIs: may_share_memory and shares_memory, with the former just performing bounds checks, and the latter using a much more complicated algo which can be found here.

Concerning the data.js and types.json debate, the solution most in line with the C file would be to have them both hanging off of one main export. It sounds like you decided against that due to the need to independently require types.json. My question is how be much of a problem it would be to require some other unnecessary files in say tests when doing var types = require( '...').types, as neither performance nor a small bundle size would be the main concern. or would they?

I guess the question is also how often you envision one having to add new type signatures or re-arranging the ones in these files. If not super-rare, then having them colocated might be a good idea I think.

@Planeshifter I think the problem is not so much that the info is split across multiple files, but that the info is split across multiple data structures. Because the data resides in separate lists, the data requires more effort be kept in sync. This stands in contrast to @rreusser's other proposals where function data and types can be found in the same data structure.

Yeah, understood and the arguments put forth made sense to me, but I gathered that preserving a comparable format like in C was outweighing the benefits of having them both in one data structure. I was just inquiring about how big of a benefit is achieved in having them in two separate files. Although I suppose one could also add a one-line comment in data.js explaining where to find the corresponding types or just amend the existing NOTE.

Concerning the question of when to call into C becomes worthwhile, you show benchmarks above showing that >100 might be a good cutoff. How much variation do we expect across packages and would it be a viable strategy to set a magic constant for each package as a cutoff for when to start calling into C?

Re: separate files. Yeah, I don't know if having them in the same file actually achieves much, especially if the respective lists are long. It would just create a long file and require developers to scroll up and down to match elements.

I think you are right that, at the very least, we should add/update a comment in data.js to point to the corresponding types file.

Re: heuristics. I really don't know. And I don't know how reproducible this is across platforms. If I'd hazard a guess, I'd guess that the heuristic should be more or less the same, but we'd definitely want to run the benchmarks to confirm before hardcoding. I can imagine for more compute intensive functions (say, functions which need to perform internal iteration to converge), the magic constant could be smaller. For things like an identity function, the magic constant could larger. So I don't have a good handle on variability and whether we can just count on copying-and-pasting implementations without confirming the constant's applicability. In a more ideal world, we'd do "burn ins" on install and tailor the heuristic for each implementation to the host platform.

I've updated the OP with some updates.

Re: data.js. I've added a comment pointing to the types.json file.

Re: heuristics. I tried adding a heuristic; however, I encountered unexpected results. Namely, if I only benchmarked a single set of types (e.g., float64 input and output arrays), using a heuristic had the intended effect: for shorter arrays, we could match JS performance when invoking the native add-on interface, and, for longer arrays, we could call into C to exceed JS performance. However, once I began providing different typed array data types to the same interface, I found that V8 seemed to de-optimize the interfaces, resulting in worse JS performance for shorter arrays compared to the native interfaces. My hypothesis is that V8 de-optimized due to polymorphism. Accordingly, as the intent of these multiple-dispatch interfaces is to support multiple input array data types, I chose not to include any heuristics. While calling into the native add-ons is not the most performant when the array types are severely constrained, the native add-ons have more consistent performance and don't seem to hit the same perf cliff as the JS implementations do.

At this point, I am reasonably content with the strided array interfaces (e.g., abs). We still have the issue of several files in ./lib, but not sure we can abstract away any more of the implementation to make one or more files unnecessary. If anyone has any ideas, let me know.

stdlib-js / stdlib

RFC: strided array function interfaces #364