stclib / STC

A modern, user friendly, generic, type-safe and fast C99 container library: String, Vector, Sorted and Unordered Map and Set, Deque, Forward List, Smart Pointers, Bitset and Random numbers.
MIT License
1.35k stars 74 forks source link

Feature request: custom type for `c_forrange` and `crange` #83

Closed SylvanBrocard closed 2 months ago

SylvanBrocard commented 9 months ago

Feature

I'd like for crange and c_forrange to take custom types like the STC containers.

Rationale

There are too many lost optimization opportunities by having everything be intptr_t (a.k.a. long long).

Benchmarks

This is my litmus test for functional-style chaining: summing the square of even integers in a range.

STC code

Here is the STC code:

#include <stdio.h>
#include "stc/algorithm.h"

int summing_squared_evens(int a, int b)
{
    crange r1 = crange_make(a, b);

    int sum = 0;
    c_filter(crange, r1
        , *value % 2 == 0
        && c_flt_map(*value * *value)
        && (sum += *value, 1)
    );

    return sum;
}

int main(int argc, char const *argv[])
{
    int b;
    if(scanf("%d", &b) != 1) return 1;
    printf("%d\n", summing_squared_evens(1, b));

    return 0;
}

ISO C code

And here is the equivalent C code:

#include <stdio.h>

int summing_squared_evens(int a, int b)
{
    int sum = 0;
    for (int i = a; i < b; i++)
    {
        if (i % 2 == 0)
        {
            int square = i * i;
            sum += square;
        }
    }

    return sum;
}

int main()
{
    int b;
    if (scanf("%d", &b) != 1) return 1;
    printf("%d\n", summing_squared_evens(1, b));

    return 0;
}

A simple look at the generated assembly will show you that the STC code is not vectorized, while the ISO C code is (tried with GCC 13.2).

Benchmark results (without the feature)

Here are the benchmarks:

Benchmark 1: ./nochain < in
  Time (mean ± σ):     430.0 ms ±   2.3 ms    [User: 430.0 ms, System: 0.0 ms]
  Range (min … max):   426.0 ms … 433.8 ms    10 runs

Benchmark 1: ./chain < in
  Time (mean ± σ):     999.6 ms ±  22.1 ms    [User: 999.6 ms, System: 0.0 ms]
  Range (min … max):   975.3 ms … 1036.1 ms    10 runs

You can cast the values by doing c_flt_map((int)*value * (int)*value), and it does vectorize, but it's still significantly slower than the ISO C, and probably too tricky of an an optimization.

Change

However, by applying this change:

diff --git a/include/stc/algo/crange.h b/include/stc/algo/crange.h
index 7343709a..262f8e4f 100644
--- a/include/stc/algo/crange.h
+++ b/include/stc/algo/crange.h
@@ -48,7 +48,7 @@ int main(void)
 #include "../priv/linkage.h"
 #include "../common.h"

-typedef intptr_t crange_value;
+typedef int crange_value;
 typedef struct { crange_value start, end, step, value; } crange;
 typedef struct { crange_value *ref, end, step; } crange_iter;

We get much better code generation.

Benchmark with the modification

Benchmark 1: ./chain < in
  Time (mean ± σ):     404.1 ms ±   4.7 ms    [User: 404.1 ms, System: 0.0 ms]
  Range (min … max):   398.9 ms … 413.0 ms    10 runs
SylvanBrocard commented 9 months ago

Forgot to specify: the benchmarks are run with an input value of INT32_MAX (2147483647).

tylov commented 2 months ago

I haven't tested it, but I'm surprised by the difference on a 64 bit platform. I've added a cirange type that uses int instead of ptrdiff_t in v50dev branch (for now, will do some test myself). The regular crange need to work with intervals outside +-2³¹ on 64-bit platforms.