ta0kira / zeolite

Zeolite is a statically-typed, general-purpose programming language.
Apache License 2.0

Use object pools instead of `vector` in `ReturnTuple`, `ArgTuple`, and `ParamTuple`. #172

Closed · ta0kira closed this issue 3 years ago

ta0kira commented 3 years ago

This will drastically cut down on dynamic allocation, since these are used for every function call.

ta0kira commented 3 years ago

Initial tests look good when creating the array in place in a dynamically-allocated `unsigned char[]`: no memory leaks under valgrind. I haven't gotten into pool-managed memory yet, though, and it's still possible for this to drastically increase compile times due to templates.

A possible low-cost prototyping option:

  1. Set a lower limit on array sizes, e.g., 2.
  2. Set a max limit for reused arrays, e.g., 1000.
  3. Add dynamically-allocated arrays (of one specific size) to a cache list up to the max limit.
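The three steps above might be sketched like this (all names are hypothetical, and plain `int` elements stand in for the real tuple contents); arrays of one specific size are recycled through a bounded per-size cache instead of being freed:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kMinPooledSize  = 2;     // step 1: lower size limit
constexpr std::size_t kMaxCachedArrays = 1000;  // step 2: cap on reused arrays

// Hypothetical sketch: one cache per array size S (step 3).
template <std::size_t S>
class ArrayCache {
 public:
  static int* Acquire() {
    if (S < kMinPooledSize || cache_.empty()) {
      return new int[S];  // pool miss (or size too small): plain allocation
    }
    int* array = cache_.back();
    cache_.pop_back();
    return array;
  }

  static void Release(int* array) {
    if (S < kMinPooledSize || cache_.size() >= kMaxCachedArrays) {
      delete[] array;  // cache is full (or size below the limit): free it
    } else {
      cache_.push_back(array);  // recycle for a later call
    }
  }

  static std::size_t CachedCount() { return cache_.size(); }

 private:
  inline static std::vector<int*> cache_;
};
```

Because every function call goes through these tuples, the common path becomes a `pop_back`/`push_back` pair instead of `new[]`/`delete[]`.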

ta0kira commented 3 years ago

Looks like `thread_local` only mostly works: the pool needs per-thread cleanup so that a thread's cache doesn't leak when the thread exits, but there seems to be a race condition with destructor-based `thread_local` cleanup.
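For context, destructor-based `thread_local` cleanup might look like the sketch below (hypothetical names, single fixed-size cache). Each thread owns its pool, so `Acquire`/`Release` need no locking; the hazard described above is that any pool use during thread shutdown, after this destructor has already run, can leak entries or touch a dead cache.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread pool; the destructor drains the cache on thread
// exit. This sketch ignores size bucketing: it assumes all arrays passed to
// Release were acquired with the same size.
class ThreadLocalPool {
 public:
  ~ThreadLocalPool() {
    // Destructor-based cleanup: free whatever is still cached.
    for (int* array : cache_) delete[] array;
    cache_.clear();
  }

  int* Acquire(std::size_t size) {
    if (cache_.empty()) return new int[size];
    int* array = cache_.back();
    cache_.pop_back();
    return array;
  }

  void Release(int* array) { cache_.push_back(array); }

  std::size_t CachedCount() const { return cache_.size(); }

 private:
  std::vector<int*> cache_;
};

// One pool per thread, destroyed when the thread exits.
thread_local ThreadLocalPool local_pool;
```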

ta0kira commented 3 years ago

I did some informal testing with PrimesDemo and saw roughly a 6.5% performance improvement (based on the number of primes computed in 0.1s).

ta0kira commented 3 years ago

Using a `std::atomic_flag` as a spinlock (39dd338c826bde4f4eb8e229ed3666321151e208) was a massive improvement over the initial version, which spin-locked on a pointer.

ta0kira commented 3 years ago

Test program.

```
concrete SortingTest {
  @type run () -> ()
}

define SortingTest {
  run () {
    Vector<Int> values <- Vector:create<Int>()
    traverse (Counter.zeroIndexed(100000) -> Int i) {
      \ values.push((i*1000000009)%10000)
    }
    \ Sorting:sort<?>(values)
  }
}
```

Results for 100 samples each.

Improvement is computed as 1 - (new median) / (baseline median), i.e., the reduction as a percentage of the baseline time.

| Change | Mean (s) | Median (s) | Std. Dev. (s) | Improvement |
| --- | --- | --- | --- | --- |
| 0.17.0.0 with tuple-cache limits set to 0 | 2.971 | 2.970 | 0.022 | (baseline) |
| 0.17.0.0 | 2.671 | 2.670 | 0.021 | 10.1% |
| 0.17.0.0 + #177 | 2.041 | 2.030 | 0.026 | 31.6% |
| 0.17.0.0 + #177 + atomic_flag | 1.647 | 1.650 | 0.010 | 44.4% |

Since tuple-caching (this issue) and value-boxing are (mostly) independent, I think we can attribute a 21.5% (31.6% - 10.1%) improvement to value-boxing* and a 22.9% (44.4% - 21.5%) improvement to tuple-caching.

(I'm disregarding the interaction of both competing for the allocator in the baseline case.)