ta0kira / zeolite

Zeolite is a statically-typed, general-purpose programming language.
Apache License 2.0

Use object pools instead of `vector` in `ReturnTuple`, `ArgTuple`, and `ParamTuple`. #172

Closed · ta0kira closed this issue 3 years ago

ta0kira commented 3 years ago

This will drastically cut down on dynamic allocation, since these are used for every function call.

ta0kira commented 3 years ago

Initial tests look good when creating the array in place in a dynamically-allocated `unsigned char[]`: no memory leaks under valgrind. I haven't gotten into pool-managed memory yet, though, and it's still possible for this to drastically increase compile times due to templates.

A possible low-cost prototyping option:

  1. Set a lower limit on array sizes, e.g., 2.
  2. Set a max limit for reused arrays, e.g., 1000.
  3. Add dynamically-allocated arrays (of one specific size) to a cache list up to the max limit.
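The three steps above might be sketched like this (all names are hypothetical, and plain `int` elements stand in for the real tuple contents); arrays of one specific size are recycled through a bounded per-size cache instead of being freed:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kMinPooledSize  = 2;     // step 1: lower size limit
constexpr std::size_t kMaxCachedArrays = 1000;  // step 2: cap on reused arrays

// Hypothetical sketch: one cache per array size S (step 3).
template <std::size_t S>
class ArrayCache {
 public:
  static int* Acquire() {
    if (S < kMinPooledSize || cache_.empty()) {
      return new int[S];  // pool miss (or size too small): plain allocation
    }
    int* array = cache_.back();
    cache_.pop_back();
    return array;
  }

  static void Release(int* array) {
    if (S < kMinPooledSize || cache_.size() >= kMaxCachedArrays) {
      delete[] array;  // cache is full (or size below the limit): free it
    } else {
      cache_.push_back(array);  // recycle for a later call
    }
  }

  static std::size_t CachedCount() { return cache_.size(); }

 private:
  inline static std::vector<int*> cache_;
};
```

Because every function call goes through these tuples, the common path becomes a `pop_back`/`push_back` pair instead of `new[]`/`delete[]`.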

ta0kira commented 3 years ago

Looks like `thread_local` only mostly works: the pool needs per-thread cleanup so that a thread's cache doesn't leak when the thread exits, but there seems to be a race condition with destructor-based `thread_local` cleanup.
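For context, destructor-based `thread_local` cleanup might look like the sketch below (hypothetical names, single fixed-size cache). Each thread owns its pool, so `Acquire`/`Release` need no locking; the hazard described above is that any pool use during thread shutdown, after this destructor has already run, can leak entries or touch a dead cache.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread pool; the destructor drains the cache on thread
// exit. This sketch ignores size bucketing: it assumes all arrays passed to
// Release were acquired with the same size.
class ThreadLocalPool {
 public:
  ~ThreadLocalPool() {
    // Destructor-based cleanup: free whatever is still cached.
    for (int* array : cache_) delete[] array;
    cache_.clear();
  }

  int* Acquire(std::size_t size) {
    if (cache_.empty()) return new int[size];
    int* array = cache_.back();
    cache_.pop_back();
    return array;
  }

  void Release(int* array) { cache_.push_back(array); }

  std::size_t CachedCount() const { return cache_.size(); }

 private:
  std::vector<int*> cache_;
};

// One pool per thread, destroyed when the thread exits.
thread_local ThreadLocalPool local_pool;
```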

ta0kira commented 3 years ago

I did some informal testing with PrimesDemo and saw roughly a 6.5% performance improvement (based on the number of primes computed in 0.1s).

ta0kira commented 3 years ago

Using a `std::atomic_flag` as a spinlock (39dd338c826bde4f4eb8e229ed3666321151e208) was a massive improvement over the initial version, which spin-locked on a pointer.

ta0kira commented 3 years ago

Test program.

```
concrete SortingTest {
  @type run () -> ()
}

define SortingTest {
  run () {
    Vector<Int> values <- Vector:create<Int>()
    traverse (Counter.zeroIndexed(100000) -> Int i) {
      \ values.push((i*1000000009)%10000)
    }
    \ Sorting:sort<?>(values)
  }
}
```

Results for 100 samples each.

Improvement is computed as 1 - (new median) / (baseline median), i.e., the reduction as a percentage of the baseline time.

| Change | Mean (s) | Median (s) | Std. Dev. (s) | Improvement |
| --- | --- | --- | --- | --- |
| 0.17.0.0 with tuple-cache limits set to 0 | 2.971 | 2.970 | 0.022 | (baseline) |
| 0.17.0.0 | 2.671 | 2.670 | 0.021 | 10.1% |
| 0.17.0.0 + #177 | 2.041 | 2.030 | 0.026 | 31.6% |
| 0.17.0.0 + #177 + atomic_flag | 1.647 | 1.650 | 0.010 | 44.4% |

Since tuple-caching (this issue) and value-boxing are (mostly) independent, I think we can attribute a 21.5% (31.6% - 10.1%) improvement to value-boxing* and a 22.9% (44.4% - 21.5%) improvement to tuple-caching.

(I'm disregarding the interaction of both competing for the allocator in the baseline case.)