Open homonoidian opened 2 years ago
Spoiler alert ...
```crystal
require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end
    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end
    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end
    t = t.map! do |i|
      i + 1 if i.even?
    end
    t = t.to?(t.count).not_nil!
    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array") do
    a = [] of Int32
    ac = 0
    1_000_000.times do |x|
      a << x
      ac += 1
    end
    1_000_000.times do |x|
      ac = x
    end
    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end
    a.map! do |i|
      i.even? ? i + 1 : i
    end
    ac = a.size
    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
```
```
 tape  24.45  ( 40.91ms) (± 1.00%)  17.4MB/op        fastest
array  47.82m ( 20.91s ) (± 0.00%)  27.6MB/op  511.25× slower
```
I guess Novika's tape is going to be extremely fast after all.

Despite these amazing results with linear insertion, it still struggles with non-linear insertion, because it uses two buffers (one before and one after the cursor) unless the cursor is at the end. It is 17× slower than an array right now, because it realloc's/copies the two huge buffers a lot. Using a heuristic to detect semi-random insertion and switching to a contiguous buffer would be beneficial. Detecting non-linear cursor movement via some sort of cursor-neighborhood heuristic may help. Alternatively, reducing allocation after cursor movement may help.
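For context, the two-buffer layout can be sketched like this. This is a mutable toy, not Novika's actual (persistent) `Tape`; I'm only borrowing the `add`/`to?`/`count` names from the benchmarks. It shows where the cost lives: cursor movement shuffles elements between the halves, so linear movement is cheap while random jumps pay O(distance) each (in the real thing, reallocs/copies of the two buffers).

```crystal
# Toy two-buffer ("gap buffer") tape: elements before the cursor live
# in `lhs`, elements after it in `rhs` (stored reversed).
class GapTape(T)
  def initialize
    @lhs = [] of T # elements before the cursor
    @rhs = [] of T # elements after the cursor, stored reversed
  end

  def count
    @lhs.size + @rhs.size
  end

  def cursor
    @lhs.size
  end

  # Inserting at the cursor is cheap: append to the left half.
  def add(element : T)
    @lhs << element
    self
  end

  # Moving the cursor transfers elements between the halves: O(distance)
  # per move. Linear movement is cheap; random jumps are what hurts.
  def to?(index)
    return nil unless 0 <= index && index <= count
    @lhs << @rhs.pop while cursor < index
    @rhs << @lhs.pop while cursor > index
    self
  end
end

tape = GapTape(Int32).new
10.times { |i| tape.add(i) }
tape.to?(5).not_nil!.add(100)
puts tape.count  # => 11
puts tape.cursor # => 6
```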
```crystal
require "benchmark"

Benchmark.ips do |x|
  probe = (0..1_000_000).sample(10_000)

  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end
    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end
    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end
end
```
```
random insertion tape  80.19m ( 12.47s ) (± 0.00%)  56.0GB/op  17.57× slower
random insertion ary    1.41  (709.63ms) (± 2.31%)  22.0MB/op        fastest
```
We would also like to retain these two buffers for as long as possible, because then `Tape#slice` has no cost.

UPD: if integrated with CoW, of course.
After cursor movement and a tape modification, we can concatenate the two buffers plus one extra slot, add the element, and then figure out arithmetically where the first and second buffers start. Realloc may not be possible.
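That idea can be illustrated with raw pointers (a sketch; the names are mine, and capacity/realloc concerns are ignored):

```crystal
# Sketch: merge the pre-cursor and post-cursor halves into one
# contiguous buffer, writing the new element in between. The new
# cursor position falls out of the arithmetic: lhs.size + 1.
def concat_and_add(lhs : Array(Int32), rhs : Array(Int32), element : Int32)
  size = lhs.size + rhs.size + 1
  buffer = Pointer(Int32).malloc(size)
  buffer.copy_from(lhs.to_unsafe, lhs.size)                  # first half
  buffer[lhs.size] = element                                 # inserted element
  (buffer + lhs.size + 1).copy_from(rhs.to_unsafe, rhs.size) # second half
  {buffer, lhs.size + 1, size}                               # {buffer, cursor, count}
end

buffer, cursor, size = concat_and_add([1, 2], [4, 5], 3)
puts Slice.new(buffer, size).to_a # => [1, 2, 3, 4, 5]
puts cursor                       # => 3
```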
As nothing is free in this world, we'll have to make some compromises (depending, of course, on whether the new implementation of Tape increases the performance of Novika sufficiently).

Mutating `ahead` is expensive in terms of computation and memory unless you are (a) writing to the end, or (b) have set a single write point and begun writing, i.e., without changing the cursor during the process of writing. One compromise is making `eject` and `inject` native words and implementing fast versions of them, as both are the most often used words that mutate `ahead`, and `ahead` is the most random-point-mutated entity in Novika.

Note that at certain points, the code below makes tens if not hundreds of millions of iterations (and even more reallocations). Note: memory output seems to be broken, or maybe it just doesn't decrease when garbage is collected. Note: I may have other ideas; these are not the final results. It would be better, though, to plug this implementation of Tape into Novika some time later, to see how much it improves the situation.
```crystal
require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do # Does ~1000000000000 iterations, each making ~8 ptrs of N, N' < N
    1_000_000.times do
      t = Tape(Int32).new
      1000.times do |x|
        t = t.add(x)
      end
      t
        .to?(4).not_nil!
        .add(5).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(t.count - 3).not_nil!
        .add(680).not_nil!
        .to?(0).not_nil!
        .add(100).not_nil!
        .add(100).not_nil!
        .drop?.not_nil!
        .to?(t.count).not_nil!
        .add(1000).not_nil!
      t.count.times { t = t.drop?.not_nil! }
    end
  end

  x.report("array") do
    1_000_000.times do
      ary = [] of Int32
      cur = 0
      1_000.times do |i|
        ary.insert(cur, i)
        cur += 1
      end
      cur = 4
      ary.insert(cur, 5)
      cur += 1
      cur = 2
      ary.insert(cur, 100)
      cur += 1
      ary.insert(cur, 200)
      cur += 1
      ary.insert(cur, 300)
      cur += 1
      cur = 2
      ary.insert(cur, 100)
      cur += 1
      ary.insert(cur, 200)
      cur += 1
      ary.insert(cur, 300)
      cur += 1
      cur = ary.size - 3
      ary.insert(cur, 680)
      cur += 1
      cur = 0
      ary.insert(cur, 800)
      cur += 1
      ary.insert(cur, 100)
      cur += 1
      ary.delete_at(cur - 1)
      cur -= 1
      cur = ary.size
      ary.insert(cur, 1000)
      cur += 1
      ary.size.times { ary.delete_at(cur - 1); cur -= 1 }
    end
  end

  probe = (0..1_000_000).sample(100_000)

  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end
    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end
    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end

  x.report("tape test") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end
    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end
    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end
    t = t.map! do |i|
      i + 1 if i.even?
    end
    t = t.to?(t.count).not_nil!
    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array test") do
    a = [] of Int32
    ac = 0
    1_000_000.times do |x|
      a << x
      ac += 1
    end
    1_000_000.times do |x|
      ac = x
    end
    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end
    a.map! do |i|
      i.even? ? i + 1 : i
    end
    ac = a.size
    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
```
```
                 tape   65.99m ( 15.15s ) (± 0.00%)  38.9GB/op          fastest
                array   59.73m ( 16.74s ) (± 0.00%)   9.9GB/op     1.10× slower
random insertion tape    5.53m (180.92s ) (± 0.00%)   778GB/op  7775.89× slower
 random insertion ary  125.91m (  7.94s ) (± 0.00%)  27.6MB/op   341.35× slower
            tape test   42.98  ( 23.27ms) (± 0.68%)  25.0MB/op          fastest
           array test   44.63m ( 22.41s ) (± 0.00%)  27.6MB/op   963.12× slower
```
Another approach would be to have a buffer and a claims array.
A claim is a range. Whenever a copy of a substrate is made, a new claim is added to the claims array (or the refcount of an existing claim is incremented, if there is already a claim with the same begin & end).

A claim is basically the bounds of a particular substrate. Upon a mutation, a claim conflict is checked for (how this is done depends on the kind of mutation, e.g. insert/push/map/etc.).

If a claim conflict is detected, the refcount of the current claim is decremented; if its refcount reaches 0, it is removed from the claims array. Then a copy of the buffer is made, with the current claim attached. The buffer could (and probably should) be copied according to the bounds of the claim ± a power of 2 or something, rather than in its entirety.
This allows inserting and dropping in refs without copying in some cases -- that is, out of any number of same-bound copies, the one that is pushed to first wins in performance over everyone else. If you have two generations of substrates, `A` and its copy `a`, and they point to the same buffer, then N pushes to `a` are free (copy-less), and the N drops are free as well for `a`.

However, drops that conflict with `A`'s claim (e.g. a drop when `A` = `a`) or inserts that conflict with `A`'s claim (e.g. an insert in the middle of `a`) make a personal copy of the buffer for `a`.

Note that `A` and `a` could be swapped in the explanation above.
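A toy rendering of the claim mechanism (all names are mine; a real version would track full begin/end bounds, handle drops and inserts, check conflicts against the claims array, and remove claims whose refcount hits 0 -- here I only model refcounted sharing plus the "first pusher wins" case):

```crystal
class Claim
  property last : Int32         # index of the last element this claim covers
  property refcount : Int32 = 1

  def initialize(@last); end
end

class Substrate
  getter buffer : Array(Int32)
  getter claim : Claim

  def initialize(@buffer, @claim); end

  # A same-bounds copy shares the buffer and bumps the claim's
  # refcount; no buffer copy is made.
  def copy
    @claim.refcount += 1
    Substrate.new(@buffer, @claim)
  end

  # Give ourselves a personal claim if the current one is shared.
  private def own_claim
    if @claim.refcount > 1
      @claim.refcount -= 1
      @claim = Claim.new(@claim.last)
    end
  end

  def push(element)
    own_claim
    write_at = @claim.last + 1
    if write_at == @buffer.size
      @buffer << element                # slot free: copy-less push
    else
      @buffer = @buffer[0..@claim.last] # conflict: detach into a personal copy
      @buffer << element
    end
    @claim.last += 1
    self
  end

  # What this substrate sees: the buffer up to its claimed bounds.
  def to_a
    @buffer[0..@claim.last]
  end
end

a = Substrate.new([1, 2, 3], Claim.new(2))
b = a.copy # shares the buffer; the claim's refcount goes to 2
b.push(4)  # first pusher wins: appends to the shared buffer, copy-less
a.push(5)  # slot 3 is taken now, so `a` detaches into its own copy
puts a.to_a # => [1, 2, 3, 5]
puts b.to_a # => [1, 2, 3, 4]
```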
I don't know how practical all of this would be for Novika in particular, though. Even with the most naive implementation I've tried writing, it seems to pay off only with large and very large arrays, where the current implementation degrades more and more, up to the point where the one I'm describing here is 20× or so faster. The benchmarks I've done are disputable, though, because I'm carefully designing them to hit the "sweet spot" rather than do the dirty, massively repetitive work of everyday Novika code.
It should also be noted that the average size of a Novika substrate is ~3.56 elements (running the tests); 2793793 substrates are made. Small and heavily copied substrates rule in the Novika universe. Doing e.g. `100 enclose` results in `[] 100 <<`, which does make a copy, but not for the most obvious reason. The fact is that it is created with an empty tape; a copy is made due to the block open of `enclose` (which copies the entire definition of `enclose`, including `[]`), leaving the refsubstrate for `[]`; and then a mutation is introduced, henceforth copying the empty substrate and inserting 100 into the copy. The approach I've described here seems to prevent a copy in this case, because it is completely unnecessary: no conflicts with other claims exist. ALL OF THIS IS DISPUTABLE, THOUGH, AND I'D NEED TO GIVE THIS A BIG LONG THOUGHT.
Also, I don't know how easy it'd be to realloc, etc.
Also, I don't know how fast this'd be on real Novika code.
Also, I think there cannot exist a buffer with no claims; that is simply an impossible or invalid state. Most often, buffers have exactly one claim. We could introduce a static array of claims with some claim threshold, after which new claims make a copy and attach to the copy, but this needs a live implementation where stuff like this can be tested and the magic numbers found, not the simplistic speculation I'm doing here and now. What I see most clearly is that we don't need to create a claims array initially -- only when more than one claim is introduced. Otherwise, we'll simply store the "master claim" or something like that and wait until new claims arrive.
A lower-level optimization would be to split the buffer into chunks and copy only those, but this would require some mind-boggling pointer jugglery and linked-listery, which does not necessarily result in good performance. Another idea is to store the changes (mutations) themselves rather than make a full copy and apply them to it: e.g., an "invalid range" (from the viewpoint of a particular buffer) and a "replacement sub-buffer", used when the operation index falls into the "invalid range" (this is kind of a mirror-side version of the linked-listery I was just talking about). This paragraph is about some very complex stuff, though, and maybe I'm trying to be smart here more than practical...
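The "store the mutations themselves" variant might look roughly like this (purely illustrative, my own names): reads are redirected through one patch -- an "invalid range" of the shared base plus its replacement sub-buffer -- so the base is never copied or touched.

```crystal
class PatchedView
  # `invalid` is a range of *base* indices replaced by `patch`.
  def initialize(@base : Array(Int32), @invalid : Range(Int32, Int32), @patch : Array(Int32))
  end

  def size
    @base.size - @invalid.size + @patch.size
  end

  def [](index : Int32)
    if index < @invalid.begin
      @base[index]                               # before the patch
    elsif index < @invalid.begin + @patch.size
      @patch[index - @invalid.begin]             # inside the patch
    else
      @base[index - @patch.size + @invalid.size] # after: shift by the size delta
    end
  end
end

base = [1, 2, 3, 4, 5]
# Replace base[1..2] (that is, [2, 3]) with [20, 30, 40], without copying base.
view = PatchedView.new(base, 1..2, [20, 30, 40])
puts view.size                           # => 6
puts (0...view.size).map { |i| view[i] } # => [1, 20, 30, 40, 4, 5]
```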
What needs to be understood is that reducing the amount of block copies is one of the most important pillars of optimizing Novika. The less space is used, the faster mallocs and reallocs are. The smaller the blobs of stuff are, the faster the GC can look at them and free them. The less copying is done, the less time is spent copying!
An illustration of the second-to-last paragraph above.
Again, I have no idea whether this will work and how much of an effect it would result in. Maybe it won't work. I. Don't. Know.
Maybe rolling out our own custom Tape implementation (i.e. not using an Array underneath) would be beneficial in terms of performance, because we use an extremely small subset of Array's features, where e.g. the index is always guaranteed to be in bounds (at least for the tape); plus, there are often a lot of 1-element insertions, which means we'd want a heuristic to allocate more initially (I think that would make sense). Memory is cheap these days; electricity isn't :)
This should also reduce the number of objects we create. Currently there are, at worst, 3 objects per block: Block itself -> RealSubstrate -> Array (if no copy is made, thanks to CoW, we have only Block). Having Block -> (copy-on-write Tape < Value) -> Pointer (somehow!) may be a bit better.
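A sketch of what "no Array underneath" could mean: a Tape that owns a raw pointer, over-allocates up front (the "allocate more initially" heuristic), and skips bounds checks since the tape guarantees indices are in bounds. All names here are hypothetical, and I use `struct` in the `Tape < Value` spirit:

```crystal
struct RawTape
  INITIAL_CAPACITY = 8 # allocate more than one element up front

  getter count : Int32

  def initialize
    @capacity = INITIAL_CAPACITY
    @count = 0
    @buffer = Pointer(Int32).malloc(@capacity)
  end

  def push(element : Int32)
    if @count == @capacity
      @capacity *= 2
      @buffer = @buffer.realloc(@capacity)
    end
    @buffer[@count] = element
    @count += 1
  end

  # No bounds check: the caller (the tape itself) guarantees
  # 0 <= index < count.
  def unsafe_at(index : Int32)
    @buffer[index]
  end
end

tape = RawTape.new
100.times { |i| tape.push(i) }
puts tape.count         # => 100
puts tape.unsafe_at(99) # => 99
```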
These are just my speculations & a bit of a reminder to myself, so don't think I'm too serious with it.