novika-lang / novika

🪱 Novika is a free-form, moldable, interpreted programming language
https://novika-lang.github.io/
MIT License

Rolling out our own `Tape` #28

Open homonoidian opened 2 years ago

homonoidian commented 2 years ago

Maybe rolling out our own custom Tape implementation (i.e., not using an Array underneath) would be beneficial in terms of performance, because we use an extremely small subset of Array's features, where e.g. the index is always guaranteed to be in bounds (at least for the tape). Plus, there are often a lot of 1-element insertions, which means we'd want a heuristic to allocate more up front (I think that would make sense). Memory is cheap these days, electricity isn't :)

This should also reduce the number of objects we create. Currently there are 3 objects per block at worst (Block itself -> RealSubstrate -> Array); if no copy is made, then thanks to CoW we have only Block. Having Block -> (copy-on-write Tape < Value) -> Pointer (somehow!) may be a bit better.

These are just my speculations & a bit of a reminder to myself, so don't think I'm too serious with it.
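For clarity, here's what the Tape API used in the benchmarks below looks like as a naive, array-backed model. This is a hypothetical Ruby sketch, not the Novika implementation; the names `add`/`to?`/`drop?`/`count` mirror the ones used in this thread. The key point is that a tape is a buffer plus a cursor, and every operation returns a new tape, which is what makes copy-on-write sharing possible.

```ruby
# Hypothetical, simplified model of the Tape interface discussed in
# this thread. A tape is an array plus a cursor; each operation
# returns a new Tape (a real implementation would share the buffer
# between copies instead of duplicating it).
class Tape
  attr_reader :cursor

  def initialize(items = [], cursor = items.size)
    @items = items
    @cursor = cursor
  end

  def count
    @items.size
  end

  # Insert an element at the cursor; the cursor advances past it.
  def add(element)
    items = @items.dup
    items.insert(@cursor, element)
    Tape.new(items, @cursor + 1)
  end

  # Move the cursor, or return nil if the index is out of bounds.
  def to?(index)
    return nil unless (0..count).cover?(index)
    Tape.new(@items, index)
  end

  # Remove the element before the cursor, or return nil at the start.
  def drop?
    return nil if @cursor.zero?
    items = @items.dup
    items.delete_at(@cursor - 1)
    Tape.new(items, @cursor - 1)
  end

  def to_a
    @items.dup
  end
end

t = Tape.new.add(1).add(2).add(3)
t = t.to?(1).add(10) # insert at cursor position 1
p t.to_a             # => [1, 10, 2, 3]
```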

homonoidian commented 2 years ago

Spoiler alert ...

require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do
    t = Tape(Int32).new

    1_000_000.times do |x|
      t = t.add(x)
    end

    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end

    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end

    t = t.map! do |i|
      i + 1 if i.even?
    end

    t = t.to?(t.count).not_nil!

    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array") do
    a = [] of Int32
    ac = 0

    1_000_000.times do |x|
      a << x
      ac += 1
    end

    1_000_000.times do |x|
      ac = x
    end

    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end

    a.map! do |i|
      i.even? ? i + 1 : i
    end

    ac = a.size

    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
 tape  24.45  ( 40.91ms) (± 1.00%)  17.4MB/op         fastest
array  47.82m ( 20.91s ) (± 0.00%)  27.6MB/op  511.25× slower
homonoidian commented 2 years ago

I guess Novika's tape is going to be extremely fast after all.

homonoidian commented 2 years ago

Given these amazing results with linear insertion, it still struggles with non-linear insertion, because it uses two buffers (one before & one after the cursor) unless the cursor is at the end. It is 17× slower than Array right now, because it reallocs/copies the two huge buffers a lot. Using a heuristic to detect semi-random insertion and switching to a contiguous buffer would be beneficial. Detecting non-linear cursor movement via some sort of cursor-neighborhood heuristic may help. Alternatively, reducing allocation after cursor movement may help.
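The two-buffer layout described above is essentially a gap buffer. A Ruby sketch for illustration (mutable and simplified, unlike the persistent Tape): insertion at the cursor is an O(1) push onto the left buffer, while moving the cursor by k positions shuffles k elements between the two buffers, which is exactly why random cursor jumps hurt.

```ruby
# Gap-buffer model of the two-buffer tape: `left` holds everything
# before the cursor, `right` everything after it (stored reversed,
# so the element right after the cursor is at the end of `right`).
class GapTape
  def initialize
    @left = []   # elements before the cursor
    @right = []  # elements after the cursor, in reverse order
  end

  # Insert at the cursor: O(1), nothing after the cursor shifts.
  def add(element)
    @left << element
    self
  end

  # Move the cursor to `index`: O(distance moved), elements are
  # transferred between the two buffers one by one.
  def to(index)
    @left << @right.pop while @left.size < index && !@right.empty?
    @right << @left.pop while @left.size > index
    self
  end

  def to_a
    @left + @right.reverse
  end
end
```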

require "benchmark"

Benchmark.ips do |x|
  probe = (0..1_000_000).sample(10_000)

  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end

    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end

    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end
end
random insertion tape  80.19m ( 12.47s ) (± 0.00%)  56.0GB/op  17.57× slower
 random insertion ary   1.41  (709.63ms) (± 2.31%)  22.0MB/op        fastest
homonoidian commented 2 years ago

We would also like to retain these two buffers for as long as possible, because then Tape#slice has no cost.

UPD: if integrated with CoW, of course.

homonoidian commented 2 years ago

After cursor movement & a tape modification, we can concatenate the two buffers (plus one extra slot), add the element, and then figure out arithmetically where the first & second buffers start. Realloc may not be possible, though.
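Roughly, the arithmetic could look like this (a hypothetical Ruby sketch; `concat_insert` is an illustrative helper, not an actual Tape method): build one contiguous buffer with the element inserted, then derive the new split point from the cursor position instead of reallocating both halves separately.

```ruby
# Given the two halves of a tape, the cursor position in the joined
# buffer, and an element to insert, produce the new halves: one
# concatenation, one insertion, and the split point falls out of
# the cursor arithmetic.
def concat_insert(left, right, cursor, element)
  whole = left + right           # one contiguous buffer
  whole.insert(cursor, element)  # single insertion, single shift
  # Everything up to and including the inserted element becomes the
  # new "before cursor" half; the rest is the "after cursor" half.
  [whole[0..cursor], whole[(cursor + 1)..]]
end

left, right = concat_insert([1, 2], [3, 4], 1, 99)
# left == [1, 99], right == [2, 3, 4]
```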

homonoidian commented 2 years ago

As nothing is free in this world, we'll have to make some compromises (depending, of course, on whether the new Tape implementation speeds Novika up sufficiently).

Note that at certain points, the code below makes tens if not hundreds of millions of iterations (and even more reallocations). Also note that the memory output seems to be broken, or maybe it just doesn't decrease when garbage is collected. I may have other ideas; these are not the final results. It would be better, though, to plug this implementation of Tape into Novika some time later, to see how much it improves the situation.

require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do # Does ~1000000000000 iterations, each making ~8 ptrs of N, N' < N
    1_000_000.times do
      t = Tape(Int32).new
      1000.times do |x|
        t = t.add(x)
      end
      t
        .to?(4).not_nil!
        .add(5).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(t.count - 3).not_nil!
        .add(680).not_nil!
        .to?(0).not_nil!
        .add(100).not_nil!
        .add(100).not_nil!
        .drop?.not_nil!
        .to?(t.count).not_nil!
        .add(1000).not_nil!
      t.count.times { t = t.drop?.not_nil! }
    end
  end
  x.report("array") do
    1_000_000.times do
      ary = [] of Int32
      cur = 0
      1_000.times do |i|
        ary.insert(cur, i)
        cur += 1
      end

      cur = 4
      ary.insert(cur, 5)
      cur += 1

      cur = 2

      ary.insert(cur, 100)
      cur += 1

      ary.insert(cur, 200)
      cur += 1

      ary.insert(cur, 300)
      cur += 1

      cur = 2

      ary.insert(cur, 100)
      cur += 1

      ary.insert(cur, 200)
      cur += 1

      ary.insert(cur, 300)
      cur += 1

      cur = ary.size - 3
      ary.insert(cur, 680)
      cur += 1

      cur = 0
      ary.insert(cur, 800)
      cur += 1

      ary.insert(cur, 100)
      cur += 1

      ary.delete_at(cur - 1)
      cur -= 1

      cur = ary.size
      ary.insert(cur, 1000)
      cur += 1

      ary.size.times { ary.delete_at(cur - 1); cur -= 1 }
    end
  end

  probe = (0..1_000_000).sample(100_000)
  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end

    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end

    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end

  x.report("tape test") do
    t = Tape(Int32).new

    1_000_000.times do |x|
      t = t.add(x)
    end

    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end

    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end

    t = t.map! do |i|
      i + 1 if i.even?
    end

    t = t.to?(t.count).not_nil!

    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array test") do
    a = [] of Int32
    ac = 0

    1_000_000.times do |x|
      a << x
      ac += 1
    end

    1_000_000.times do |x|
      ac = x
    end

    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end

    a.map! do |i|
      i.even? ? i + 1 : i
    end

    ac = a.size

    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
                 tape  65.99m ( 15.15s ) (± 0.00%)  38.9GB/op           fastest
                array  59.73m ( 16.74s ) (± 0.00%)   9.9GB/op     1.10×  slower
random insertion tape   5.53m (180.92s ) (± 0.00%)   778GB/op  7775.89×  slower
 random insertion ary 125.91m (  7.94s ) (± 0.00%)  27.6MB/op   341.35×  slower
            tape test  42.98  ( 23.27ms) (± 0.68%)  25.0MB/op           fastest
           array test  44.63m ( 22.41s ) (± 0.00%)  27.6MB/op   963.12×  slower
homonoidian commented 1 year ago

Another approach would be to have a buffer and a claims array.

A claim is a range. Whenever a copy of a substrate is made, a new claim is added to the claims array (or the refcount of an existing claim is incremented, if there is already a claim with the same begin & end).

A claim is basically the bounds of a particular substrate. Upon a mutation, a claim conflict check is performed (how this is done depends on the kind of mutation, e.g., insert/push/map/etc.)

If a claim conflict is detected, the refcount of the current claim is decremented; if its refcount reaches 0, it is removed from the claims array. Then a copy of the buffer is made, with the current claim attached. The buffer could (and probably should) be copied according to the bounds of the claim, plus or minus a power of two or something, rather than in its entirety.

This allows inserting and dropping in refs without copying in some cases -- that is, out of any number of same-bound copies, whoever pushes first wins in performance over everyone else. If you have two generations of substrates, A and its copy a, and they point to the same buffer, then N pushes to a are free (copy-less), and the N drops are free for a as well.

However, drops that conflict with A's claim (e.g. a drop when A = a) or inserts that conflict with A's claim (e.g. an insert in the middle of a) make a personal copy of the buffer for a.

Note that A and a could be swapped in the explanation above.
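Here's one way the claim bookkeeping could look, as a hypothetical Ruby sketch (push only; refcounting and claim removal omitted for brevity, and all names are mine, not Novika's): a push that no other claim can observe mutates the shared buffer in place (the "first pusher wins" case), while a conflicting push produces a private copy scoped to the claim's bounds.

```ruby
# A claim records which slice of the shared buffer a substrate owns.
Claim = Struct.new(:begin, :end)

class ClaimedBuffer
  def initialize(buffer)
    @buffer = buffer
    @claims = []
  end

  def claim(b, e)
    c = Claim.new(b, e)
    @claims << c
    c
  end

  # Push `element` just past `claim`'s end. If no other claim reaches
  # past that point, nobody can observe the write, so append in place.
  # Otherwise, make a private copy scoped to this claim's bounds.
  def push(claim, element)
    if @claims.any? { |c| !c.equal?(claim) && c.end > claim.end }
      copy = ClaimedBuffer.new(@buffer[claim.begin...claim.end] + [element])
      [copy, copy.claim(0, claim.end - claim.begin + 1)]
    else
      @buffer[claim.end] = element
      claim.end += 1
      [self, claim]
    end
  end

  # What the holder of `claim` sees.
  def view(claim)
    @buffer[claim.begin...claim.end]
  end
end
```

Note how this reproduces the "first pusher wins" behavior: after A pushes in place, A's claim extends past a's, so a later push through a's claim conflicts and pays for a copy.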

I don't know how practical all of this would be for Novika in particular, though. Even with the most naive implementation I've tried writing, it seems to pay off only for large and very large arrays, the current implementation degrading more and more, up to the point where the one I'm describing here is 20× or so faster. The benchmarks I've done are disputable, though, because I'm carefully designing them to hit the "sweet spot" rather than do the dirty, massively repetitive work of everyday Novika code.

It should also be noted that the average size of a Novika substrate is ~3.56 elements (running the tests); 2793793 substrates are made. Small and heavily copied substrates rule in the Novika universe. Doing e.g. 100 enclose results in [] 100 <<, which does make a copy, but not for the most obvious reason. The fact is that the block is created with an empty tape; a copy is made due to the block open of enclose (which copies the entire definition of enclose, including []), leaving the refsubstrate for []; and then a mutation is introduced, hence copying the empty substrate and inserting 100 into the copy. The approach I've described here seems to prevent a copy in this case, because it is completely unnecessary as no conflicts with other claims exist. ALL OF THIS IS DISPUTABLE THOUGH AND I'D NEED TO GIVE THIS A BIG LONG THOUGHT.

Also, Idk how easy it'd be to realloc etc.

Also, Idk how fast this'd be on real Novika code.

Also, I think there cannot exist a buffer with no claims. This is simply impossible, or is an invalid state. Most often, buffers have exactly one claim. We could introduce a static array of claims with some claim threshold, after which new claims make a copy and attach to the copy, but this needs a live implementation where stuff like this could be tested & magic numbers found, not the simplistic speculation I'm doing here and now. What I see most clearly is that we don't need to create a claims array initially -- only when more than one claim is introduced. Otherwise, we'll simply store the "master claim" or something like that and wait until new claims arrive.

A lower-level optimization would be to split the buffer into chunks and copy only those, but this would require some mind-boggling pointer jugglery and linked-listery, which does not necessarily result in good performance. Another idea is to store the changes (mutations) themselves rather than make a full copy and apply them to it: e.g. an "invalid range" (from the viewpoint of a particular buffer) and a "replacement sub-buffer", used when the operation index falls in the invalid range (this is kind of a mirror-side version of the linked-listery I was just talking about). This paragraph is about some very complex stuff, though, and maybe I'm trying to be smart here more than practical...
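The "store the mutation itself" idea could look roughly like this, as a hypothetical Ruby sketch (`PatchedView` is an illustrative name of mine): a view over a shared base buffer with one invalid range and a replacement sub-buffer. Reads that fall inside the range are redirected to the replacement; everything else goes to the shared base, so no copy of the base is ever made.

```ruby
# A read-only view that overlays one mutation (replace `invalid`
# with `replacement`) on a shared base buffer, without copying it.
class PatchedView
  def initialize(base, invalid, replacement)
    @base = base                # shared buffer, never copied
    @invalid = invalid          # Range in @base this view overrides
    @replacement = replacement  # what this view sees instead
  end

  def [](index)
    if index < @invalid.begin
      @base[index]              # before the patch: read through
    elsif index < @invalid.begin + @replacement.size
      @replacement[index - @invalid.begin]  # inside the patch
    else
      # After the patch: skip the invalid range, accounting for the
      # size difference between it and the replacement.
      @base[index - @replacement.size + @invalid.size]
    end
  end

  def size
    @base.size - @invalid.size + @replacement.size
  end
end

# Replace elements 1...3 of the base with a single 9:
v = PatchedView.new([1, 2, 3, 4, 5], 1...3, [9])
p (0...v.size).map { |i| v[i] } # => [1, 9, 4, 5]
```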

What needs to be understood is that reducing the number of block copies is one of the most important pillars of optimizing Novika. The less space is used, the faster mallocs and reallocs are. The smaller the blobs of stuff are, the faster the GC can look at them and free them. The less copying is done, the less time is spent on copying!

homonoidian commented 1 year ago

An illustration of the second-to-last paragraph above.

Again, I have no idea whether this will work and how much of an effect it would result in. Maybe it won't work. I. Don't. Know.
