novika-lang / novika

🪱 Novika is a free-form, moldable, interpreted programming language
https://novika-lang.github.io/
MIT License

`Block#at?(form)` is slowing down the interpreter #14

Closed. homonoidian closed this issue 2 years ago.

homonoidian commented 2 years ago

Tests flamegraph

I don't know how much I should trust the flamegraph tool, but this looks like a huge `at?` problem to me. This particular flamegraph is of the novika core tests.

This one is of the playground, which maps and times a lot as far as I remember. Almost 20% of the samples are in `at?`.

Playground flamegraph

This one is of snake. Almost 1 GB of perf data, and the sucker is `at?` again: almost 21% of the samples.

Snake flamegraph

Now, `Entry` objects exist on the heap and are not replaced with anything, if I remember correctly. They are the actual things that hold the value forms, not the hashes; the hashes only provide an O(1) way to look up an `Entry`. So `at?` should be optimizable by caching, I believe. Of course there are some edge cases, but I think they are solvable, and the performance benefit would be great.

Benchmarking

require "benchmark"

class Entry
  def initialize(@value : String)
  end

  def set(@value)
  end

  def get
    @value
  end
end

# A: uncached — every lookup walks the parent chain.
class A
  property! parent : A?
  getter table = {} of String => Entry

  def initialize(@parent)
  end

  def at?(name)
    table[name]? || parent?.try &.at?(name)
  end

  def at(k, v : Entry)
    at?(k).try &.set(v.get) || (table[k] = v)
  end
end

# B: cached — lookup results are memoized per block.
class B
  property! parent : B?
  getter table = {} of String => Entry

  # Memoized lookups, including entries found in ancestors.
  @memo = {} of String => Entry

  def initialize(@parent)
  end

  def at?(name)
    @memo[name]? || (@memo[name] = parent?.try &.at?(name) || return)
  end

  def at(k, v : Entry)
    if e = at?(k)
      e.set(v.get)
    else
      table[k] = v
      @memo[k] = v
    end
  end
end

ra = A.new(nil)
ra.at("foo", Entry.new("hello world"))
1_000.times { ra = A.new(ra) }

rb = B.new(nil)
rb.at("foo", Entry.new("hello world"))
1_000.times { rb = B.new(rb) }

Benchmark.ips do |x|
  x.report("A at? foo") do
    ra.at?("foo")
    ra.at("foo", Entry.new("barrrr"))
  end
  x.report("B at? foo") do
    rb.at?("foo")
    rb.at("foo", Entry.new("barrrr"))
  end
end

... leads to these results on a Ryzen 3 2200G:

A at? foo 158.14k (  6.32µs) (± 1.38%)  32.0B/op  207.69× slower
B at? foo  32.84M ( 30.45ns) (±16.03%)  32.0B/op         fastest

Note that `Benchmark.ips` executes the report blocks repeatedly; that's expected, since we're measuring the effect caching has on repeated lookups.
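One of the edge cases alluded to earlier can be sketched like this (a hypothetical illustration, reusing the `Entry` and memoizing `B` classes from the benchmark above): a memoized entry survives reparenting, so lookups return stale results unless the memo is invalidated in `parent=`.

```crystal
class Entry
  def initialize(@value : String)
  end

  def set(@value)
  end

  def get
    @value
  end
end

class B
  property! parent : B?
  getter table = {} of String => Entry

  @memo = {} of String => Entry

  def initialize(@parent)
  end

  def at?(name)
    @memo[name]? || (@memo[name] = parent?.try &.at?(name) || return)
  end

  def at(k, v : Entry)
    if e = at?(k)
      e.set(v.get)
    else
      table[k] = v
      @memo[k] = v
    end
  end
end

root1 = B.new(nil)
root1.at("foo", Entry.new("hello"))
root2 = B.new(nil)
root2.at("foo", Entry.new("world"))

child = B.new(root1)
child.at?("foo").not_nil!.get # => "hello" (now memoized in child)

# Reparent the child; the memo is never invalidated.
child.parent = root2
child.at?("foo").not_nil!.get # => "hello" — stale; should be "world"
```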

homonoidian commented 2 years ago

Tests profiled with valgrind:

Profile

homonoidian commented 2 years ago

In general, the slowest parts are Crystal's GC (which is expected), `Block#add` (`RealSubstrate#insert?` → allocation), `Block#at?(form)` (the topic of this issue), and `Block#instance` (`Array#dup` → allocation).

homonoidian commented 2 years ago

Another idea on this:

Maybe we can make use of some sort of superglobal table like `{Block, Form} => Entry` which we'd use to look up stuff in any block. This would require hashing blocks differently though, based on their object id or something like that.

For this, though, we'd also need a cache of the block parent tree, or at least some way to figure out which blocks we can look stuff up in (to emulate the linked list of blocks we have right now), and in which order. Rebuilding the parent tree on reparent/initialize into another superglobal hash table `Block => Array(Block)`, where the array holds the parents of the key `Block` from nearest to farthest, would also be useful. We then won't need `property parent` in `Block` either, because all it's used for is `at?` and lookup: `parent` can just inspect the hash table, `parent=` can just do `Hash#[]=`, and so on.
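The two superglobal tables could be sketched roughly like this. All names here (`Blk`, `LOOKUP`, `PARENTS`) are illustrative, not Novika's actual API, and `object_id` stands in for hashing blocks by identity:

```crystal
class Entry
  def initialize(@value : String)
  end

  def get
    @value
  end
end

class Blk
end

# {block id, name} => Entry: one flat table for all blocks.
LOOKUP = {} of {UInt64, String} => Entry

# block id => parents from nearest to farthest; would be rebuilt
# on reparent/initialize.
PARENTS = {} of UInt64 => Array(Blk)

# Emulates the current linked-list lookup: check the block itself,
# then each cached parent in order.
def at?(block : Blk, name : String) : Entry?
  if e = LOOKUP[{block.object_id, name}]?
    return e
  end
  if parents = PARENTS[block.object_id]?
    parents.each do |parent|
      if e = LOOKUP[{parent.object_id, name}]?
        return e
      end
    end
  end
  nil
end

root = Blk.new
child = Blk.new
LOOKUP[{root.object_id, "foo"}] = Entry.new("hello world")
PARENTS[child.object_id] = [root]

at?(child, "foo").not_nil!.get # => "hello world"
```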

This should speed up the language immensely. The hash table should never be GC'd either (if it is possible to somehow disable GC for a specific object; otherwise we'd also need to use our own table data structure), because its lifetime is the lifetime of the program. It would be logical for Crystal to disable GC on things stored in constants, because they never go out of scope, but I may not be aware of something.

Blocks also won't need to store `table` then. This has its disadvantages, because right now you can include `ITable` and have custom lookup mechanisms. We can mitigate that by delegating lookup to `table`, which would be a value that holds a pointer to the block it's part of; it can then look things up in the flat superglobal table however it wishes.
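The delegation idea could look something like this. This is a sketch under the assumption of a flat, identity-keyed global table; `ITable` is mentioned in the text, but `FlatTable` and `GLOBAL` are made-up names:

```crystal
class Entry
  def initialize(@value : String)
  end

  getter value
end

class Block
end

# Flat superglobal table, keyed by block identity (hypothetical).
GLOBAL = {} of {UInt64, String} => Entry

# Lookup is delegated to a table value, so custom lookup
# mechanisms remain possible.
abstract class ITable
  abstract def at?(name : String) : Entry?
end

# A table value that knows which block it belongs to and reads
# from the flat global table.
class FlatTable < ITable
  def initialize(@owner : Block)
  end

  def at?(name : String) : Entry?
    GLOBAL[{@owner.object_id, name}]?
  end
end

blk = Block.new
GLOBAL[{blk.object_id, "foo"}] = Entry.new("hello")
FlatTable.new(blk).at?("foo").not_nil!.value # => "hello"
```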

homonoidian commented 2 years ago

Unfortunately, it doesn't seem that caching will help. There are just too many blocks created, and even without caching they put a lot of strain on Crystal's GC.

So, adding a cache slows Novika down (or I'm dumb and didn't implement the caching right). That feels a bit unnatural, but really, I suppose it's because lookup hierarchies are fairly short, except for recursive calls (and those can be optimized by just comparing prototypes and instructing `instance` to reparent to the prototype instead of `self`; maybe that's why `at?` takes so much time).
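The prototype idea can be sketched like this (class and method names are illustrative, not Novika's actual API): reparenting an instance to its prototype rather than to `self` keeps lookup chains short under recursion.

```crystal
class Blk
  getter parent : Blk?
  getter proto : Blk

  def initialize(@parent : Blk?, proto : Blk? = nil)
    # The prototype is the original block this one was instantiated from.
    @proto = proto || self
  end

  # Reparent the instance to the caller: each recursive call adds
  # one more link to the lookup chain.
  def instance_to_self
    Blk.new(self, @proto)
  end

  # Reparent the instance to the prototype: chain depth stays
  # constant no matter how deep the recursion goes.
  def instance_to_proto
    Blk.new(@proto, @proto)
  end

  # Length of the parent chain starting at this block.
  def depth
    1 + (parent.try(&.depth) || 0)
  end
end

root = Blk.new(nil)

a = root
5.times { a = a.instance_to_self }
a.depth # => 6: grows linearly with recursion depth

b = root
5.times { b = b.instance_to_proto }
b.depth # => 2: stays flat
```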

There are a lot of "maybe"s though. I believe it's going to be better if I write some profiling tools for Novika and use them to optimize the Novika code (which is pretty badly written, btw). I don't have any ideas on how to optimize the Crystal code, other than rewriting it in C. But I don't want to rewrite it in C, and with very high probability it wouldn't be much faster anyway: the set of ideas I chose for core Novika isn't super fast even in isolation.

Implementing a read-only `eject` should help too, although currently it's not used widely enough to impact performance.

I'm closing this for now, because, again, I don't want to rewrite the code in C and I have no idea how to optimize it right now. Maybe I shouldn't. Premature optimization... you know how it goes.