gleam-lang / otp

📫 Fault tolerant multicore programs with actors
https://hexdocs.pm/gleam_otp/
Apache License 2.0
443 stars 49 forks

Actor performance degrades with Selector growth #75

Closed. chouzar closed this issue 1 month ago

chouzar commented 1 month ago

While running some stress tests on my registry library chip, I started to notice big slowdowns when the registry reaches the 100_000 monitored subjects mark; basic calls to the registry actor start to fail due to timeouts.

To work, chip rebuilds its selector in memory each time a new subject is registered. I tried to come up with the simplest example to reproduce this:

import gleam/erlang/process
import gleam/int
import gleam/io
import gleam/iterator
import gleam/option
import gleam/otp/actor

pub fn main() {
  io.println("Starting registry...")
  let assert Ok(registry) = start()

  io.println("Create and monitor 100_000 subjects...")
  iterator.range(from: 1, to: 100_000)
  |> iterator.each(fn(_) {
    let subject = process.new_subject()
    actor.send(registry, Monitor(subject))
  })

  io.println("Try to retrieve current value which is...")
  let value = actor.call(registry, CurrentValue(_), 100_000)
  io.println(int.to_string(value) <> "!")
}

type Message(msg) {
  Monitor(process.Subject(msg))
  Demonitor(process.ProcessMonitor)
  CurrentValue(process.Subject(Int))
}

type State(msg) {
  State(value: Int, selector: process.Selector(Message(msg)))
}

fn start() {
  actor.start_spec(actor.Spec(init: init, init_timeout: 10, loop: loop))
}

fn init() {
  let selector = process.new_selector()
  let state = State(0, selector)
  actor.Ready(state, selector)
}

fn loop(message: Message(msg), state: State(msg)) {
  case message {
    Monitor(subject) -> {
      let pid = process.subject_owner(subject)
      let monitor = process.monitor_process(pid)

      let on_process_down = fn(_: process.ProcessDown) { Demonitor(monitor) }

      let selector =
        state.selector
        |> process.selecting_process_down(monitor, on_process_down)

      let state = State(..state, selector: selector)
      actor.Continue(state, option.Some(selector))
    }

    Demonitor(monitor) -> {
      process.demonitor_process(monitor)
      actor.Continue(state, option.None)
    }

    CurrentValue(client) -> {
      process.send(client, state.value)
      actor.Continue(state, option.None)
    }
  }
}

The code above creates 100_000 subjects so they can be monitored, then tries to retrieve the actor's "value" (always 0, it never changes) through a normal call. These are the messages printed to the console:

➜  zilec git:(main) ✗ gleam run 
  Compiling zilec
   Compiled in 0.33s
    Running zilec.main
Starting registry...
Create and monitor 100_000 subjects...
Try to retrieve current value which is...
exception error: #{function => <<"call">>,line => 600,
                   message => <<"Assertion pattern match failed">>,
                   module => <<"gleam/erlang/process">>,
                   value => {error,call_timeout},
                   gleam_error => let_assert}
  in function  gleam@erlang@process:call/3 (/Users/chouzar/Bench/Playground/zilec/build/dev/erlang/gleam_erlang/_gleam_artefacts/gleam@erlang@process.erl, line 309)
  in call from zilec:main/0 (/Users/chouzar/Bench/Playground/zilec/build/dev/erlang/zilec/_gleam_artefacts/zilec.erl, line 77)

Increasing the timeout to a very large number like 10_000_000 will eventually return, but this is of course not ideal. I will look for other ways of building the selector, but I was wondering if there are any pointers on this behaviour of the actor. Is this a design bug?

lpil commented 1 month ago

This is expected, yes. It's intended that you have a low number of handlers, similar to Erlang's receive expressions having a low number of clauses.

receive likely performs better with hundreds of thousands of clauses than a selector does, but unfortunately we don't have a way to generate receive expressions at runtime or in a type safe manner other than through the selector abstraction.

chouzar commented 1 month ago

Appreciated! ✨

> likely performs better with hundreds of thousands of clauses

I do recall doing some tests with maps and clauses, and performance also had issues with very big maps.

> This is expected, yes. It's intended that you have a low number of handlers ...

Got it! Either I will design with this limitation in mind or try to find a workaround.

Will close this issue. Thank you so much!

chouzar commented 1 month ago

After working more on the benchmark itself, it seems the problem was the way I was queuing up messages in the actor: a large queue of 5000+ messages would make it very slow for some reason. By throttling the messages I'm able to get decent response times for an in-memory store:

Name                    ips        average  deviation         median         99th %
chip.find          462.84 K        2.16 μs   ±433.02%        1.96 μs        4.58 μs
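
For illustration, here is one way a sender could be throttled so the actor's queue never grows past a fixed size. This is only a sketch and not necessarily what the benchmark does: the `Increment` and `Sync` messages, the counter actor, and the batch size of 1_000 are all made up for the example, and it assumes the same gleam_otp version as the reproduction above (`actor.Continue(state, option.None)`).

```gleam
import gleam/erlang/process
import gleam/int
import gleam/io
import gleam/iterator
import gleam/option
import gleam/otp/actor

// Hypothetical messages for this sketch.
pub type Message {
  Increment
  // Replies with the current count; doubles as a "catch up" barrier.
  Sync(process.Subject(Int))
}

pub fn main() {
  let assert Ok(counter) = actor.start(0, loop)

  iterator.range(from: 1, to: 100_000)
  |> iterator.each(fn(i) {
    actor.send(counter, Increment)
    // Every 1_000 sends, block on a call. The reply only arrives once the
    // actor has handled everything queued before it, so at most 1_000
    // messages from this sender are ever waiting in the queue.
    case i % 1000 {
      0 -> {
        let _ = actor.call(counter, Sync, 5000)
        Nil
      }
      _ -> Nil
    }
  })

  let total = actor.call(counter, Sync, 5000)
  io.println("Handled " <> int.to_string(total) <> " messages")
}

fn loop(message: Message, count: Int) {
  case message {
    Increment -> actor.Continue(count + 1, option.None)
    Sync(client) -> {
      process.send(client, count)
      actor.Continue(count, option.None)
    }
  }
}
```

The synchronous call is used purely as back-pressure; a smaller batch size trades throughput for a shorter queue.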

Still, I will benchmark against an alternative: using the select anything helper, so that I neither have to build up the selector nor keep it in memory.
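
A rough sketch of what that alternative might look like, assuming the helper meant here is `process.selecting_anything` from gleam_erlang and the same gleam_otp version as the reproduction above. The `Other` variant is hypothetical, and decoding the raw monitor message is left out. The selector is built once in `init` with a single catch-all handler, so registering a subject no longer touches it:

```gleam
import gleam/dynamic.{type Dynamic}
import gleam/erlang/process
import gleam/option
import gleam/otp/actor

pub type Message(msg) {
  Monitor(process.Subject(msg))
  CurrentValue(process.Subject(Int))
  // Hypothetical catch-all: any message that did not arrive through the
  // actor's own subject, such as the raw message sent when a monitored
  // process goes down.
  Other(Dynamic)
}

pub fn start() {
  actor.start_spec(actor.Spec(init: init, init_timeout: 10, loop: loop))
}

fn init() {
  // One handler, registered once. The selector stays the same size no
  // matter how many processes are monitored later.
  let selector =
    process.new_selector()
    |> process.selecting_anything(Other)
  actor.Ready(0, selector)
}

// The state is just a count of catch-all messages seen so far.
fn loop(message: Message(msg), seen: Int) {
  case message {
    Monitor(subject) -> {
      // Start monitoring, but leave the selector untouched.
      let _monitor = process.monitor_process(process.subject_owner(subject))
      actor.Continue(seen, option.None)
    }

    Other(_raw) -> {
      // A real registry would decode the raw message here to find out
      // which process went down; that step is omitted in this sketch.
      actor.Continue(seen + 1, option.None)
    }

    CurrentValue(client) -> {
      process.send(client, seen)
      actor.Continue(seen, option.None)
    }
  }
}
```

The trade-off is that the catch-all handler receives untyped `Dynamic` values, so the type safety of one handler per monitor is exchanged for a constant-size selector.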