oxidecomputer / steno

distributed sagas
Apache License 2.0
107 stars 10 forks source link

better log saga start and completion #286

Closed davepacheco closed 2 months ago

davepacheco commented 2 months ago

It would be helpful for consumers if Steno always clearly logged when any saga is started and finished. Then consumers don't have to have a task waiting for saga completions just to log the result. Steno already has the saga name and saga_id, so these log messages can be pretty complete.

davepacheco commented 2 months ago

For testing, I modified the demo example to set the log level to INFO:

diff --git a/examples/demo-provision.rs b/examples/demo-provision.rs
index d953eb7..7431ca0 100644
--- a/examples/demo-provision.rs
+++ b/examples/demo-provision.rs
@@ -68,7 +68,7 @@ fn make_saga_id() -> SagaId {
 fn make_log() -> slog::Logger {
     let decorator = slog_term::TermDecorator::new().build();
     let drain = slog_term::FullFormat::new(decorator).build().fuse();
-    let drain = slog::LevelFilter(drain, slog::Level::Warning).fuse();
+    let drain = slog::LevelFilter(drain, slog::Level::Info).fuse();
     let drain = slog_async::Async::new(drain).build().fuse();
     slog::Logger::root(drain, slog::o!())
 }

Here's a successful run:

dap@ivanova steno $ cargo run  --example=demo-provision -- run 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/examples/demo-provision run`
Jun 28 21:52:22.579 INFO SEC running
Jun 28 21:52:22.579 INFO saga create, dag: {"end_node":13,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[1,3,null],[1,4,null],[5,6,null],[4,5,null],[6,7,null],[2,8,null],[3,8,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[12,0,null],[11,13,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_create","label":"InstanceCreate","name":"instance_id"}},{"Constant":{"name":"server_alloc_params","value":{"number_of_things":1}}},{"Action":{"action_name":"vpc_alloc_ip","label":"VpcAllocIp","name":"instance_ip"}},{"Action":{"action_name":"volume_create","label":"VolumeCreate","name":"volume_id"}},{"SubsagaStart":{"params_node_name":"server_alloc_params","saga_name":"server-alloc"}},{"Action":{"action_name":"server_pick","label":"ServerPick","name":"server_id"}},{"Action":{"action_name":"server_reserve","label":"ServerReserve","name":"server_reserve"}},{"SubsagaEnd":{"name":"server_alloc"}},{"Action":{"action_name":"instance_configure","label":"InstanceConfigure","name":"instance_configure"}},{"Action":{"action_name":"volume_attach","label":"VolumeAttach","name":"volume_attach"}},{"Action":{"action_name":"instance_boot","label":"InstanceBoot","name":"instance_boot"}},{"Action":{"action_name":"print","label":"Print","name":"print"}},{"Start":{"params":{"instance_name":"fake-o instance","number_of_instances":1}}},"End"]},"saga_name":"DemoVmProvision","start_node":12}, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
*** running saga ***
Jun 28 21:52:22.580 INFO saga start, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
running action: InstanceCreate (instance name: fake-o instance)
running action: VolumeCreate
running action: VpcAllocIp
running action: ServerPick
running action: ServerReserve
running action: InstanceConfigure
running action: VolumeAttach
running action: InstanceBoot
running action: Print
printing final state:
  instance id: 1211
  IP address: 10.120.121.122
  volume id: 1213
  server id: 1212
Jun 28 21:52:22.583 INFO update for saga cached state, new_state: Done, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:52:22.583 INFO saga finished, result: success, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
*** finished saga ***

*** final state ***
+ saga execution: 049b2522-308d-442e-bc65-9bfaef863597
+-- done: (start node)
+-- done: InstanceCreate (produces "instance_id")
+-- done: (constant = {"number_of_things":1}) (produces "server_alloc_params")
+-- (parallel actions):
        +-- done: VpcAllocIp (produces "instance_ip")
        +-- done: VolumeCreate (produces "volume_id")
        +-- done: (subsaga start: "server-alloc")
                +-- done: ServerPick (produces "server_id")
                +-- done: ServerReserve (produces "server_reserve")
        +-- done: (subsaga end) (produces "server_alloc")
+-- done: InstanceConfigure (produces "instance_configure")
+-- done: VolumeAttach (produces "volume_attach")
+-- done: InstanceBoot (produces "instance_boot")
+-- done: Print (produces "print")
+-- done: (end node)

result: SUCCESS
final output: "it worked"
Jun 28 21:52:22.583 INFO initiating shutdown

Here's a normal failure (I'm going to start redirecting stdout so we focus on the log itself):

dap@ivanova steno $ cargo run  --example=demo-provision -- run --inject-error=volume_attach > /dev/null
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/examples/demo-provision run --inject-error=volume_attach`
Jun 28 21:54:13.400 INFO SEC running
Jun 28 21:54:13.401 INFO saga create, dag: {"end_node":13,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[1,3,null],[1,4,null],[5,6,null],[4,5,null],[6,7,null],[2,8,null],[3,8,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[12,0,null],[11,13,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_create","label":"InstanceCreate","name":"instance_id"}},{"Constant":{"name":"server_alloc_params","value":{"number_of_things":1}}},{"Action":{"action_name":"vpc_alloc_ip","label":"VpcAllocIp","name":"instance_ip"}},{"Action":{"action_name":"volume_create","label":"VolumeCreate","name":"volume_id"}},{"SubsagaStart":{"params_node_name":"server_alloc_params","saga_name":"server-alloc"}},{"Action":{"action_name":"server_pick","label":"ServerPick","name":"server_id"}},{"Action":{"action_name":"server_reserve","label":"ServerReserve","name":"server_reserve"}},{"SubsagaEnd":{"name":"server_alloc"}},{"Action":{"action_name":"instance_configure","label":"InstanceConfigure","name":"instance_configure"}},{"Action":{"action_name":"volume_attach","label":"VolumeAttach","name":"volume_attach"}},{"Action":{"action_name":"instance_boot","label":"InstanceBoot","name":"instance_boot"}},{"Action":{"action_name":"print","label":"Print","name":"print"}},{"Start":{"params":{"instance_name":"fake-o instance","number_of_instances":1}}},"End"]},"saga_name":"DemoVmProvision","start_node":12}, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:54:13.402 INFO saga start, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
running action: InstanceCreate (instance name: fake-o instance)
running action: VolumeCreate
running action: ServerPick
running action: VpcAllocIp
running action: ServerReserve
running action: InstanceConfigure
Jun 28 21:54:13.404 INFO update for saga cached state, new_state: Unwinding, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:54:13.405 INFO update for saga cached state, new_state: Done, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:54:13.405 WARN saga finished, undo_result: success, action_error_source: InjectedError, action_error_node_name: "volume_attach", result: failure, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:54:13.405 INFO initiating shutdown

and here's a saga that gets stuck after an action failure followed by an unwind failure:

dap@ivanova steno $ cargo run  --example=demo-provision -- run --inject-error=volume_attach --inject-undo-error=volume_id > /dev/null
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/examples/demo-provision run --inject-error=volume_attach --inject-undo-error=volume_id`
Jun 28 21:53:23.591 INFO SEC running
Jun 28 21:53:23.592 INFO saga create, dag: {"end_node":13,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[1,3,null],[1,4,null],[5,6,null],[4,5,null],[6,7,null],[2,8,null],[3,8,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[12,0,null],[11,13,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_create","label":"InstanceCreate","name":"instance_id"}},{"Constant":{"name":"server_alloc_params","value":{"number_of_things":1}}},{"Action":{"action_name":"vpc_alloc_ip","label":"VpcAllocIp","name":"instance_ip"}},{"Action":{"action_name":"volume_create","label":"VolumeCreate","name":"volume_id"}},{"SubsagaStart":{"params_node_name":"server_alloc_params","saga_name":"server-alloc"}},{"Action":{"action_name":"server_pick","label":"ServerPick","name":"server_id"}},{"Action":{"action_name":"server_reserve","label":"ServerReserve","name":"server_reserve"}},{"SubsagaEnd":{"name":"server_alloc"}},{"Action":{"action_name":"instance_configure","label":"InstanceConfigure","name":"instance_configure"}},{"Action":{"action_name":"volume_attach","label":"VolumeAttach","name":"volume_attach"}},{"Action":{"action_name":"instance_boot","label":"InstanceBoot","name":"instance_boot"}},{"Action":{"action_name":"print","label":"Print","name":"print"}},{"Start":{"params":{"instance_name":"fake-o instance","number_of_instances":1}}},"End"]},"saga_name":"DemoVmProvision","start_node":12}, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:53:23.593 INFO saga start, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
running action: InstanceCreate (instance name: fake-o instance)
running action: VolumeCreate
running action: ServerPick
running action: VpcAllocIp
running action: ServerReserve
running action: InstanceConfigure
Jun 28 21:53:23.595 INFO update for saga cached state, new_state: Unwinding, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:53:23.596 INFO update for saga cached state, new_state: Done, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:53:23.596 ERRO saga finished, undo_error: PermanentFailure { source_error: Object {"message": String("undo action attempt 1: error injected")} }, undo_error_node_name: "volume_id", action_error_source: InjectedError, action_error_node_name: "volume_attach", result: failure, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:53:23.597 INFO initiating shutdown

Here, I did a normal run and saved the saga log to a file:

dap@ivanova steno $ cargo run  --example=demo-provision -- run --dump-to=mylog > /dev/null
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/examples/demo-provision run --dump-to=mylog`
Jun 28 21:56:21.532 INFO SEC running
Jun 28 21:56:21.533 INFO saga create, dag: {"end_node":13,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[1,3,null],[1,4,null],[5,6,null],[4,5,null],[6,7,null],[2,8,null],[3,8,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[12,0,null],[11,13,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_create","label":"InstanceCreate","name":"instance_id"}},{"Constant":{"name":"server_alloc_params","value":{"number_of_things":1}}},{"Action":{"action_name":"vpc_alloc_ip","label":"VpcAllocIp","name":"instance_ip"}},{"Action":{"action_name":"volume_create","label":"VolumeCreate","name":"volume_id"}},{"SubsagaStart":{"params_node_name":"server_alloc_params","saga_name":"server-alloc"}},{"Action":{"action_name":"server_pick","label":"ServerPick","name":"server_id"}},{"Action":{"action_name":"server_reserve","label":"ServerReserve","name":"server_reserve"}},{"SubsagaEnd":{"name":"server_alloc"}},{"Action":{"action_name":"instance_configure","label":"InstanceConfigure","name":"instance_configure"}},{"Action":{"action_name":"volume_attach","label":"VolumeAttach","name":"volume_attach"}},{"Action":{"action_name":"instance_boot","label":"InstanceBoot","name":"instance_boot"}},{"Action":{"action_name":"print","label":"Print","name":"print"}},{"Start":{"params":{"instance_name":"fake-o instance","number_of_instances":1}}},"End"]},"saga_name":"DemoVmProvision","start_node":12}, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:21.533 INFO saga start, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
running action: InstanceCreate (instance name: fake-o instance)
running action: VolumeCreate
running action: ServerPick
running action: VpcAllocIp
running action: ServerReserve
running action: InstanceConfigure
running action: VolumeAttach
running action: InstanceBoot
running action: Print
printing final state:
  instance id: 1211
  IP address: 10.120.121.122
  volume id: 1213
  server id: 1212
Jun 28 21:56:21.536 INFO update for saga cached state, new_state: Done, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:21.536 INFO saga finished, result: success, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:21.547 INFO initiating shutdown

and here I run recovery on it and resume:

dap@ivanova steno $ cargo run  --example=demo-provision -- run --recover-from=mylog > /dev/null
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/examples/demo-provision run --recover-from=mylog`
Jun 28 21:56:25.710 INFO SEC running
Jun 28 21:56:25.730 INFO saga resume, dag: {"end_node":13,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[1,3,null],[1,4,null],[5,6,null],[4,5,null],[6,7,null],[2,8,null],[3,8,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[12,0,null],[11,13,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_create","label":"InstanceCreate","name":"instance_id"}},{"Constant":{"name":"server_alloc_params","value":{"number_of_things":1}}},{"Action":{"action_name":"vpc_alloc_ip","label":"VpcAllocIp","name":"instance_ip"}},{"Action":{"action_name":"volume_create","label":"VolumeCreate","name":"volume_id"}},{"SubsagaStart":{"params_node_name":"server_alloc_params","saga_name":"server-alloc"}},{"Action":{"action_name":"server_pick","label":"ServerPick","name":"server_id"}},{"Action":{"action_name":"server_reserve","label":"ServerReserve","name":"server_reserve"}},{"SubsagaEnd":{"name":"server_alloc"}},{"Action":{"action_name":"instance_configure","label":"InstanceConfigure","name":"instance_configure"}},{"Action":{"action_name":"volume_attach","label":"VolumeAttach","name":"volume_attach"}},{"Action":{"action_name":"instance_boot","label":"InstanceBoot","name":"instance_boot"}},{"Action":{"action_name":"print","label":"Print","name":"print"}},{"Start":{"params":{"instance_name":"fake-o instance","number_of_instances":1}}},"End"]},"saga_name":"DemoVmProvision","start_node":12}, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:25.731 INFO saga start, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:25.731 INFO update for saga cached state, new_state: Done, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:25.731 INFO saga finished, result: success, saga_name: DemoVmProvision, saga_id: 049b2522-308d-442e-bc65-9bfaef863597
Jun 28 21:56:25.731 INFO initiating shutdown

What I'm looking for here is that for every saga run we have:

Together this means that if somebody's debugging from the log, they can reliably tell what sagas were run, with what arguments, how they finished, and timestamps for all of these.