imandra-ai / ocaml-opentelemetry

Instrumentation for https://opentelemetry.io
http://docs.imandra.ai/ocaml-opentelemetry/
33 stars 7 forks source link

Extract thread_local into separate library, abstract for lwt/eio-friendly continuation-local storage #34

Closed ELLIOTTCABLE closed 12 months ago

ELLIOTTCABLE commented 1 year ago

Breaking change: ocaml-trace versions before 0.4 are no longer compatible.


This PR makes two major changes:

  1. This replaces the Thread_local with ocaml-ambient-context, a library that:

    … provides a API that is type- and dependency-abstracted across thread-local storage, Lwt's sequence-associated storage, and Eio's https://github.com/ocaml-multicore/eio/pull/256 — all different approaches to "throwing" information "across the stack" at runtime, without modifying interleaving methods/components/dependencies.

  2. (breaking) Upgrades Opentelemetry_trace to the new (ambient-context compatible) API provided by trace.0.4.

"Ambient context" is, however, a leaky abstraction — it's not a straight drop-in, thoughtless replacement for TLS. This isn't a breaking change, because implicit-scope under Lwt was completely non-functional prior to this PR; but it does involve some user-education and documentation. (I've attempted to document the consequences of ambient-context throughout the README and docstrings; so hopefully, the PR is self-documenting.)

This replaces (and builds upon the lessons learned in) #32.

ELLIOTTCABLE commented 1 year ago

This should be ready for review, pending only upon me getting ambient-context pushed to the public opam repository.

tatchi commented 1 year ago

A little example I was playing around with stopped working as expected with this change. Here's the code

module Otel = Opentelemetry

let run () =
  Otel.Trace.with_ "first trace" @@ fun _scope ->
  Thread.delay 0.2;
  Otel.Trace.with_ "second trace" @@ fun _scope ->
  Thread.delay 0.2;
  Otel.Trace.with_ "third trace" @@ fun _scope ->
  Thread.delay 0.2;
  ()

let () =
  Otel.Globals.service_name := "my service";
  let config =
    Opentelemetry_client_ocurl.Config.make ~debug:true
      ~url:"http://localhost:4318" ()
  in
  Opentelemetry_client_ocurl.with_setup ~config () @@ fun () -> run ()

Spans used to be properly grouped before

Screenshot 2023-08-31 at 14 05 51

but with this change, here's what I get:

Screenshot 2023-08-31 at 14 03 37

I noticed that the scope we get from the Scope.get_ambient_scope was always missing (None).

After some debugging, here's the change that fixes it:

git diff src/opentelemetry.ml 
diff --git a/src/opentelemetry.ml b/src/opentelemetry.ml
index fac821c..15d33a4 100644
--- a/src/opentelemetry.ml
+++ b/src/opentelemetry.ml
@@ -740,7 +740,6 @@ module Trace = struct
     let span_id = Span_id.create () in
     let scope = { trace_id; span_id; events = []; attrs } in
     (* set global scope in this thread *)
-    Scope.with_ambient_scope scope @@ fun () ->
     (* called once we're done, to emit a span *)
     let finally res =
       let status =
@@ -759,7 +758,8 @@ module Trace = struct
       in
       emit ?service_name [ span ]
     in
-    let thunk () = cb scope in
+
+    let thunk () = Scope.with_ambient_scope scope @@ fun () -> cb scope in
     thunk, finally

   (** Sync span guard.

Here's a test that reproduces the issue

let%expect_test _ =
   let key = Ambient_context.create_key () in

   let get_ambient_context () = Ambient_context.get key in

   let cb () =
      print_endline
      @@
      match get_ambient_context () with
      | None -> "FAIL: None"
      | Some s -> Printf.sprintf "SUCCESS: %s" s
   in

   let with_' cb =
      Ambient_context.with_binding key "value" @@ fun () ->
      (* Set the scope *)
      let thunk () = cb () in
      thunk
      (* Remove the scope *)
   in
   let with_ cb =
      let thunk = with_' cb in
      (* Scope was removed before we execute the function *)
      let rv = thunk () in
      rv
   in

   with_ (fun () -> cb ()) ;

   [%expect {|
     FAIL: None |}]

So the problem is that we set the scope Ambient_context.with_binding in the with'_ function, but this function doesn't execute the cb but returns a function thunk that will execute it when called. And this thunk function is called later on when the scope is not available any more (it has already been removed).

IMO, this is a little subtle and can easily lead to unexpected results. I can't see how we could prevent that but I would still be curious to have your opinion

c-cube commented 1 year ago

nice catch. Can we turn this into a dune test (with a test stanza + .expected file)?

tatchi commented 1 year ago

nice catch. Can we turn this into a dune test (with a test stanza + .expected file)?

Sure, I'm travelling to ICFP next week, so I'm not sure if I'll have time. But happy to do it when I get back :)

ELLIOTTCABLE commented 1 year ago

I think what you ran into is actually a bug (well, logic error) in ambient-context, which I just caught in a lower-level test-case.

Basically, ambient-context wasn't (well, isn't, haven't fixed quite yet) playing well with sleeps: the lwt sequence-associated-storage is used to store the value, which would be fine, but TLS is being used to store "where do i look for the value" — and then by the time the get rolls around, that has been reset/restored to 'look for the value in TLS, not Lwt-SAS.' So it's retrieving a value of 'None.'

> [FAIL]        all             1   test sequenced storage.

┌──────────────────────────────────────────────────────────────────────────────┐
│ [FAIL]        all             1   test sequenced storage.                    │
└──────────────────────────────────────────────────────────────────────────────┘
with_storage_provider Storage_tls enter (previously Storage_lwt)
Storage_lwt: create_key 4
[1] set value to 'foo'
Storage_lwt: with_binding 4 enter
Storage_lwt: get 4 -> Some
[1] got value 'foo'
Storage_lwt: with_binding 4 exit
with_storage_provider Storage_tls exit (restoring Storage_lwt)
[2] set value to 'bar'
Storage_tls: with_binding 4 enter
Storage_tls: with_binding 4 exit
Storage_tls: get 4 -> None
[2] got value '<none>'
ASSERT second value retained
Storage_tls: get 4 -> None
[1] got value '<none>'
ASSERT first value retained
FAIL second value retained

   Expected: `Some "bar"'
   Received: `None'

Raised at Alcotest_engine__Test.check in file "src/alcotest-engine/test.ml", line 200, characters 4-261

l'oops. I think a fix for that will fix the issue you're seeing upstream here in ocaml-otel.

ELLIOTTCABLE commented 1 year ago

Okay, completely misunderstood the issue — it's indeed unrelated to ambient-context-lwt, your example doesn't even use Lwt.

I converted your test into something a little more automated (and much more ugly 😩.) This should also add a little infra for some more future testing, although I don't have the bandwidth to add that right this second.

Fix is also in.

tatchi commented 1 year ago

Nice, thanks for adding a test 🤗

ELLIOTTCABLE commented 1 year ago

With trace.0.4 (pending: ocaml/opam-repository#24487) and ambient-context.0.1 published to opam, this should be ready for a final review and merge. 🎉