Closed ELLIOTTCABLE closed 12 months ago
This should be ready for review, pending only upon me getting ambient-context
pushed to the public opam repository.
A little example I was playing around with stopped working as expected with this change. Here's the code
module Otel = Opentelemetry
let run () =
Otel.Trace.with_ "first trace" @@ fun _scope ->
Thread.delay 0.2;
Otel.Trace.with_ "second trace" @@ fun _scope ->
Thread.delay 0.2;
Otel.Trace.with_ "third trace" @@ fun _scope ->
Thread.delay 0.2;
()
let () =
Otel.Globals.service_name := "my service";
let config =
Opentelemetry_client_ocurl.Config.make ~debug:true
~url:"http://localhost:4318" ()
in
Opentelemetry_client_ocurl.with_setup ~config () @@ fun () -> run ()
Spans used to be properly grouped before
but with this change, here's what I get:
I noticed that the scope
we get from the Scope.get_ambient_scope
was always missing (None).
After some debugging, here's the change that fixes it:
git diff src/opentelemetry.ml
diff --git a/src/opentelemetry.ml b/src/opentelemetry.ml
index fac821c..15d33a4 100644
--- a/src/opentelemetry.ml
+++ b/src/opentelemetry.ml
@@ -740,7 +740,6 @@ module Trace = struct
let span_id = Span_id.create () in
let scope = { trace_id; span_id; events = []; attrs } in
(* set global scope in this thread *)
- Scope.with_ambient_scope scope @@ fun () ->
(* called once we're done, to emit a span *)
let finally res =
let status =
@@ -759,7 +758,8 @@ module Trace = struct
in
emit ?service_name [ span ]
in
- let thunk () = cb scope in
+
+ let thunk () = Scope.with_ambient_scope scope @@ fun () -> cb scope in
thunk, finally
(** Sync span guard.
Here's a test that reproduces the issue
let%expect_test _ =
let key = Ambient_context.create_key () in
let get_ambient_context () = Ambient_context.get key in
let cb () =
print_endline
@@
match get_ambient_context () with
| None -> "FAIL: None"
| Some s -> Printf.sprintf "SUCCESS: %s" s
in
let with_' cb =
Ambient_context.with_binding key "value" @@ fun () ->
(* Set the scope *)
let thunk () = cb () in
thunk
(* Remove the scope *)
in
let with_ cb =
let thunk = with_' cb in
(* Scope was removed before we execute the function *)
let rv = thunk () in
rv
in
with_ (fun () -> cb ()) ;
[%expect {|
FAIL: None |}]
So the problem is that we set the scope Ambient_context.with_binding
in the with'_
function, but this function doesn't execute the cb
but returns a function thunk
that will execute it when called. And this thunk
function is called later on when the scope
is not available any more (it has already been removed).
IMO, this is a little subtle and can easily lead to unexpected results. I can't see how we could prevent that but I would still be curious to have your opinion
nice catch. Can we turn this into a dune test (with a test stanza + .expected file)?
nice catch. Can we turn this into a dune test (with a test stanza + .expected file)?
Sure, I'm travelling to ICFP next week, so I'm not sure if I'll have time. But happy to do it when I get back :)
I think what you ran into is actually a bug (well, logic error) in ambient-context
, which I just caught in a lower-level test-case.
Basically, ambient-context
wasn't (well, isn't, haven't fixed quite yet) playing well with sleeps: the lwt sequence-associated-storage is used to store the value, which would be fine, but TLS is being used to store "where do i look for the value" — and then by the time the get rolls around, that has been reset/restored to 'look for the value in TLS, not Lwt-SAS.' So it's retrieving a value of 'None.'
> [FAIL] all 1 test sequenced storage.
┌──────────────────────────────────────────────────────────────────────────────┐
│ [FAIL] all 1 test sequenced storage. │
└──────────────────────────────────────────────────────────────────────────────┘
with_storage_provider Storage_tls enter (previously Storage_lwt)
Storage_lwt: create_key 4
[1] set value to 'foo'
Storage_lwt: with_binding 4 enter
Storage_lwt: get 4 -> Some
[1] got value 'foo'
Storage_lwt: with_binding 4 exit
with_storage_provider Storage_tls exit (restoring Storage_lwt)
[2] set value to 'bar'
Storage_tls: with_binding 4 enter
Storage_tls: with_binding 4 exit
Storage_tls: get 4 -> None
[2] got value '<none>'
ASSERT second value retained
Storage_tls: get 4 -> None
[1] got value '<none>'
ASSERT first value retained
FAIL second value retained
Expected: `Some "bar"'
Received: `None'
Raised at Alcotest_engine__Test.check in file "src/alcotest-engine/test.ml", line 200, characters 4-261
l'oops. I think a fix for that will fix the issue you're seeing upstream here in ocaml-otel.
Okay, completely misunderstood the issue — it's indeed unrelated to ambient-context-lwt
, your example doesn't even use Lwt.
I converted your test into something a little more automated (and much more ugly 😩.) This should also add a little infra for some more future testing, although I don't have the bandwidth to add that right this second.
Fix is also in.
Nice, thanks for adding a test 🤗
With trace.0.4
(pending: ocaml/opam-repository#24487) and ambient-context.0.1
published to opam, this should be ready for a final review and merge. 🎉
Breaking change:
ocaml-trace
versions before0.4
are no longer compatible.This PR makes two major changes:
This replaces the
Thread_local
withocaml-ambient-context
, a library that:(breaking) Upgrades
Opentelemetry_trace
to the new (ambient-context compatible) API provided bytrace.0.4
."Ambient context" is, however, a leaky abstraction — it's not a straight drop-in, thoughtless replacement for TLS. This isn't a breaking change, because implicit-scope under Lwt was completely non-functional prior to this PR; but it does involve some user-education and documentation. (I've attempted to document the consequences of
ambient-context
throughout the README and docstrings; so hopefully, the PR is self-documenting.)This replaces (and builds upon the lessons learned in) #32.