janestreet / memtrace

Streaming client for OCaml's Memprof
MIT License

Support 5.3 statmemprof #22

Open tmcgilchrist opened 7 months ago

tmcgilchrist commented 7 months ago

This PR makes the minimal set of changes needed to support https://github.com/ocaml/ocaml/pull/12923

I want to fix the raw_backtrace code to use the new functions in https://github.com/ocaml/ocaml/pull/9663

tmcgilchrist commented 6 months ago

Removing the use of Obj.magic ends up being rather awkward, so I'm leaving it as is.

ncik-roberts commented 5 months ago

@stedolan Does it make sense to merge this?

talex5 commented 3 months ago

I tried this (with the 5.2 backport):

$ opam switch create 5.2.0-statmemprof 5.2.0+statmemprof
$ opam pin memtrace.dev 'https://github.com/dra27/memtrace.git#5_1_statmemprof'

Running it on my program then failed when using 7 worker domains:

$ MEMTRACE=solver.ctf ./_build/default/stress/stress.exe --count=3 --internal-workers=7
Running in domain mode
[0install-solver.2.18; base-bigarray.base; base-domains.base; base-nnp.base;
 base-threads.base; base-unix.base; dune.3.16.0; host-arch-x86_64.1;
 host-system-other.1; ocaml.5.3.0; ocaml-config.3;
 ocaml-variants.5.3.0+trunk]

Solved warm-up request in: 1.39s
Running another 3 * 7 solves...
local: internal error, uncaught exception:
       File "src/memprof_tracer.ml", line 37, characters 2-8: Assertion failed

(code is at https://github.com/talex5/solver-service/tree/memtrace)

It did record a trace successfully with 3 domains, but then the viewer wouldn't read it:

$ memtrace-viewer solver.ctf
Processing solver.ctf...
(monitor.ml.Error "Assert_failure src/location_codec.ml:27:17"
 ("Raised at Memtrace__Location_codec.Mtf_table.swap in file \"src/location_codec.ml\", line 27, characters 17-29"
  "Called from Memtrace__Location_codec.Mtf_table.decode in file \"src/location_codec.ml\", line 59, characters 6-16"
  "Called from Memtrace__Location_codec.Reader.get_location.(fun) in file \"src/location_codec.ml\", lines 163-175, characters 8-23"
  "Called from Stdlib__List.init in file \"list.ml\", line 64, characters 25-28"
  "Called from Memtrace__Location_codec.Reader.get_location in file \"src/location_codec.ml\", lines 150-182, characters 15-66"
  "Called from Memtrace__Trace.iter.iter_events_of_packet in file \"src/trace.ml\", line 699, characters 24-71"
  "Called from Memtrace__Trace.iter.iter_packets in file \"src/trace.ml\", line 743, characters 8-50"
  "Called from Memtrace_viewer_native__Raw_trace.iter in file \"server/src/raw_trace.ml\", lines 89-93, characters 2-17"
  "Called from Memtrace_viewer_native__Filtered_trace.iter_and_return_interpreter in file \"server/src/filtered_trace.ml\", lines 765-766, characters 2-55"
  "Called from Memtrace_viewer_native__Filtered_trace.iter in file \"server/src/filtered_trace.ml\" (inlined), line 771, characters 20-82"
  "Called from Memtrace_viewer_native__Graph.full_graph_and_max_time in file \"server/src/graph.ml\", lines 12-23, characters 2-45"
  "Called from Memtrace_viewer_native__Graph.build in file \"server/src/graph.ml\", line 57, characters 29-59"
  "Called from Memtrace_viewer_native__User_state.Initial.of_trace in file \"server/src/user_state.ml\", line 32, characters 16-66"
  "Called from Memtrace_viewer_native.main in file \"server/src/memtrace_viewer_native.ml\", line 72, characters 22-55"
  "Called from Async_kernel__Monitor.Exported_for_scheduler.schedule'.upon_work_fill_i in file \"src/monitor.ml\", line 293, characters 42-51"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 180, characters 6-47"))

Is there something else I need to do to make this work?

OlivierNicole commented 2 months ago

I also got the same assert failure when trying to view a trace created using this fork. I created an issue at janestreet/memtrace_viewer#14. It’s unclear to me whether the problem lies in this PR or in memtrace_viewer.

tmcgilchrist commented 2 months ago

I can reproduce both failures using OCaml 5.2.0 with base.v0.16.3 on macOS (ARM64).

This one is for captures that error when running on multiple domains:

$ OCAMLRUNPARAM=b MEMTRACE=solver.ctf ./_build/default/stress/stress.exe --count=3 --internal-workers=4
Running in domain mode
[0install-solver.2.18; base-bigarray.base; base-domains.base; base-nnp.base;
 base-threads.base; base-unix.base; dune.3.16.0; host-arch-x86_64.1;
 host-system-other.1; ocaml.5.3.0; ocaml-config.3;
 ocaml-variants.5.3.0+trunk]

Solved warm-up request in: 2.14s
Running another 3 * 4 solves...

local: internal error, uncaught exception:
       File "memtrace/src/memprof_tracer.ml", line 37, characters 2-8: Assertion failed
       Raised at Stdlib__Domain.join in file "domain.ml", line 296, characters 16-24
       Called from Stdlib__List.iter in file "list.ml", line 112, characters 12-15
       Called from Dune__exe__Stress.main in file "stress/stress.ml", line 44, characters 2-35
       Called from Cmdliner_term.app.(fun) in file "cmdliner_term.ml", line 24, characters 19-24
       Called from Cmdliner_eval.run_parser in file "cmdliner_eval.ml", line 35, characters 37-44

And the memtrace-viewer error:

$ dune exec memtrace-viewer  solver.ctf
Processing solver.ctf...               
(monitor.ml.Error "Assert_failure memtrace/src/location_codec.ml:27:17"
 ("Raised at Memtrace__Location_codec.Mtf_table.swap in file \"memtrace/src/location_codec.ml\", line 27, characters 17-29"
  "Called from Memtrace__Location_codec.Mtf_table.decode in file \"memtrace/src/location_codec.ml\", line 59, characters 6-16"
  "Called from Memtrace__Location_codec.Reader.get_location.(fun) in file \"memtrace/src/location_codec.ml\", lines 177-181, characters 8-22"
  "Called from Stdlib__List.init in file \"list.ml\", line 64, characters 25-28"
  "Called from Memtrace__Location_codec.Reader.get_location in file \"memtrace/src/location_codec.ml\", lines 150-182, characters 15-66"
  "Called from Memtrace__Trace.iter.iter_events_of_packet in file \"memtrace/src/trace.ml\", line 699, characters 24-71"
  "Called from Memtrace__Trace.iter.iter_packets in file \"memtrace/src/trace.ml\", line 743, characters 8-50"
  "Called from Memtrace_viewer_native__Raw_trace.iter in file \"memtrace_viewer/server/src/raw_trace.ml\", lines 89-93, characters 2-17"
  "Called from Memtrace_viewer_native__Filtered_trace.iter_and_return_interpreter in file \"memtrace_viewer/server/src/filtered_trace.ml\", lines 765-766, characters 2-55"
  "Called from Memtrace_viewer_native__Filtered_trace.iter in file \"memtrace_viewer/server/src/filtered_trace.ml\", line 771, characters 20-82"
  "Called from Memtrace_viewer_native__Graph.full_graph_and_max_time in file \"memtrace_viewer/server/src/graph.ml\", lines 12-23, characters 2-45"
  "Called from Memtrace_viewer_native__Graph.build in file \"memtrace_viewer/server/src/graph.ml\", line 57, characters 29-59"
  "Called from Memtrace_viewer_native__User_state.Initial.of_trace in file \"memtrace_viewer/server/src/user_state.ml\", line 32, characters 16-66"
  "Called from Memtrace_viewer_native.main in file \"memtrace_viewer/server/src/memtrace_viewer_native.ml\", line 72, characters 22-55"
  "Called from Async_kernel__Monitor.Exported_for_scheduler.schedule'.upon_work_fill_i in file \"src/monitor.ml\", line 293, characters 42-51"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 180, characters 6-47"))
tmcgilchrist commented 2 months ago

TL;DR: recording single-domain traces seems fine; handling multiple domains requires a new trace file version.

Reporting what I've found so far in long form.

The error when recording a trace file, File "memtrace/src/memprof_tracer.ml", line 37, characters 2-8: Assertion failed, is caused by concurrent access to the internal mutable data structures at src/memprof_tracer.ml#L1-L8. When running across multiple domains, the callbacks can run on different domains and try to access or modify that structure, putting it into an inconsistent state. Wrapping it with a mutex or Kcas, as in https://github.com/tmcgilchrist/memtrace/commit/d0cc680f1a310f16b7f8591ad4e5bf0570bd0fd6, avoids the error.

However, the current memtrace file format is not designed for recording multiple domains with differing timestamps: the file reader asserts certain monotonic timestamp properties that do not hold in that case. On a simpler test program that spawns domains, like the following,

(* fib_par.ml *)
let num_domains = try int_of_string Sys.argv.(1) with _ -> 1
let n = try int_of_string Sys.argv.(2) with _ -> 1

(* Sequential Fibonacci *)
let rec fib n =
  if n < 2 then 1 else fib (n - 1) + fib (n - 2)

module T = Domainslib.Task

let rec fib_par pool n =
  let _ = Buffer.create 10000 in
  if n > 20 then begin
      let a = T.async pool (fun _ -> fib_par pool (n-1)) in
      let b = T.async pool (fun _ -> fib_par pool (n-2)) in
      T.await pool a + T.await pool b
    end else
    (* Call sequential Fibonacci if the available work is small *)
    fib n

let main () =
  Memtrace.trace_if_requested ~context:"fib" ();
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  let res = T.run pool (fun _ -> fib_par pool n) in
  T.teardown_pool pool;
  Printf.printf "fib(%d) = %d\n" n res

let _ = main ()

I can generate traces and read them successfully, but I think that is just good luck.
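The mutex wrapping described above can be sketched roughly as follows. This is illustrative only: the record fields and function names here are hypothetical stand-ins, not memtrace's actual memprof_tracer.ml state.

```ocaml
(* Sketch of serializing tracer state behind a mutex (OCaml 5 stdlib
   Mutex).  Field names are hypothetical, not memtrace's real record. *)
type tracer = {
  mutex : Mutex.t;          (* guards the mutable fields below *)
  mutable stopped : bool;
  mutable samples : int;    (* stand-in for the real trace buffer *)
}

let create () = { mutex = Mutex.create (); stopped = false; samples = 0 }

(* Run [f] with the tracer locked; unlock even if [f] raises. *)
let with_lock t f =
  Mutex.lock t.mutex;
  Fun.protect ~finally:(fun () -> Mutex.unlock t.mutex) f

(* A Memprof callback can fire on any domain, so every access to the
   mutable fields goes through the lock. *)
let record_sample t =
  with_lock t (fun () -> if not t.stopped then t.samples <- t.samples + 1)
```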

Limiting the trace capture to a single domain usually yields a valid trace file that can be opened with memtrace_viewer. This capture is from running MEMTRACE=solver.ctf ./_build/default/stress/stress.exe --count=30 --internal-workers=1

Screenshot 2024-08-02 at 10 59 58 AM

The second error, "Assert_failure memtrace/src/location_codec.ml:27:17", is caused by missing backtraces when recording the move-to-front encoding. I haven't found the root cause for that yet, but if you creatively relax the file read constraints (https://github.com/tmcgilchrist/memtrace/commit/06154cdeb81851157ab5e173b68a267936e37c6a) you can read @OlivierNicole's file.

Screenshot 2024-08-02 at 11 12 30 AM
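For readers unfamiliar with the encoding, a toy move-to-front table (illustrative only, not memtrace's actual Mtf_table) shows the failure mode: the decoder trusts each index to point at a populated slot, so a stream with missing entries produces an index past the populated region, analogous to the assertion above.

```ocaml
(* Toy move-to-front table.  Decoding index [i] returns the [i]th most
   recently seen item and moves it to the front; encoding returns the
   item's current index, or [None] for a first occurrence.  An index past
   the populated region indicates a corrupt stream. *)
module Mtf = struct
  type 'a t = { mutable entries : 'a list }  (* most recent first *)

  let create () = { entries = [] }

  let remove_at i l = List.filteri (fun j _ -> j <> i) l

  let decode t i =
    match List.nth_opt t.entries i with
    | None -> invalid_arg "Mtf.decode: index beyond populated entries"
    | Some x -> t.entries <- x :: remove_at i t.entries; x

  let encode t x =
    let rec find i = function
      | [] -> None
      | y :: _ when y = x -> Some i
      | _ :: tl -> find (i + 1) tl
    in
    let idx = find 0 t.entries in
    (match idx with
     | None -> t.entries <- x :: t.entries
     | Some i -> t.entries <- x :: remove_at i t.entries);
    idx
end
```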
lukemaurer commented 2 weeks ago

> The second error in "Assert_failure memtrace/src/location_codec.ml:27:17" is caused by missing backtraces when recording the Move-to-front encoding. I haven't found the root cause for that yet, but if you creatively relax the file read constraints [tmcgilchrist@06154cd]

There was indeed a bug (fixed by #23) involving the MTF cache, and having looked at @OlivierNicole's trace I think it was at play there. @stedolan will know more than me, but I imagine for multi-domain support we'll just need a proper mutex instead of this locked flag in memprof_tracer.ml. Then timestamp monotonicity shouldn't be an issue because taking the timestamp and adding the sample to the trace happen in one atomic operation.
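The monotonicity argument can be illustrated with a small sketch (names are hypothetical, and a plain counter stands in for the real clock): because reading the timestamp and appending the event happen inside one critical section, the stored order always agrees with timestamp order, no matter how many domains record concurrently.

```ocaml
(* Sketch: take the "timestamp" and append the event under one mutex, so
   no later event can ever be written with an earlier timestamp.  A
   counter stands in for the real clock. *)
let mutex = Mutex.create ()
let clock = ref 0
let events : (int * string) list ref = ref []   (* newest first *)

let record label =
  Mutex.lock mutex;
  Fun.protect
    ~finally:(fun () -> Mutex.unlock mutex)
    (fun () ->
      incr clock;
      events := (!clock, label) :: !events)

(* Timestamps strictly decrease from head (newest) to tail (oldest). *)
let monotonic () =
  let rec ok = function
    | (a, _) :: ((b, _) :: _ as tl) -> a > b && ok tl
    | _ -> true
  in
  ok !events
```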

tmcgilchrist commented 2 weeks ago

Thanks @lukemaurer I'll pick this back up and add a proper mutex (rather than using kcas) around things. It would be useful to have this fully working and tested for the official 5.3 release.

tmcgilchrist commented 1 week ago

3789a1f puts a single Mutex around writing the CTF file and removes the two mutable fields locked and locked_ext. That (plus #23) is giving me consistent CTF files across multiple domains. I don't have an application that uses the ext_alloc and ext_free functions so I could be missing a use case there.

Running @talex5 's example above from solver-service as:

# Run with 6 domains
 MEMTRACE=solver-mutex-6-opam.2.3.0.ctf ./_build/default/stress/stress.exe local --count=120 --internal-workers=6 --cache-dir=./cache -q 

is working.

Note: since this uses a Mutex, there may be more blocking than you would want. The upside is that it will behave consistently across 4.14 and 5.3.