janestreet / async_unix

Jane Street Capital's asynchronous execution library (unix)
MIT License
33 stars 21 forks source link

Assert failure in Writer0.add_iovec #17

Closed brendanlong closed 4 years ago

brendanlong commented 5 years ago

Unfortunately our logs format this in an annoying way, but we're getting errors like this relatively often and I'm not sure what's going on or how we should track this down:

  | Sep 16 05:02:02.670 | ((pid 8) (thread_id 2)))))
  | Sep 16 05:02:02.670 | ("Caught by monitor (id 2)"))
  | Sep 16 05:02:02.670 | (open_flags (Full (Ok (wronly)))) (line_ending Unix))))
  | Sep 16 05:02:02.670 | (consumer_left Empty) (raise_when_consumer_leaves true)
  | Sep 16 05:02:02.670 | (bytes_seen 60899842) (too_old Empty))))
  | Sep 16 05:02:02.670 | (bytes_received_queue ()) (times_received_queue ())
  | Sep 16 05:02:02.670 | (bytes_received_at_now_minus_maximum_age 0)
  | Sep 16 05:02:02.670 | (((writer <opaque>) (maximum_age 2m)
  | Sep 16 05:02:02.670 | (check_buffer_age
  | Sep 16 05:02:02.670 | (num_producers_to_flush_at_close 0) (flush_at_shutdown_elt (<opaque>))
  | Sep 16 05:02:02.670 | (close_state Open) (close_finished Empty) (close_started Empty)
  | Sep 16 05:02:02.670 | (scheduled_bytes 272) (scheduled_back 272) (back 1073)
  | Sep 16 05:02:02.670 | (syscall Per_cycle) (bytes_received 61206728) (bytes_written 61205655)
  | Sep 16 05:02:02.670 | (background_writer_state Running) (background_writer_stopped Empty)
  | Sep 16 05:02:02.670 | (is_detached false))))
  | Sep 16 05:02:02.670 | ((name main) (here ()) (id 1) (has_seen_error false)
  | Sep 16 05:02:02.670 | (is_detached true))
  | Sep 16 05:02:02.670 | (((name (id 3)) (here ()) (id 3) (has_seen_error true)
  | Sep 16 05:02:02.670 | (inner_monitor
  | Sep 16 05:02:02.670 | (is_detached false))))
  | Sep 16 05:02:02.670 | ((name main) (here ()) (id 1) (has_seen_error false)
  | Sep 16 05:02:02.670 | (is_detached false))
  | Sep 16 05:02:02.670 | (((name (id 2)) (here ()) (id 2) (has_seen_error false)
  | Sep 16 05:02:02.670 | (monitor
  | Sep 16 05:02:02.670 | (close_finished Empty)))
  | Sep 16 05:02:02.670 | (watching_has_changed false) (num_active_syscalls 0)
  | Sep 16 05:02:02.670 | (watching ((read Not_watching) (write Not_watching)))
  | Sep 16 05:02:02.670 | (have_set_nonblock false) (state (Open Empty))
  | Sep 16 05:02:02.670 | ((file_descr 1) (info <stdout>) (kind Fifo) (supports_nonblock false)
  | Sep 16 05:02:02.670 | (fd
  | Sep 16 05:02:02.670 | ((id 0)
  | Sep 16 05:02:02.670 | (writer
  | Sep 16 05:02:02.670 | ("Writer error from inner_monitor" "Assert_failure src/writer0.ml:834:2"
  | Sep 16 05:02:02.670 | ((monitor.ml.Error
  | Sep 16 05:02:02.670 | ("unhandled exception"
  | Sep 16 05:02:02.670 | "unhandled exception in Async scheduler"
  | Sep 16 05:02:02.670 | (((pid 8) (thread_id 0)) "2019-09-16 09:02:02.668802059Z"

The assert on line 834 is https://github.com/janestreet/async_unix/blob/v0.11.0/src/writer0.ml#L834

let add_iovec t kind (iovec : _ IOVec.t) ~count_bytes_as_received =
  assert (t.scheduled_back = t.back);
  if count_bytes_as_received then (got_bytes t iovec.len);
  if not (is_stopped_permanently t)
  then (
    t.scheduled_bytes <- t.scheduled_bytes + iovec.len;
    Deque.enqueue_back t.scheduled (iovec, kind));
  assert (t.scheduled_back = t.back); (* <--- here *)

We're using v0.11 but this function seems unchanged since then. I could try moving to a later version if you think that would help, although the base64 stuff is going to make that really difficult for us :(

Any ideas what's wrong? Since it's an assert I'm guessing this is supposed to be impossible, but we could also be doing something wrong.

brendanlong commented 4 years ago

I looked into this more and the assert failure is impossible (the same assertion is repeated at the top of the file and neither of these variable change inside the function), so I assume it's related to a threading issue in our code.