NIF panics when producing too much input in too short of a time

hansihe commented 7 years ago

Not sure if there is any way to actually fix this. This is probably because Erlang has stdout set to non-blocking mode, and a buffer gets filled up somewhere.

scrogson commented 7 years ago

Can you paste some example code that produces this scenario?

hansihe commented 7 years ago

@scrogson

pub fn print_panic<'a>(env: NifEnv<'a>, _args: &Vec<NifTerm<'a>>) -> NifResult<NifTerm<'a>> {
    for _ in 0..1000 {
        println!("testtesttesttesttesttesttesttesttesttesttesttest");
    }

    Ok(true.encode(env))
}

scrogson commented 7 years ago

hmm...but this isn't real world use case, is it? Seems only useful for debugging purposes.

hansihe commented 7 years ago

Which is a very real world use case. I have hit this issue numerous times when writing different kinds of NIFs.

jorendorff commented 7 years ago

Yeah, this has hit me more than once, just trying to debug misbehaving tests.

Rustler can provide its own print! and println! macros that work around this.

zhaohansprt commented 7 years ago

erlnif have some api can provide async thread spawn which can aviod such situations that going to block the beam

jorendorff commented 7 years ago

Rustler has APIs like that too, but they don't do anything to solve this problem.

I still think just replacing print! and println! is the way to go here.

zhaohansprt commented 7 years ago

cool!

evnu commented 5 years ago

Using println! repeatedly is a costly operation, as a lock has to be acquired on each invocation. By acquiring the lock only once, this can possibly be worked around. The following works on my machine without a panic:

use rustler::{Encoder, NifResult};
use std::io::Write;
fn crash<'a>(env: Env<'a>, _: &[Term]) -> NifResult<Term<'a>> {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for _ in 0..10000 {
        writeln!(&mut lock, "testtesttesttesttesttesttesttesttesttesttesttest");
    }

    Ok(true.encode(env))
}

Of course, this does not solve the problem, as simply increasing the iterations will still lead to panics. 100_000 iterations of the loop above result in a panic here.

shribe commented 4 years ago

FWIW, I just hit this problem with only a few KB, (one terminal screen full or so, no loops, just individual logging statements), and it cost me a day chasing down where my panic was coming from, as I assumed I had a logic error somewhere. As we may not have good source-level debugging tools for Rust code running inside the BEAM, early development can be highly-dependent on debug messages.

That said, there is a workaround to at least avoid the crash:

use gag::Redirect;
…
  let log = OpenOptions::new()
    .truncate(true)
    .read(true)
    .create(true)
    .write(true)
    .open("/var/log/…")
    .unwrap();

  let print_redirect = Redirect::stdout(log).unwrap();
  let pd_heap = Box::new(print_redirect);
  let _ = Box::into_raw(pd_heap);

This has the awkward effect of segregating output of IO.put/inspect from the NIF output, but it's better than mysterious panics.

evnu commented 4 years ago

I can trigger it reliably with https://github.com/evnu/rustler/commit/6d1ce19271193806f2a2270482c0e5eadbfee17c if I run it through rust-gdb and an ERTS Debug VM. When running the example directly with rust 1.40 and alacritty, the panic happens very rarely here.

The panic happens here, after the output should have been written to stdout:

    if let Err(e) = result {
        panic!("failed printing to {}: {}", label, e);
    }

Alas, I haven't been able to get the values of label or e. A stacktrace can be reproduced as follows. Note that I manually add the rust source as a directory in gdb.

$ rust-gdb --args bash $ERL_TOP/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs
(gdb) directory /home/evnu/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust
(gdb) b rust_panic
(gdb) run
...
line = 353428
line = 353429[Switching to Thread 0x7fffb4279700 (LWP 28195)]

Thread 7 "3_scheduler" hit Breakpoint 1, rust_panic () at src/libstd/panicking.rs:513
513         __rust_start_panic(obj as usize)
(gdb) bt
#0  rust_panic () at src/libstd/panicking.rs:513
....
#4  0x00007fff37d1a812 in std::io::stdio::print_to () at src/libstd/io/stdio.rs:792
#5  std::io::stdio::_print () at src/libstd/io/stdio.rs:802
#6  0x00007fff37cb8a9c in <rustler_test::test_panic::too_much_output as rustler::nif::Nif>::RAW_FUNC::nif_func::wrapper::{{closure}}::too_much_output (n=10000000) at rustler_tests/native/rustler_test/src/test_panic.rs:4
...
#10 0x00007fff37ca0ae7 in std::panicking::try (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panicking.rs:265
#11 0x00007fff37c417a4 in std::panic::catch_unwind (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panic.rs:396
#12 0x00007fff37cb891e in <rustler_test::test_panic::too_much_output as rustler::nif::Nif>::RAW_FUNC::nif_func::wrapper (env=..., args=...) at rustler_tests/native/rustler_test/src/test_panic.rs:1
#13 0x00007fff37cb8879 in <rustler_test::test_panic::too_much_output as
....

scrogson commented 4 years ago

@evnu can you swap the use of println! with env_logger and log crates and see if you can trigger a panic? So far I've had good luck with that combination in https://github.com/scrogson/franz

evnu commented 4 years ago

I added env_logger to the test with https://github.com/evnu/rustler/commit/ec6427d6ee1bf09263cd0c8bec7ba2e0a9430dd2. I can no longer reproduce the panic with gdb:

RUST_LOG=info rust-gdb --args bash /home/evnu/tools/otp/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs

evnu commented 4 years ago

I tried to get some more info using lldb. While I wasn't able to break at rust_panic, I got some more info on the specific error. Maybe this can help to track down the cause.

thread '<unnamed>' panicked at 'failed printing to stdout: Resource temporarily unavailable (os error 11)', src/libstd/io/stdio.rs:792:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

I ran lldb like this:

$ echo '/home/evnu/tools/otp/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs' > run.sh
$ rust-lldb -- bash run.sh

Error 11 = EAGAIN:

~/tools/rustler(72-panic-output ✗) errno 11
EAGAIN 11 Resource temporarily unavailable

evnu commented 1 year ago

Closing, I think we cannot do something here.

ishitatsuyuki commented 1 year ago

It looks like setting stdio as nonblocking is generally considered bad practice as it not only propagates to external libraries but also affects every program sharing the tty.

Turns out it surely had caused other complaints upstream for the reason I mentioned: https://github.com/erlang/otp/issues/3150. It looks like a new tty driver implementation is in the pipeline and will hopefully fix this issue in a future release: https://github.com/erlang/otp/commit/bd0865ff86f06bba13a99e38db5d1f0a124e5cb3.

rusterlium / rustler

NIF panics when producing too much input in too short of a time #72