raskr / rust-autograd

Tensors and differentiable operations (like TensorFlow) in Rust
MIT License

`cnn_mnist` and `lstm_lm` examples runtime failure #27

quietlychris closed 4 years ago

quietlychris commented 4 years ago

Hi! I'm having some trouble with running a couple of the examples from your project. My process was the following:

$ git clone https://github.com/raskr/rust-autograd.git
$ cd rust-autograd/examples
$ ./download_mnist.sh
$ RUST_BACKTRACE=1 cargo run --example cnn_mnist

which results in the following error. I've compiled it both with and without the --features mkl flag, and with and without the --release flag, neither of which seems to have any effect on these errors. As far as I can tell, the two failures don't trace back to the same problem, but I may be mistaken. I'm running rustc 1.46.0-nightly (feb3536eb 2020-06-09) on Pop!_OS 20.04 LTS, which closely mirrors Ubuntu. If it would be easier for me to open separate issues for each of these, I would be happy to do so.

However, I am able to compile and run the mlp_mnist example without issue.

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `4`,
 right: `2`', /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/macros.rs:16:9
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1076
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1537
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
  11: rust_begin_unwind
             at src/libstd/panicking.rs:385
  12: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:339
  13: ndarray::impl_methods::<impl ndarray::ArrayBase<S,D>>::slice_collapse
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/macros.rs:16
  14: ndarray::impl_methods::<impl ndarray::ArrayBase<S,D>>::slice_move
             at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.12.1/src/impl_methods.rs:325
  15: ndarray::impl_methods::<impl ndarray::ArrayBase<S,D>>::slice
             at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.12.1/src/impl_methods.rs:289
  16: cnn_mnist::main::{{closure}}
             at examples/cnn_mnist.rs:94
  17: autograd::graph::with
             at /home/chrism/rust-projects/rust-autograd/src/graph.rs:91
  18: cnn_mnist::main
             at examples/cnn_mnist.rs:66
  19: std::rt::lang_start::{{closure}}
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/rt.rs:67
  20: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  21: std::panicking::try::do_call
             at src/libstd/panicking.rs:297
  22: std::panicking::try
             at src/libstd/panicking.rs:274
  23: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  24: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  25: std::rt::lang_start
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/rt.rs:67
  26: main
  27: __libc_start_main
  28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

When running $ RUST_BACKTRACE=1 cargo run --example lstm_lm, I get the following error:

thread 'main' panicked at 'lhs input for MatMul must be 2D: ShapeError/IncompatibleShape: incompatible shapes', /home/chrism/rust-projects/rust-autograd/src/ops/dot_ops.rs:536:21
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1076
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1537
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
  11: rust_begin_unwind
             at src/libstd/panicking.rs:385
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:86
  13: core::option::expect_none_failed
             at src/libcore/option.rs:1272
  14: core::result::Result<T,E>::expect
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libcore/result.rs:963
  15: <autograd::ops::dot_ops::MatMul as autograd::op::Op<T>>::compute
             at /home/chrism/rust-projects/rust-autograd/src/ops/dot_ops.rs:536
  16: autograd::runtime::<impl autograd::graph::Graph<F>>::eval::{{closure}}
             at /home/chrism/rust-projects/rust-autograd/src/runtime.rs:390
  17: core::result::Result<T,E>::and_then
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libcore/result.rs:729
  18: autograd::runtime::<impl autograd::graph::Graph<F>>::eval
             at /home/chrism/rust-projects/rust-autograd/src/runtime.rs:388
  19: autograd::test_helper::check_theoretical_grads
             at /home/chrism/rust-projects/rust-autograd/src/test_helper.rs:21
  20: lstm_lm::main::{{closure}}
             at examples/lstm_lm.rs:94
  21: autograd::graph::with
             at /home/chrism/rust-projects/rust-autograd/src/graph.rs:91
  22: lstm_lm::main
             at examples/lstm_lm.rs:66
  23: std::rt::lang_start::{{closure}}
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/rt.rs:67
  24: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  25: std::panicking::try::do_call
             at src/libstd/panicking.rs:297
  26: std::panicking::try
             at src/libstd/panicking.rs:274
  27: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  28: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  29: std::rt::lang_start
             at /rustc/feb3536eba10c2e4585d066629598f03d5ddc7c6/src/libstd/rt.rs:67
  30: main
  31: __libc_start_main
  32: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

raskr commented 4 years ago

@quietlychris

Oh sorry, that's a bug newly introduced while refactoring the examples... The dataset code is correct for the multi-layer perceptron example, but not for the convnet. It should be:

// x_train and x_test should be 4D (not 2D)
let x_train = as_arr(ndarray::IxDyn(&[num_image_train, 1, 28, 28]), train_x).unwrap();
let x_test = as_arr(ndarray::IxDyn(&[num_image_test, 1, 28, 28]), test_x).unwrap();

I'll fix it later!

quietlychris commented 4 years ago

Thanks for getting back to me so quickly! That's great to hear. I applied that fix to the mnist_data.rs file, and it definitely solved the cnn_mnist example, although I'm still seeing the same problem on the lstm_lm one, which I think ends up being due to a mismatch in the dimensions of the gradients and feed objects during the graph.eval() step in the test helper module. I haven't had time to put together a fix for it, but if I find one, I might submit a pull request.

Just as an aside, huge thanks for putting this crate together. I've been playing around with doing some machine learning in Rust using only linear algebra in a proof-of-concept here, but I'm seriously considering re-writing it so that it's more or less an API wrapper around autograd rather than continuing to re-invent the wheel, especially since I'm not sure I could match the quality of this code anyway.

quietlychris commented 4 years ago

Just as an update, I've been working on figuring out why this bug is appearing in the mlp_mnist example after that fix is applied, and it definitely looks like the issue is the same one that has appeared in lstm_lm.

The issue occurs because the

g.eval(update_ops, &[x.given(x_batch), y.given(y_batch)]);

line internally ends up calling the ops/dot_ops.rs function compute() which uses

let mut a = ctx
    .input(0)
    .into_dimensionality::<ndarray::Ix2>()
    .expect("lhs input for MatMul must be 2D");

Print debugging gets me to just before this function is called, and then it panics, because converting the 4D input of shape ctx.input(0).shape() = [200, 1, 28, 28] (200 being the default batch size) to 2D fails. I've tried other ndarray functions like broadcast(), but wasn't able to get any of them working.

The lstm_lm example has a three-dimensional input shape of ctx.input(0).shape() = [2,1,4], and panics at exactly the same spot.

By comparison, in the cnn_mnist example, ctx.input(0).shape() = [200, 3136] satisfies the two-dimensional requirement, so it doesn't panic.
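The three shapes above can be compared with a small std-only sketch. It only models the rank check: into_dimensionality::<Ix2>() converts the dimension type and does not reshape, so it returns an error for any input that does not already have exactly two axes. The helpers require_2d and flatten_to_2d are hypothetical names made up for this illustration; they are not part of ndarray or autograd.

```rust
// Std-only model of the rank check performed before MatMul.
// `require_2d` is a hypothetical helper, not an ndarray or autograd function.
fn require_2d(shape: &[usize]) -> Result<(usize, usize), String> {
    match shape {
        // Exactly two axes: the conversion to a 2D view can succeed.
        [rows, cols] => Ok((*rows, *cols)),
        // Any other rank is rejected; nothing is reshaped implicitly.
        _ => Err(format!("expected a 2D shape, got {}D: {:?}", shape.len(), shape)),
    }
}

// Hypothetical helper: collapse every axis after the first, so a batched
// input of any rank becomes (batch, features). Returns None for a 0-D shape.
fn flatten_to_2d(shape: &[usize]) -> Option<(usize, usize)> {
    let (batch, rest) = shape.split_first()?;
    Some((*batch, rest.iter().product()))
}
```

Under this model, [200, 3136] passes the check, while [200, 1, 28, 28] and [2, 1, 4] are rejected until they are explicitly flattened, for example to (200, 784).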

Any suggestions on where to look for a fix?

raskr commented 4 years ago

@quietlychris Sorry for the late reply!

I've been working on figuring out why this bug is appearing in the mlp_mnist example

This is because mlp_mnist requires 2D inputs while cnn_mnist requires 4D. I suggested changing mnist_data.rs to return 4D, but a 2D return value may make more sense because MNIST is grayscale (single-channel) data... So I think it's preferable to reshape the 2D input images to (batch, 1, 28, 28) after this line using Array::into_shape.
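To illustrate why that reshape is cheap, here is a std-only sketch that assumes the image buffer is stored in standard row-major order (an assumption about the data layout, not something stated in this thread). The function names offset_2d and offset_4d are hypothetical. Under that assumption, a pixel has the same flat offset in the (batch, 784) view and the (batch, 1, 28, 28) view, so the reshape only reinterprets the shape and moves no data.

```rust
// Image geometry for MNIST: one channel, 28 x 28 pixels.
const C: usize = 1;
const H: usize = 28;
const W: usize = 28;

// Flat offset of element (n, k) in a row-major (batch, 784) buffer.
fn offset_2d(n: usize, k: usize) -> usize {
    n * (C * H * W) + k
}

// Flat offset of element (n, c, h, w) in a row-major (batch, 1, 28, 28) buffer.
fn offset_4d(n: usize, c: usize, h: usize, w: usize) -> usize {
    ((n * C + c) * H + h) * W + w
}
```

With k = h * 28 + w and c = 0, both functions return the same offset for every pixel of every image in the batch.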

For lstm, it's a simple index-out-of-range bug in this process (i+2 exceeds the max sentence length). Changing (0..max_sent) to (0..max_sent-1) should solve this problem!
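The off-by-one can be sketched without the crate. This is a simplified stand-in and not the example's actual code: it assumes that at step i the input is token i and the target is token i + 1, so the target range ends at i + 2. The function input_target_pairs and the token values are hypothetical.

```rust
// Hypothetical stand-in for the per-step slicing in the LSTM example.
// Returns None as soon as any step asks for a range past the sentence end.
fn input_target_pairs(sentence: &[u32], steps: usize) -> Option<Vec<(u32, u32)>> {
    (0..steps)
        .map(|i| {
            // Input is token i.
            let input = sentence.get(i..i + 1)?;
            // Target is token i + 1; `get` yields None when i + 2 > sentence.len().
            let target = sentence.get(i + 1..i + 2)?;
            Some((input[0], target[0]))
        })
        .collect()
}
```

For a sentence of length max_sent, iterating over 0..max_sent fails on the last step because i + 2 equals max_sent + 1, while 0..max_sent-1 stops one step earlier and every range stays in bounds.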

While looking into these bugs, I found that a weakness of this crate is its debuggability...

raskr commented 4 years ago

Fixed in v1.0.1 and master head.