modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Crash on classification with config file #98

Closed lnicola closed 2 years ago

lnicola commented 2 years ago

✅ Inferring train table columns. 2s
✅ Loading train table. 2s
✅ Loading test table. 5s
✅ Shuffling. 0s 628ms
✅ Computing train stats. 9s
✅ Computing test stats. 27s
✅ Finalizing stats. 16s
🏁 Computing baseline metrics. 212389 / 230150 92% 0s 15ms elapsed 0ms remaining
[=======================================================================>      ]
[Thread 0x7ffff7c7e640 (LWP 419555) exited]
thread panicked while panicking. aborting.

Thread 1 "tangram" received signal SIGILL, Illegal instruction.

#0  std::panicking::rust_panic_with_hook () at library/std/src/panicking.rs:621
#1  0x00005555572d34a0 in std::panicking::begin_panic_handler::{closure#0} () at library/std/src/panicking.rs:502
#2  0x00005555572d1944 in std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure#0}, !> () at library/std/src/sys_common/backtrace.rs:139
#3  0x00005555572d3409 in std::panicking::begin_panic_handler () at library/std/src/panicking.rs:498
#4  0x0000555555893a51 in core::panicking::panic_fmt () at library/core/src/panicking.rs:107
#5  0x0000555555893b43 in core::result::unwrap_failed () at library/core/src/result.rs:1613
#6  0x00005555558d6764 in core::result::Result::unwrap<(), std::sync::mpsc::SendError<core::option::Option<tangram_core::progress::ProgressEvent>>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/result.rs:1295
#7  tangram::train::{impl#1}::drop () at crates/cli/train.rs:169
#8  0x00005555559319a9 in core::ptr::drop_in_place<tangram::train::ProgressThread> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ptr/mod.rs:188
#9  core::ptr::drop_in_place<core::option::Option<tangram::train::ProgressThread>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ptr/mod.rs:188
#10 0x000055555593a651 in tangram::train::train::{closure#1} () at crates/cli/train.rs:117
#11 0x00005555558d5c27 in std::panicking::try::do_call<tangram::train::train::{closure#1}, core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406
#12 std::panicking::try<core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>, tangram::train::train::{closure#1}> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370
#13 std::panic::catch_unwind<tangram::train::train::{closure#1}, core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133
#14 tangram::train::train () at crates/cli/train.rs:37
#15 0x000055555590c001 in tangram::main () at crates/cli/main.rs:170

I commented out the stuff in drop and got:

✅ Inferring train table columns. 0s 9ms
✅ Loading train table. 0s 11ms
✅ Loading test table. 0s 35ms
✅ Shuffling. 0s 3ms
✅ Computing train stats. 0s 24ms
✅ Computing test stats. 0s 77ms
✅ Finalizing stats. 0s 50ms
🏁 Computing baseline metrics. 218421 / 230150 95% 0s 15ms elapsed 0ms remaining
[==========================================================================>   ]
error: panicked at 'called `Result::unwrap()` on an `Err` value: SendError { .. }', crates/cli/train.rs:163:14
   0: tangram::train::train::{{closure}}
             at /home/grayshade/tangram/crates/cli/train.rs:34:40
   1: std::panicking::rust_panic_with_hook
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:610:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:502:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:139:18
   4: rust_begin_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
   5: core::panicking::panic_fmt
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
   6: core::result::unwrap_failed
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/result.rs:1613:5
   7: core::result::Result<T,E>::unwrap
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/result.rs:1295:23
      tangram::train::ProgressThread::send_progress_event
             at /home/grayshade/tangram/crates/cli/train.rs:159:3
      tangram::train::train::{{closure}}::{{closure}}
             at /home/grayshade/tangram/crates/cli/train.rs:96:5
   8: tangram_core::train::train_grid_item::{{closure}}
             at /home/grayshade/tangram/crates/core/train.rs:1031:3
   9: tangram_core::train::train_linear_regressor::{{closure}}
             at /home/grayshade/tangram/crates/core/train.rs:1284:3
  10: tangram_linear::multiclass_classifier::MulticlassClassifier::train
             at /home/grayshade/tangram/crates/linear/multiclass_classifier.rs:132:3
  11: tangram_core::train::train_linear_multiclass_classifier
             at /home/grayshade/tangram/crates/core/train.rs:1484:21
      tangram_core::train::train_model
             at /home/grayshade/tangram/crates/core/train.rs:1233:8
      tangram_core::train::train_grid_item
             at /home/grayshade/tangram/crates/core/train.rs:1030:27
      tangram_core::train::Trainer::train_grid::{{closure}}
             at /home/grayshade/tangram/crates/core/train.rs:252:5
      core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:280:13
  12: core::option::Option<T>::map
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/option.rs:846:29
      <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/iter/adapters/map.rs:103:9
      <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/spec_from_iter_nested.rs:23:32
      <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/spec_from_iter.rs:33:9
  13: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/mod.rs:2549:9
      core::iter::traits::iterator::Iterator::collect
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/iter/traits/iterator.rs:1745:9
      tangram_core::train::Trainer::train_grid
             at /home/grayshade/tangram/crates/core/train.rs:246:33
  14: tangram::train::train::{{closure}}
             at /home/grayshade/tangram/crates/cli/train.rs:100:33
  15: std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
      std::panicking::try
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
      std::panic::catch_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
      tangram::train::train
             at /home/grayshade/tangram/crates/cli/train.rs:37:15
  16: tangram::main
             at /home/grayshade/tangram/crates/cli/main.rs:170:30
  17: core::ops::function::FnOnce::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:123:18
  18: std::rt::lang_start::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:145:18
  19: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:259:13
      std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
      std::panicking::try
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
      std::panic::catch_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:128:48
      std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
      std::panicking::try
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
      std::panic::catch_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
      std::rt::lang_start_internal
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:128:20
  20: main
  21: __libc_start_call_main
  22: __libc_start_main@GLIBC_2.2.5
  23: _start

lnicola commented 2 years ago

At first glance, I don't see how the sender can be None on drop, but I'm probably missing something. Anyway, this happens on one dataset I have when I specify both --file-train and --file-test along with a config file.

isabella commented 2 years ago

And it is not happening when you just pass a --file-train and --file-test without a config file?

lnicola commented 2 years ago

Yeah, but then it trains the wrong thing (as per my previous question from today). I suppose there's something wrong with my config file. At least I have the same classes in both the training and validation files.

isabella commented 2 years ago

It's just bizarre that it's a SIGILL. What version of tangram are you using?

lnicola commented 2 years ago

Both the latest release and a git build fail in the same way. SIGILL is an abort, because a thread panicked while panicking (I updated my original comment).

EDIT: I removed my comment below. The test command line was tangram train --file-train training_ss.csv --file-test validation_ss.csv -t CTnumL4A -o t.tangram --config config.json.
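
The abort described above follows from the first backtrace: `drop` for `ProgressThread` calls `unwrap()` on the result of a channel send, and that `drop` runs while an earlier panic is already unwinding. If the receiving side is gone, the `unwrap()` panics a second time and the process aborts. The following is a minimal sketch of a `Drop` implementation that tolerates a closed channel. `ProgressHandle`, `Event`, and `send_event` are hypothetical names for illustration, not the project's actual types.

```rust
use std::sync::mpsc::{channel, SendError, Sender};

// Hypothetical event type: `Some(value)` is a progress update, `None` means "shut down".
type Event = Option<u32>;

// Hypothetical stand-in for a wrapper that owns the sending half of a progress channel.
struct ProgressHandle {
    sender: Sender<Event>,
}

impl ProgressHandle {
    // Return the send error to the caller instead of unwrapping it here.
    fn send_event(&self, value: u32) -> Result<(), SendError<Event>> {
        self.sender.send(Some(value))
    }
}

impl Drop for ProgressHandle {
    fn drop(&mut self) {
        // The receiver may already be gone (for example, its thread panicked).
        // Ignoring the error avoids panicking inside `drop`, which would abort
        // the process if another panic is already unwinding.
        let _ = self.sender.send(None);
    }
}

fn main() {
    let (sender, receiver) = channel();
    let handle = ProgressHandle { sender };

    // Simulate the receiving side disappearing.
    drop(receiver);

    // Sending now fails, but the failure is reported as a value.
    assert!(handle.send_event(1).is_err());

    // Dropping the handle does not panic even though the channel is closed.
    drop(handle);
    println!("shut down without a second panic");
}
```

Ignoring the error in `drop` is one option among several; checking `std::thread::panicking()` before doing fallible work in `drop` is another.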

isabella commented 2 years ago

I meant that the format of your config file shouldn't cause it. We can try to debug this over a video call. Can you join our Discord? https://discord.gg/fqyvVMsJ

isabella commented 2 years ago

No problem, I am able to reproduce this on my end. I'll start digging in and let you know what I find.

isabella commented 2 years ago

There is definitely an issue with the progress bar. I'm looking into this more. In the meantime, you can train a model by passing the --no-progress flag.

isabella commented 2 years ago

Hi @lnicola, I fixed the issue. The progress bar was using train_row_count as its total when the total should have been test_row_count. As a result, a value that we assumed to be positive became negative: the ETA was negative, and the following line caused the panic: https://github.com/tangramdotdev/tangram/blob/47340b8de905399912dfb4d181e9a45025c403c8/crates/cli/train.rs#L605

The issue is fixed on the main branch. The same bug should have been hit with regression, but because the progress bar draws on a timer and the regression code path was faster, the progress bar never got a chance to draw, so that code path was not hit.
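
To illustrate the kind of failure described above, here is a minimal sketch of an ETA calculation that stays well-defined when the supplied total is wrong. It is not the code at the linked line; `estimate_remaining` and its parameters are hypothetical, and it assumes that a wrong total lets the progress counter exceed it. For context, `Duration::from_secs_f64` in the Rust standard library panics when given a negative value, which is one way a negative ETA can turn into a panic.

```rust
use std::time::Duration;

/// Hypothetical helper, not ModelFox's actual code: estimate the time remaining
/// from a progress counter. Returns `None` when no estimate is possible.
fn estimate_remaining(current: u64, total: u64, elapsed: Duration) -> Option<Duration> {
    if current == 0 {
        // No items processed yet, so there is no rate to extrapolate from.
        return None;
    }
    // Assumption: with a wrong total, `current` may exceed `total`.
    // `saturating_sub` clamps the remaining count to zero instead of underflowing.
    let remaining = total.saturating_sub(current);
    let secs_per_item = elapsed.as_secs_f64() / current as f64;
    // `try_from_secs_f64` returns an error instead of panicking on
    // negative, non-finite, or overflowing input.
    Duration::try_from_secs_f64(secs_per_item * remaining as f64).ok()
}

fn main() {
    let elapsed = Duration::from_secs(8);
    // Normal case: 4 of 10 items done.
    println!("{:?}", estimate_remaining(4, 10, elapsed));
    // Mismatched total: the counter has passed the total.
    println!("{:?}", estimate_remaining(12, 10, elapsed));
}
```

Clamping only hides a mismatched total from the drawing code; the underlying fix in the thread was to pass the correct row count.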

lnicola commented 2 years ago

Thanks, it's working now. I just had to make a small change for it to build:

diff --git i/crates/cli/train.rs w/crates/cli/train.rs
index 874d77c..cc7836b 100644
--- i/crates/cli/train.rs
+++ w/crates/cli/train.rs
@@ -97,8 +97,7 @@ pub fn train(args: TrainArgs) -> Result<()> {
                        }
                };
                let kill_chip = unsafe { ctrl_c::register_ctrl_c_handler()? };
-               let train_grid_item_outputs =
-                       trainer.train_grid(Some(kill_chip), &mut handle_progress_event)?;
+               let train_grid_item_outputs = trainer.train_grid(kill_chip, &mut handle_progress_event)?;
                unsafe { ctrl_c::unregister_ctrl_c_handler()? };
                if kill_chip.is_activated() {
                        if let Some(progress_thread) = progress_thread.as_mut() {

isabella commented 2 years ago

Yes, my bad! Thank you :)