modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Crash on classification with config file #98

Closed lnicola closed 2 years ago

lnicola commented 2 years ago
✅ Inferring train table columns. 2s
✅ Loading train table. 2s
✅ Loading test table. 5s
✅ Shuffling. 0s 628ms
✅ Computing train stats. 9s
✅ Computing test stats. 27s
✅ Finalizing stats. 16s
🏁 Computing baseline metrics. 212389 / 230150 92% 0s 15ms elapsed 0ms remaining
[=======================================================================>      ]
[Thread 0x7ffff7c7e640 (LWP 419555) exited]
thread panicked while panicking. aborting.

Thread 1 "tangram" received signal SIGILL, Illegal instruction.

#0  std::panicking::rust_panic_with_hook () at library/std/src/
#1  0x00005555572d34a0 in std::panicking::begin_panic_handler::{closure#0} () at library/std/src/
#2  0x00005555572d1944 in std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::{closure#0}, !> () at library/std/src/sys_common/
#3  0x00005555572d3409 in std::panicking::begin_panic_handler () at library/std/src/
#4  0x0000555555893a51 in core::panicking::panic_fmt () at library/core/src/
#5  0x0000555555893b43 in core::result::unwrap_failed () at library/core/src/
#6  0x00005555558d6764 in core::result::Result::unwrap<(), std::sync::mpsc::SendError<core::option::Option<tangram_core::progress::ProgressEvent>>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/
#7  tangram::train::{impl#1}::drop () at crates/cli/
#8  0x00005555559319a9 in core::ptr::drop_in_place<tangram::train::ProgressThread> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ptr/
#9  core::ptr::drop_in_place<core::option::Option<tangram::train::ProgressThread>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ptr/
#10 0x000055555593a651 in tangram::train::train::{closure#1} () at crates/cli/
#11 0x00005555558d5c27 in std::panicking::try::do_call<tangram::train::train::{closure#1}, core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
#12 std::panicking::try<core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>, tangram::train::train::{closure#1}> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
#13 std::panic::catch_unwind<tangram::train::train::{closure#1}, core::result::Result<(tangram_core::model::Model, std::path::PathBuf), anyhow::Error>> () at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
#14 tangram::train::train () at crates/cli/
#15 0x000055555590c001 in tangram::main () at crates/cli/

I commented out the stuff in drop and got:

✅ Inferring train table columns. 0s 9ms
✅ Loading train table. 0s 11ms
✅ Loading test table. 0s 35ms
✅ Shuffling. 0s 3ms
✅ Computing train stats. 0s 24ms
✅ Computing test stats. 0s 77ms
✅ Finalizing stats. 0s 50ms
🏁 Computing baseline metrics. 218421 / 230150 95% 0s 15ms elapsed 0ms remaining
[==========================================================================>   ]
error: panicked at 'called `Result::unwrap()` on an `Err` value: SendError { .. }', crates/cli/
   0: tangram::train::train::{{closure}}
             at /home/grayshade/tangram/crates/cli/
   1: std::panicking::rust_panic_with_hook
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/
   4: rust_begin_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
   5: core::panicking::panic_fmt
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/
   6: core::result::unwrap_failed
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/
   7: core::result::Result<T,E>::unwrap
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/
             at /home/grayshade/tangram/crates/cli/
             at /home/grayshade/tangram/crates/cli/
   8: tangram_core::train::train_grid_item::{{closure}}
             at /home/grayshade/tangram/crates/core/
   9: tangram_core::train::train_linear_regressor::{{closure}}
             at /home/grayshade/tangram/crates/core/
  10: tangram_linear::multiclass_classifier::MulticlassClassifier::train
             at /home/grayshade/tangram/crates/linear/
  11: tangram_core::train::train_linear_multiclass_classifier
             at /home/grayshade/tangram/crates/core/
             at /home/grayshade/tangram/crates/core/
             at /home/grayshade/tangram/crates/core/
             at /home/grayshade/tangram/crates/core/
      core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/
  12: core::option::Option<T>::map
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/
      <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/iter/adapters/
      <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/
      <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/
  13: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/vec/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/iter/traits/
             at /home/grayshade/tangram/crates/core/
  14: tangram::train::train::{{closure}}
             at /home/grayshade/tangram/crates/cli/
  15: std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /home/grayshade/tangram/crates/cli/
  16: tangram::main
             at /home/grayshade/tangram/crates/cli/
  17: core::ops::function::FnOnce::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/
  18: std::rt::lang_start::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
  19: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/
  20: main
  21: __libc_start_call_main
  22: __libc_start_main@GLIBC_2.2.5
  23: _start
lnicola commented 2 years ago

At first glance, I don't see how the sender can be None on drop, but I'm probably missing something. In any case, this happens on one dataset I have when I specify both --file-train and --file-test along with a config file.

isabella commented 2 years ago

And it is not happening when you just pass a --file-train and --file-test without a config file?

lnicola commented 2 years ago

Yeah, but then it trains the wrong thing (as per my previous question from today). I suppose there's something wrong with my config file. I have the same classes in both the training and validation file, at least.

isabella commented 2 years ago

It's just bizarre that it's a SIGILL. Which version of tangram are you using?

lnicola commented 2 years ago

Both the latest release and a git build fail in the same way. SIGILL is an abort, because a thread panicked while panicking (I updated my original comment).
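
For readers unfamiliar with the "panicked while panicking" abort: in Rust, if a destructor panics while the thread is already unwinding from an earlier panic, the process aborts. The backtrace above shows an unwrap on a channel send inside a drop implementation. Below is a minimal sketch of a drop that tolerates a dead receiver instead. `ProgressHandle` and `send_fails_after_receiver_drop` are hypothetical names for illustration only and are not the project's actual code.

```rust
use std::sync::mpsc::{channel, Sender};

// Hypothetical stand-in for a handle to a progress-drawing thread.
// `None` on the channel is used here as a "please stop" message.
struct ProgressHandle {
    sender: Option<Sender<Option<u64>>>,
}

impl Drop for ProgressHandle {
    fn drop(&mut self) {
        if let Some(sender) = self.sender.take() {
            // If the receiving thread has already exited (for example because
            // it panicked), `send` returns Err. Ignoring the error here avoids
            // a second panic while the current thread may already be unwinding.
            let _ = sender.send(None);
        }
    }
}

/// Returns true if sending fails once the receiver has been dropped.
fn send_fails_after_receiver_drop() -> bool {
    let (sender, receiver) = channel::<Option<u64>>();
    drop(receiver);
    sender.send(None).is_err()
}

fn main() {
    // Sending to a channel whose receiver is gone yields an error...
    assert!(send_fails_after_receiver_drop());

    // ...so a drop that unwraps the result would panic. This one does not.
    let (sender, receiver) = channel::<Option<u64>>();
    drop(receiver); // simulate the progress thread having died
    let handle = ProgressHandle { sender: Some(sender) };
    drop(handle);
    println!("dropped without panicking");
}
```

Ignoring the send error in drop only hides the second panic; the first panic still has to be fixed at its source.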

EDIT: I removed my comment below. The test command line was tangram train --file-train training_ss.csv --file-test validation_ss.csv -t CTnumL4A -o t.tangram --config config.json.

isabella commented 2 years ago

I meant that the format of your config file shouldn't cause it. We can try to debug this over a video call. Can you join our Discord?

isabella commented 2 years ago

No problem, I am able to reproduce this on my end. I'll start digging in and let you know what I find.

isabella commented 2 years ago

There is definitely an issue with the progress bar. I'm looking into this more. In the meantime, you can train a model by passing the flag --no-progress.

isabella commented 2 years ago

Hi @lnicola, I fixed the issue. The problem was that we were using train_row_count as the total for the progress bar when the total was in fact test_row_count. As a result, a value that we assumed to be positive was negative: the ETA was negative, and the following line caused the panic.

The issue is fixed on the main branch. The same bug should have been hit with regression, but because progress draws on a timer and the regression code path was faster, the progress bar didn't get a chance to draw, so that code path was not hit.
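
To illustrate how a wrong total can break an ETA calculation, here is a minimal sketch of a defensive ETA helper. It is an assumption about the general mechanism, not the project's actual code: `eta` and its parameters are hypothetical. One known way for such a computation to panic in Rust is `Duration::from_secs_f64`, which panics on a negative input; unsigned subtraction that underflows is another. The sketch avoids both.

```rust
use std::time::Duration;

/// Hypothetical ETA helper: estimates the time remaining from the elapsed
/// time and the number of items processed so far.
fn eta(elapsed: Duration, current: u64, total: u64) -> Option<Duration> {
    if current == 0 {
        // No progress yet, so there is no rate to extrapolate from.
        return None;
    }
    // If `total` is wrong and smaller than `current` (for example a train row
    // count used where a test row count was needed), clamp to zero instead of
    // underflowing or producing a negative value.
    let remaining = total.saturating_sub(current);
    let secs_per_item = elapsed.as_secs_f64() / current as f64;
    // `try_from_secs_f64` returns Err instead of panicking on negative,
    // non-finite, or overflowing inputs.
    Duration::try_from_secs_f64(secs_per_item * remaining as f64).ok()
}

fn main() {
    // Halfway through: as much time remaining as has elapsed.
    println!("{:?}", eta(Duration::from_secs(10), 5, 10));
    // Progress exceeds the (wrong) total: clamped to zero rather than a panic.
    println!("{:?}", eta(Duration::from_secs(10), 20, 10));
}
```

Clamping keeps the progress bar from crashing the training run, but it would also mask a wrong total, so the total itself still needs to be the right row count.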

lnicola commented 2 years ago

Thanks, it's working now. I just had to make a small change for it to build:

diff --git i/crates/cli/ w/crates/cli/
index 874d77c..cc7836b 100644
--- i/crates/cli/
+++ w/crates/cli/
@@ -97,8 +97,7 @@ pub fn train(args: TrainArgs) -> Result<()> {
                let kill_chip = unsafe { ctrl_c::register_ctrl_c_handler()? };
-               let train_grid_item_outputs =
-                       trainer.train_grid(Some(kill_chip), &mut handle_progress_event)?;
+               let train_grid_item_outputs = trainer.train_grid(kill_chip, &mut handle_progress_event)?;
                unsafe { ctrl_c::unregister_ctrl_c_handler()? };
                if kill_chip.is_activated() {
                        if let Some(progress_thread) = progress_thread.as_mut() {
isabella commented 2 years ago

Yes, my bad! Thank you :)