guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0

Evaluation fails when trying to extract keywords from a specific sentence #430

Open edoust opened 11 months ago

edoust commented 11 months ago

I am trying to extract keywords from sentences using the all-MiniLM-L6-v2 model.

Keyword extraction fails whenever the input contains this specific sentence, either alone or in combination with other sentences:

Up 3 Up 4 Down 2 Up 7 Up 2 Down 4 Down 4 Up 6 Up 1 Down 1 Down 3

I know this may not be a meaningful sentence, but it should not prevent all the other sentences from being evaluated.

Is there a way to fix this, or to know which sentences would fail during evaluation?

This is my repro sample: crash-repro-keywords.zip

It contains this code snippet for evaluating the sentence:

use rust_bert::pipelines::keywords_extraction::{
    KeywordExtractionConfig, KeywordExtractionModel, KeywordScorerType,
};
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsConfig, SentenceEmbeddingsModelType,
};

// The single input sentence that triggers the failure.
let input_strings = ["Up 3 Up 4 Down 2 Up 7 Up 2 Down 4 Down 4 Up 6 Up 1 Down 1 Down 3"].to_vec();

let keyword_extraction_config = KeywordExtractionConfig {
    sentence_embeddings_config: SentenceEmbeddingsConfig::from(SentenceEmbeddingsModelType::AllMiniLmL6V2),
    max_sum_candidates: Some(20),
    diversity: Some(0.3),
    scorer_type: KeywordScorerType::MaximalMarginRelevance,
    ngram_range: (1, 1),
    num_keywords: 6,
    ..Default::default()
};

let keyword_extraction_model = KeywordExtractionModel::new(keyword_extraction_config).unwrap();

// This call panics with the error shown below.
let output = keyword_extraction_model.predict(&input_strings).unwrap();

This is the error that is printed:

`thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
Torch("stack expects a non-empty TensorList
Exception raised from stack at C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\TensorShape.cpp:2659 (most recent call first):
00007FFCBC57D24200007FFCBC57D1E0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFCBC57CE1A00007FFCBC57CDC0 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
00007FFC83A8296400007FFC83A82900 torch_cpu.dll!at::native::stack [<unknown file> @ <unknown line number>]
00007FFC845E262B00007FFC845DE0B0 torch_cpu.dll!at::compositeexplicitautograd::view_copy_symint_outf [<unknown file> @ <unknown line number>]
00007FFC845BD46100007FFC84578730 torch_cpu.dll!at::compositeexplicitautograd::bucketize_outf [<unknown file> @ <unknown line number>]
00007FFC83FA845600007FFC83FA82B0 torch_cpu.dll!at::_ops::stack::call [<unknown file> @ <unknown line number>]
00007FF68E5A4C8E00007FF68E5A4C50 crash-repro-keywords.exe!at::stack [C:\\temp\\libtorch\\include\\ATen\\ops\\stack.h @ 27]
00007FF68E544D1700007FF68E544CA0 crash-repro-keywords.exe!atg_stack [C:\\Users\\usr1\\.cargo\\registry\\src\\index.crates.io-6f17d22bba15001f\\torch-sys-0.13.0\\libtch\\torch_api_generated.cpp @ 16438]
00007FF68E37827A00007FF68E378180 crash-repro-keywords.exe!tch::wrappers::tensor::Tensor::f_stack<tch::wrappers::tensor::Tensor> [C:\\Users\\usr1\\.cargo\\registry\\src\\index.crates.io-6f17d22bba15001f\\tch-0.13.0\\src\\wrappers\\tensor_fallible_generated.rs @ 33246]
00007FF68E37A65600007FF68E37A630 crash-repro-keywords.exe!tch::wrappers::tensor::Tensor::stack<tch::wrappers::tensor::Tensor> [C:\\Users\\usr1\\.cargo\\registry\\src\\index.crates.io-6f17d22bba15001f\\tch-0.13.0\\src\\wrappers\\tensor_generated.rs @ 16878]
00007FF68D760C3500007FF68D760B80 crash-repro-keywords.exe!rust_bert::pipelines::sentence_embeddings::pipeline::SentenceEmbeddingsModel::encode_as_tensor<ref$<enum2$<alloc::borrow::Cow<str$> > > > [C:\\Users\\usr1\\.cargo\\registry\\src\\index.crates.io-6f17d22bba15001f\\rust-bert-0.21.0\\src\\pipelines\\sentence_embeddings\\pipeline.rs @ 353]
00007FF68D72141C00007FF68D7211F0 crash-repro-keywords.exe!rust_bert::pipelines::keywords_extraction::pipeline::KeywordExtractionModel::predict<ref$<str$> > [C:\\Users\\usr1\\.cargo\\registry\\src\\index.crates.io-6f17d22bba15001f\\rust-bert-0.21.0\\src\\pipelines\\keywords_extraction\\pipeline.rs @ 230]
00007FF68D7A891F00007FF68D7A8610 crash-repro-keywords.exe!crash_repro_keywords::get_keywords [E:\\_local\\crash-repro-keywords\\src\\main.rs @ 36]
00007FF68D7A856F00007FF68D7A8520 crash-repro-keywords.exe!crash_repro_keywords::main [E:\\_local\\crash-repro-keywords\\src\\main.rs @ 9]
00007FF68D7DB90B00007FF68D7DB900 crash-repro-keywords.exe!core::ops::function::FnOnce::call_once<void (*)(),tuple$<> > [/rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\\library\\core\\src\\ops\\function.rs @ 250]
00007FF68D7A280E00007FF68D7A2800 crash-repro-keywords.exe!std::sys_common::backtrace::__rust_begin_short_backtrace<void (*)(),tuple$<> > [/rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\\library\\std\\src\\sys_common\\backtrace.rs @ 138]
00007FF68D7BE6E100007FF68D7BE6D0 crash-repro-keywords.exe!std::rt::lang_start::closure$0<tuple$<> > [/rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\\library\\std\\src\\rt.rs @ 166]
00007FF68E47F4A800007FF68E47F3F0 crash-repro-keywords.exe!std::rt::lang_start_internal [/rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\\std\\src\\rt.rs @ 148]
00007FF68D7BE6BA00007FF68D7BE680 crash-repro-keywords.exe!std::rt::lang_start<tuple$<> > [/rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\\library\\std\\src\\rt.rs @ 165]
00007FF68D7A8FB900007FF68D7A8FA0 crash-repro-keywords.exe!main [<unknown file> @ <unknown line number>]
00007FF68E6B41CC00007FF68E6B40C0 crash-repro-keywords.exe!__scrt_common_main_seh [D:\\a\\_work\\1\\s\\src\\vctools\\crt\\vcstartup\\src\\startup\\exe_common.inl @ 288]
00007FFDE131257D00007FFDE1312560 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFDE2EAAA7800007FFDE2EAAA50 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]
")', C:\Users\usr1\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tch-0.13.0\src\wrappers\tensor_generated.rs:16878:39`

This is the stack trace:

stack backtrace:
   0:     0x7ff68e48b9cc - std::sys_common::backtrace::_print::impl$0::fmt
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\sys_common\backtrace.rs:44
   1:     0x7ff68e4a942b - core::fmt::rt::Argument::fmt
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\core\src\fmt\rt.rs:138
   2:     0x7ff68e4a942b - core::fmt::write
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\core\src\fmt\mod.rs:1094
   3:     0x7ff68e48657f - std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\io\mod.rs:1714
   4:     0x7ff68e48b77b - std::sys_common::backtrace::_print
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\sys_common\backtrace.rs:47
   5:     0x7ff68e48b77b - std::sys_common::backtrace::print
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\sys_common\backtrace.rs:34
   6:     0x7ff68e48df7a - std::panicking::default_hook::closure$1
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:269
   7:     0x7ff68e48dbcf - std::panicking::default_hook
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:288
   8:     0x7ff68e48e62e - std::panicking::rust_panic_with_hook
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:705
   9:     0x7ff68e48e51d - std::panicking::begin_panic_handler::closure$0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:597
  10:     0x7ff68e48c349 - std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::closure_env$0,never$>
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\sys_common\backtrace.rs:151
  11:     0x7ff68e48e220 - std::panicking::begin_panic_handler
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:593
  12:     0x7ff68e6b6a85 - core::panicking::panic_fmt
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\core\src\panicking.rs:67
  13:     0x7ff68e6b7093 - core::result::unwrap_failed
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\core\src\result.rs:1651
  14:     0x7ff68e38507b - enum2$<core::result::Result<tch::wrappers::tensor::Tensor,enum2$<tch::error::TchError> > >::unwrap<tch::wrappers::tensor::Tensor,enum2$<tch::error::TchError> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\core\src\result.rs:1076
  15:     0x7ff68e37a667 - tch::wrappers::tensor::Tensor::stack<tch::wrappers::tensor::Tensor>
                               at C:\Users\usr1\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tch-0.13.0\src\wrappers\tensor_generated.rs:16878
  16:     0x7ff68d760c35 - rust_bert::pipelines::sentence_embeddings::pipeline::SentenceEmbeddingsModel::encode_as_tensor<ref$<enum2$<alloc::borrow::Cow<str$> > > >
                               at C:\Users\usr1\.cargo\registry\src\index.crates.io-6f17d22bba15001f\rust-bert-0.21.0\src\pipelines\sentence_embeddings\pipeline.rs:353
  17:     0x7ff68d72141c - rust_bert::pipelines::keywords_extraction::pipeline::KeywordExtractionModel::predict<ref$<str$> >
                               at C:\Users\usr1\.cargo\registry\src\index.crates.io-6f17d22bba15001f\rust-bert-0.21.0\src\pipelines\keywords_extraction\pipeline.rs:230
  18:     0x7ff68d7a891f - crash_repro_keywords::get_keywords
                               at E:\_local\crash-repro-keywords\src\main.rs:36
  19:     0x7ff68d7a856f - crash_repro_keywords::main
                               at E:\_local\crash-repro-keywords\src\main.rs:9
  20:     0x7ff68d7db90b - core::ops::function::FnOnce::call_once<void (*)(),tuple$<> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\core\src\ops\function.rs:250
  21:     0x7ff68d7a280e - std::sys_common::backtrace::__rust_begin_short_backtrace<void (*)(),tuple$<> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\std\src\sys_common\backtrace.rs:135
  22:     0x7ff68d7a280e - std::sys_common::backtrace::__rust_begin_short_backtrace<void (*)(),tuple$<> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\std\src\sys_common\backtrace.rs:135
  23:     0x7ff68d7be6e1 - std::rt::lang_start::closure$0<tuple$<> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\std\src\rt.rs:166
  24:     0x7ff68e47f4a8 - std::rt::lang_start_internal::closure$2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\rt.rs:148
  25:     0x7ff68e47f4a8 - std::panicking::try::do_call
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:500
  26:     0x7ff68e47f4a8 - std::panicking::try
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panicking.rs:464
  27:     0x7ff68e47f4a8 - std::panic::catch_unwind
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\panic.rs:142
  28:     0x7ff68e47f4a8 - std::rt::lang_start_internal
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library\std\src\rt.rs:148
  29:     0x7ff68d7be6ba - std::rt::lang_start<tuple$<> >
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3\library\std\src\rt.rs:165
  30:     0x7ff68d7a8fb9 - main
  31:     0x7ff68e6b41cc - invoke_main
                               at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
  32:     0x7ff68e6b41cc - __scrt_common_main_seh
                               at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
  33:     0x7ffde131257d - BaseThreadInitThunk
  34:     0x7ffde2eaaa78 - RtlUserThreadStart

guillaume-be commented 11 months ago

Hello @edoust,

I am unable to reproduce this. Running the code shared above gives:

[[Keyword { text: "4", score: 0.4896546, offsets: [Offset { begin: 8, end: 9 }, Offset { begin: 32, end: 33 }, Offset { begin: 39, end: 40 }] },
  Keyword { text: "6", score: 0.43537995, offsets: [Offset { begin: 44, end: 45 }] },
  Keyword { text: "3", score: 0.42764783, offsets: [Offset { begin: 3, end: 4 }, Offset { begin: 63, end: 64 }] },
  Keyword { text: "7", score: 0.4275967, offsets: [Offset { begin: 20, end: 21 }] },
  Keyword { text: "1", score: 0.38711178, offsets: [Offset { begin: 49, end: 50 }, Offset { begin: 56, end: 57 }] },
  Keyword { text: "2", score: 0.34410587, offsets: [Offset { begin: 15, end: 16 }, Offset { begin: 25, end: 26 }] }]]
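
To read this output: each offset appears to be a half-open `[begin, end)` range into the input sentence. The sketch below checks that reading against a few of the values printed above. `slice_by_offset` is a hypothetical helper, not part of rust-bert, and it assumes offsets count characters (for this ASCII sentence, characters and bytes coincide).

```rust
/// Hypothetical helper: returns the part of `text` covered by [begin, end).
/// Assumes offsets count characters, which equals bytes for ASCII input.
fn slice_by_offset(text: &str, begin: usize, end: usize) -> String {
    text.chars().skip(begin).take(end - begin).collect()
}

fn main() {
    let sentence = "Up 3 Up 4 Down 2 Up 7 Up 2 Down 4 Down 4 Up 6 Up 1 Down 1 Down 3";
    // (keyword text, offsets) copied from the output above.
    let reported = [
        ("4", vec![(8, 9), (32, 33), (39, 40)]),
        ("6", vec![(44, 45)]),
        ("3", vec![(3, 4), (63, 64)]),
    ];
    for (text, offsets) in &reported {
        for &(begin, end) in offsets {
            // Every reported range should slice out the keyword itself.
            assert_eq!(slice_by_offset(sentence, begin, end), *text);
        }
    }
    println!("all checked offsets match their keyword text");
}
```

All extracted keywords are the digits; neither "Up" nor "Down" is among the six keywords returned.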

edoust commented 11 months ago

Hey @guillaume-be

I noticed this does not happen when I check out this repo and run the sentence in question through the keyword extraction example.

However, when I run my code in the example project I provided, it fails with the error mentioned above.

Did you test it with my project? Also, I was running it on Windows.
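
On the earlier question of how to tell which sentences fail: the backtrace shows the panic coming from an `unwrap` inside `tch`'s `Tensor::stack`, so matching on the `Result` returned by `predict` would not catch it. One possible workaround, sketched below under stated assumptions, is to evaluate each sentence separately inside `std::panic::catch_unwind`. `partition_by_panic` and `fake_extract` are hypothetical names. `fake_extract` is only a stand-in that panics on some inputs so the example runs without the model. It does not reflect the real cause of the failure.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `process` on each sentence separately and splits the inputs into
/// (sentence, result) pairs that succeeded and sentences that panicked.
fn partition_by_panic<'a, R>(
    sentences: &[&'a str],
    process: impl Fn(&str) -> R,
) -> (Vec<(&'a str, R)>, Vec<&'a str>) {
    let mut succeeded = Vec::new();
    let mut failed = Vec::new();
    for &sentence in sentences {
        // AssertUnwindSafe: the result of a panicking call is discarded,
        // so no partially updated state is observed afterwards.
        match panic::catch_unwind(AssertUnwindSafe(|| process(sentence))) {
            Ok(result) => succeeded.push((sentence, result)),
            Err(_) => failed.push(sentence),
        }
    }
    (succeeded, failed)
}

// Hypothetical stand-in for the model call: keeps words longer than four
// characters and panics when none remain. NOT the real rust-bert logic.
fn fake_extract(sentence: &str) -> Vec<String> {
    let candidates: Vec<String> = sentence
        .split_whitespace()
        .filter(|word| word.len() > 4)
        .map(str::to_string)
        .collect();
    assert!(!candidates.is_empty(), "stack expects a non-empty TensorList");
    candidates
}

fn main() {
    let inputs = ["Rust is a systems language", "Up 3 Up 4 Down 2"];
    let (ok, failed) = partition_by_panic(&inputs, fake_extract);
    println!("evaluated: {ok:?}");
    println!("failed: {failed:?}");
}
```

With the real pipeline, the closure would presumably be something like `|s| keyword_extraction_model.predict(&[s])`. Note that `catch_unwind` only works with the default `panic = "unwind"` setting, that the default panic hook still prints each panic message to stderr, and that one call per sentence gives up batching.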