messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust
MIT License
738 stars 46 forks source link

`index out of bounds` when using custom dictionary #65

Closed visig9 closed 4 years ago

visig9 commented 5 years ago

Test Case

use jieba_rs::Jieba;

fn main() {
    use std::{fs::File, io::BufReader};
    let mut dict = BufReader::new(File::open("/home/goodguy/dict.txt.big").unwrap());
    let jieba = Jieba::with_dict(&mut dict).unwrap();
    println!("{:?}", jieba.cut("Firefox全球市佔率為35%至40%,為全球第二流行的網頁瀏覽器[17][18][19][20][21]。Firefox在某些國家還是最流行的網頁瀏覽器,如在薩摩亞、德國、厄利垂亞及古巴,Firefox市佔率分別為61.05%、38.36%、79.39%及85.93%。據Mozilla統計,截至2014年12月,Firefox在全世界擁有10億使用者[22]。", false));
}

The dict.txt.big can fetch from https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big

Result

thread 'main' panicked at 'index out of bounds: the len is 33908 but the index is 67813', /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695:10
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:211
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:61
   9: <usize as core::slice::SliceIndex<[T]>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695
  10: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2552
  11: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/liballoc/vec.rs:1687
  12: jieba_rs::Jieba::calc::{{closure}}
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:359
  13: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/ops/function.rs:279
  14: core::option::Option<T>::map
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/option.rs:416
  15: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/adapters/mod.rs:570
  16: core::iter::traits::iterator::select_fold1
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2599
  17: core::iter::traits::iterator::Iterator::max_by
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2082
  18: jieba_rs::Jieba::calc
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:349
  19: jieba_rs::Jieba::cut_dag_no_hmm
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:421
  20: jieba_rs::Jieba::cut_internal
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:572
  21: jieba_rs::Jieba::cut
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:612
  22: testproj::main
             at src/main.rs:7
  23: std::rt::lang_start::{{closure}}
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  24: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:293
  25: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:85
  26: std::rt::lang_start_internal
             at src/libstd/panicking.rs:272
             at src/libstd/panic.rs:394
             at src/libstd/rt.rs:48
  27: std::rt::lang_start
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  28: main
  29: __libc_start_main
  30: _start

Clue

Look into the dict.txt.big...

些許端 2 m
些许 163 d
些许端 2 m
些須 3 d
些须 3 d
亜 20 zg
亝 6 zg
亞 13 zg
亞 5789 j
亞丁 57 nrt
亞丁港 4 ns
亞丁灣 37 ns
亞世 6 nrt
亞世達 2 nr
亞乃夫 2 nr
亞之傑 3 nr
亞乙基 2 nr
亞伊傑 2 nr
亞伯 2 ns
亞伯奎 6 nr
亞伯拉罕 19 nrt

... when I delete all the lines after 亞 5789 j (line number 33908), the panic disappeared. The result look like following:

["Firefox", "全", "球", "市", "佔", "率", "為", "35", "%", "至", "40", "%", ",", "為", "全", "球", "第", "二流", "行", "的", "網", "頁", "瀏", "覽", "器", "[", "17", "]", "[", "18", "]", "[", "19", "]", "[", "20", "]", "[", "21", "]", "。", "Firefox", "在", "某", "些", "國", "家", "還", "是", "最", "流", "行", "的", "網", "頁", "瀏", "覽", "器", ",", "如", "在", "薩", "摩", "亞", "、", "德", "國", "、", "厄", "利", "垂", "亞", "及", "古", "巴", ",", "Firefox", "市", "佔", "率", "分", "別", "為", "61", ".", "05", "%", "、", "38", ".", "36", "%", "、", "79", ".", "39", "%", "及", "85", ".", "93", "%", "。", "據", "Mozilla", "統", "計", ",", "截", "至", "2014", "年", "12", "月", ",", "Firefox", "在", "全", "世界", "擁", "有", "10", "億", "使", "用", "者", "[", "22", "]", "。"]

Env

messense commented 4 years ago

I think this is caused by the fact that we didn't consider that there would be duplications in the dict.

messense commented 4 years ago

cc @MnO2

MnO2 commented 4 years ago

ok, I can have a look.

messense commented 4 years ago

@MnO2

https://github.com/messense/jieba-rs/blob/4fd483c0580528734e01328862d3475f98fb834e/src/lib.rs#L251-L256

For with_dict method, this line of code in the load_dict method:

https://github.com/messense/jieba-rs/blob/4fd483c0580528734e01328862d3475f98fb834e/src/lib.rs#L313

will always return None since self.cedar isn't built yet:

https://github.com/messense/jieba-rs/blob/4fd483c0580528734e01328862d3475f98fb834e/src/lib.rs#L333-L334

MnO2 commented 4 years ago

It is related to how we expect the load_dict should be called, and its semantic.

If you call load_dict and call it again, exact_match_search shouldn't always return None.

Since the method is public, the method could be called again and again. Then there are a few approaches we could choose.

  1. Merge the dictionaries, and resolve the conflict by choosing the frequencies appear in the dictionary called later.
  2. Merge the dictionaries, but resort to other conflict resolution.
  3. Throw everything from the previous dictionaries away and use everything in the last call.

For the code we have right now it is 1, though from a brief overlook I couldn't figure out why it causes panic, and why your PR by applying self.cedar.update(word, word_id); in the for loop has fixed the issue. Since update should already be covered by build in cedarwood, which is basically a for loop for update.

/// Build the double array trie from the given key value pairs
    #[allow(dead_code)]
    pub fn build(&mut self, key_values: &[(&str, i32)]) {
        for (key, value) in key_values {
            self.update(key, *value);
        }
    }

I guess there might be hidden bugs in cedarwood, I need to spend more time on it, but in the meantime I am good with the PR though I am not exactly sure why it fixed the panic,

MnO2 commented 4 years ago

https://github.com/MnO2/cedarwood/commit/858e9033395ad0cb5aa86347958cb7626c4446d4 This commit should fix the issue.