hotg-ai / proc-blocks

Common processing blocks used with your Runes.
https://hotg-ai.github.io/proc-blocks
9 stars 3 forks source link

The Text Extractor can't handle empty strings #9

Open Michael-F-Bryan opened 2 years ago

Michael-F-Bryan commented 2 years ago

I was playing around with the web runtime when I noticed passing an empty string to the Bert rune will trigger this assertion.

In this case, it looks like end_index and start_index are both zero.

I'm guessing we'll also see this happen when the model is unable to figure out the answer to your question.

Mohit0928 commented 2 years ago

Thanks, @Michael-F-Bryan, for finding it out. I will correct it.

Michael-F-Bryan commented 2 years ago

If it helps, here is what we log when the panic occurs:

{
    "level": "ERROR",
    "message": "panicked at 'Start index: 0 is greater than or equal to end index: 0', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/1e53d6f/text_extractor/src/lib.rs:44:13",
    "target": "hotg_runicos_base_wasm",
    "module_path": "hotg_runicos_base_wasm",
    "file": "/Users/mohit/.cargo/registry/src/github.com-1ecc6299db9ec823/hotg-runicos-base-wasm-0.10.0/src/lib.rs",
    "line": 52
}
Michael-F-Bryan commented 2 years ago

I believe @Mohit0928 created #10 to resolve this, but we're running into the panics in the text_extractor proc block, while that PR's changes were added to the tokenizer proc block.

I think we need to rewrite these lines in a way that lets us handle zero, one, or multiple sentences as an output. Possibly using zip() to iterate over pairs of start/end logits, constructing sentences until we get an invalid pair (e.g. start_index = end_index = 0 because there is no answer).

Mohit0928 commented 2 years ago

@Michael-F-Bryan, When I ran the Bert rune with sentence-1 empty. I'm getting below as error message.

mohit@MacBook-Air bert % rune run bert.rune --raw input1.txt input2.txt
[2021-11-09T15:00:50.288Z WARN  hotg_rune_cli::run::command] TensorFlow Lite has a bug where its MacOS CPU backend will occasionally segfault during inference. See https://github.com/tensorflow/tensorflow/issues/52300 for more.
[2021-11-09T15:00:50.288Z ERROR hotg_runicos_base_wasm] panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9
Error: Call failed

Caused by:
    0: Unable to call the _call function
    1: RuntimeError: The Rune encountered a fatal error
           at <unnamed> (<module>[220]:0x230f7)
           at <unnamed> (<module>[297]:0x24ab1)
           at <unnamed> (<module>[209]:0x22ab3)
           at <unnamed> (<module>[325]:0x25ac5)
           at <unnamed> (<module>[322]:0x259ed)
           at <unnamed> (<module>[122]:0xcecb)
           at <unnamed> (<module>[23]:0x11d8)
           at <unnamed> (<module>[26]:0x2835)
    2: The Rune encountered a fatal error
    3: panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9

Aren't we supposed to get this? Or are we expecting something different?

Michael-F-Bryan commented 2 years ago

No, that's not what we want to do.

The underlying problem is that the model was unable to generate an answer for the provided inputs so the text extractor crashed the Rune (the logits tensor passed to the text extractor was all zeroes). Your "sentence 1 is empty" assertion is in the tokenizer so it won't help prevent this situation.

Another thing to keep in mind is the text passed to a tokenizer is controlled by the user. in general, you should try to handle bad user input graciously (e.g. by setting the tokenizer output to all zeroes) so unexpected input won't crash the runtime. There's a pretty common saying for this:

Be liberal in what you accept, and conservative in what you send - the Robustness Principle

Also, assertions are mainly be used for unrecoverable errors that were most probably caused by a programming bug (e.g. non-finite floats due to a divide by zero). For errors you can expect to see frequently like empty input, we should try to return a sensible default value like an empty string.