stanford-oval / genie-toolkit

The Genie open source kit for voice assistant (formerly known as Almond)
Apache License 2.0

Tokenization bugs with examples #569

Open sileix opened 3 years ago

sileix commented 3 years ago

- "30minutes" is tokenized as "30m inutes".
- "Search for comedy movies that are rated R." is tokenized as "search for comedy movies that are rated r." (no space between r and period).
- "4-5 rating" is tokenized as "4 -5 rating".
- "Show me all profiles with the last name 'Johnson'." is tokenized as "show me all profiles with the last name 'johnson ' ."
- In "Search for hotels with a pool, no more than hundred miles away.", "hundred" is not tokenized as 100.
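To make the expected behavior concrete, here is a minimal regex-based sketch of the splitting I would expect for the cases above. This is not Genie's tokenizer and does not use its API: `sketchTokenize` and `NUMBER_WORDS` are hypothetical names, and the target outputs (for example "30 minutes" as two tokens, or "4 - 5" as three) are my assumption about the desired result, not confirmed behavior.

```typescript
// Hypothetical sketch, NOT the genie-toolkit tokenizer.
// Assumption: a bare number word such as "hundred" should become its digit form.
const NUMBER_WORDS = new Map<string, string>([
    ["hundred", "100"],
    ["thousand", "1000"],
]);

// One token is either:
//   - a number, optionally with a decimal part (30, 4.5)
//   - a word, optionally with inner apostrophes (don't)
//   - any single other non-space character (. , ' -)
const TOKEN_REGEX = /\d+(?:\.\d+)?|[a-z]+(?:'[a-z]+)*|[^\sa-z\d]/g;

function sketchTokenize(input: string): string[] {
    // Lowercase first, matching the lowercased output shown in the report.
    const raw = input.toLowerCase().match(TOKEN_REGEX) || [];
    return raw.map((tok) => {
        const mapped = NUMBER_WORDS.get(tok);
        return mapped !== undefined ? mapped : tok;
    });
}

const examples = [
    "30minutes",
    "Search for comedy movies that are rated R.",
    "4-5 rating",
    "Show me all profiles with the last name 'Johnson'.",
    "Search for hotels with a pool, no more than hundred miles away.",
];
for (const example of examples) {
    console.log(sketchTokenize(example).join(" "));
}
// 30 minutes
// search for comedy movies that are rated r .
// 4 - 5 rating
// show me all profiles with the last name ' johnson ' .
// search for hotels with a pool , no more than 100 miles away .
```

The sketch only handles ASCII letters and two number words, and it maps "hundred" unconditionally, so something like "three hundred" would come out wrong. It is meant as a reference for the expected output, not as a proposed fix.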

gcampax commented 3 years ago

This is literally a bug in the tokenizer, right? We can fix it without significant changes.

sileix commented 3 years ago

> This is literally a bug in the tokenizer, right? We can fix it without significant changes.

Yeah, I will just use this issue to report a list of bugs in tokenization.

gcampax commented 3 years ago

Are you going to fix these any time soon?