ogallagher / terry

Terry the virtual secreTERRY

Inaccurate token dictionary lookup #6

Closed (ogallagher closed this 4 years ago)

ogallagher commented 4 years ago

Unfortunately, the trained DeepSpeech model I’m using for the scribe is not as accurate as I’d like, so when the instruction classifier parses tokens and tries to match them to dictionary entries, typos in the transcription make a perfect match unlikely. I think this is largely because the model was trained on casual phone conversation, which has a much larger and different vocabulary than what Terry would expect to hear from user instructions. Therefore, given the vast range of the output set, very slight variations in enunciation can be transcribed as very different words.

To fix this issue of a generic and vast transcription vocabulary, I can either:

  1. Customize the vocabulary used to build the model. Apparently the corpus used to train the original model is not the same dataset as the one that’s publicly available, though training on the public corpus should in theory yield similar results. However, in practice building a customized model looks like a substantial amount of work with no guarantee of improvement.
  2. Make dictionary lookups probabilistic, factoring in word edit distance.

I plan to take the second approach: working with faulty transcriptions and creating scored dictionary lookups based on word edit distance. I’ll use the Wagner-Fischer algorithm.
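
For reference, here is a minimal sketch of the Wagner-Fischer algorithm in Java; the class and method names are illustrative, not necessarily the ones Terry will end up using.

```java
// Minimal Wagner-Fischer sketch; names are illustrative, not Terry's actual code.
public class EditDistance {

    // Returns the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    public static int editDistance(String a, String b) {
        int m = a.length();
        int n = b.length();
        int[][] d = new int[m + 1][n + 1];

        // Distance from an empty prefix is just the number of insertions/deletions.
        for (int i = 0; i <= m; i++) d[i][0] = i;
        for (int j = 0; j <= n; j++) d[0][j] = j;

        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int substitution = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,        // delete from a
                                 d[i][j - 1] + 1),       // insert into a
                        d[i - 1][j - 1] + substitution); // substitute (or match)
            }
        }
        return d[m][n];
    }
}
```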

ogallagher commented 4 years ago

An issue with scored lookups is that now, instead of leveraging the hash map structure of the dictionary for quick entry retrieval, Terry needs to search multiple (if not all) entry keys, score edit distances, and return the entry with the best score.

If Terry has to scan every key in the hash map and compare edit distances whenever a typo appears, instruction parsing will slow down considerably, though this step could be parallelized.
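
To illustrate, a rough sketch of what a scored lookup over all keys could look like, using a parallel stream; the map shape and the `lookup` signature are assumptions for this example, not Terry’s actual API.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Illustrative only: scan every dictionary key, score it by edit distance to the
// transcribed token, and return the closest entry. The generic value type V stands
// in for whatever the dictionary actually stores per entry.
public class ScoredLookup {
    public static <V> Optional<Map.Entry<String, V>> lookup(Map<String, V> dictionary, String token) {
        return dictionary.entrySet()
                .parallelStream() // each key is scored independently, so the scan parallelizes trivially
                .min(Comparator.comparingInt(
                        (Map.Entry<String, V> entry) -> EditDistance.editDistance(token, entry.getKey())));
    }
}
```

A maximum-distance cutoff on the winning score would still let Terry reject tokens that match nothing reasonably well instead of always returning the least-bad entry.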

ogallagher commented 4 years ago

I finished a first draft of an editDistance() method, currently used in dictionary lookups for initial language mappings during the instruction classification step.

The problem I see here is that resolving subsequent tokens against possible followers in a given set of mappings currently requires an exact match. So, in the future I should extend the use of edit distance to resolving a token against its followers.

ogallagher commented 4 years ago

Use of edit distance for follower token resolution is now implemented.
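
For completeness, a rough sketch of what edit-distance-based follower resolution could look like; the follower collection, the resolveFollower name, and the distance tolerance are assumptions for this example rather than the actual implementation.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Illustrative only: instead of requiring an exact match, score the transcribed token
// against each allowed follower key and accept the closest one, provided its distance
// stays within a tolerance proportional to the token length.
public class FollowerResolution {
    public static Optional<String> resolveFollower(Collection<String> followerKeys, String token) {
        return followerKeys.stream()
                .min(Comparator.comparingInt(key -> EditDistance.editDistance(token, key)))
                .filter(best -> EditDistance.editDistance(token, best) <= Math.max(1, token.length() / 3));
    }
}
```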