harvard-edge / multilingual_kws

Few-shot Keyword Spotting in Any Language and Multilingual Spoken Word Corpus

word counts #1

Closed mmaz closed 3 years ago

mmaz commented 4 years ago

Really great job kicking off the wordcount feature, Tejas! Excited to see you making progress so fast. Some suggestions on next steps:

Again, great job!! Let me know if you have any questions or if these suggestions don't make sense.

tejasprabhune commented 4 years ago

I made the changes! Let me know what you think.

mmaz commented 4 years ago

Nice! One thing to consider, though, is whether it would be better to run clean_and_filter on each input sentence (with some refactoring). In other words, will keyword_set end up with separate entries for the lowercase and uppercase forms of the same word, for example?
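To make the casing question concrete, here is a minimal sketch of what happens when words are counted before and after normalization. The `normalize` helper is a hypothetical stand-in, not the repository's `clean_and_filter`, whose actual signature and behavior are not shown in this thread.

```python
from collections import Counter

sentence = "Three apples, three oranges"

# Naive whitespace split: case and trailing punctuation stay in the keys,
# so "Three" and "three" are counted as two different words.
raw_counts = Counter(sentence.split())


def normalize(word):
    """Hypothetical stand-in for clean_and_filter: strip punctuation, lowercase."""
    return word.strip(".,!?;:").lower()


# Normalizing each word first merges the two spellings into one key.
normalized_counts = Counter(normalize(w) for w in sentence.split())
```

With this input, `raw_counts` keeps `"Three"` and `"three"` as separate keys, while `normalized_counts` has a single `"three"` entry with a count of 2.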

It might also be helpful to write a unit test covering some corner cases for this script, and to document the shortcomings we aren't currently addressing (such as not combining word stems, which is fine for now). For example:

input = """#sentence
Three apples, three oranges 3 pears & one more pear"""

Your unit test might verify we have

{ "three": 2,
  "pears": 1,
  "pear": 1,
  "and": 0,
  ...}
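A test along these lines could be sketched as follows. `count_words` is a hypothetical stand-in for the word-count script, written only so the example runs on its own. It assumes that lines starting with `#` are headers, that digits and symbols are dropped rather than spelled out, and that no stemming is done.

```python
import re
from collections import Counter


def count_words(text):
    """Hypothetical stand-in for the word-count script under test."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("#"):  # assumed: header lines such as "#sentence" are skipped
            continue
        # Lowercase, then keep only runs of letters/apostrophes.
        # Digits ("3") and symbols ("&") are dropped, not converted to words.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


SAMPLE = """#sentence
Three apples, three oranges 3 pears & one more pear"""


def test_corner_cases():
    counts = count_words(SAMPLE)
    assert counts["three"] == 2   # "Three" and "three" merge; "3" is not counted
    assert counts["pears"] == 1   # no stemming: "pears" and "pear" stay separate
    assert counts["pear"] == 1
    assert counts["and"] == 0     # "&" is not expanded to "and"
    assert "sentence" not in counts  # the header line is not counted
```

The test function can be picked up by pytest or called directly. A `Counter` returns 0 for missing keys, which is why the `"and"` check works without a `KeyError`.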