word counts

mmaz commented 4 years ago

Really great job on kicking off the wordcount feature Tejas! Excited to see you making progress so fast. Some suggestions on next steps:

[x] It looks like the current script produces a csv of word counts for an input list of keywords. I think what we're looking for instead is a csv of word counts for all words present in the .tsv file (after they have been normalized with clean_and_filter). Let me know if you have questions about this.
[x] Excellent to see type annotations! Can you also add docstrings please?
[x] use the standard if __name__ == "__main__": guard (link)
- I think you can omit sys.argv since you're using argparse, which reads the command-line arguments on its own
[x] format with black (https://github.com/psf/black)
[x] rename the file to snake_case (I have a bad habit of camelcasing .ipynb files but I think Python files should be lowercased; eventually we will move several of these functions into a library)
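
A minimal sketch of how the first and third suggestions could fit together, under several assumptions: normalize is a hypothetical stand-in for clean_and_filter (whose real behavior is not shown in this thread), the .tsv has a header row with a column named sentence, and the function names and argument names are made up for illustration.

```python
import argparse
import csv
import re
from collections import Counter
from typing import Iterable, List, Optional, Sequence, TextIO


def normalize(sentence: str) -> List[str]:
    """Hypothetical stand-in for clean_and_filter: lowercase, keep runs of letters."""
    # [^\W\d_]+ matches Unicode letters only, so digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", sentence.lower())


def count_words(sentences: Iterable[str]) -> Counter:
    """Count every normalized word across the given sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(normalize(sentence))
    return counts


def count_words_in_tsv(tsv_file: TextIO, column: str = "sentence") -> Counter:
    """Count words in one column of a tab-separated file with a header row."""
    # QUOTE_NONE keeps stray quote characters in sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return count_words(row[column] for row in reader)


def write_counts(counts: Counter, csv_file: TextIO) -> None:
    """Write word,count rows, most frequent first."""
    writer = csv.writer(csv_file)
    writer.writerow(["word", "count"])
    writer.writerows(counts.most_common())


def main(argv: Optional[Sequence[str]] = None) -> None:
    """Parse arguments and write the counts.

    With argv=None, argparse reads sys.argv[1:] itself, so the script
    never needs to touch sys.argv directly.
    """
    parser = argparse.ArgumentParser(description="Count all words in a .tsv column.")
    parser.add_argument("tsv_path")
    parser.add_argument("csv_path")
    parser.add_argument("--column", default="sentence")
    args = parser.parse_args(argv)

    with open(args.tsv_path, newline="", encoding="utf-8") as tsv_file:
        counts = count_words_in_tsv(tsv_file, args.column)
    with open(args.csv_path, "w", newline="", encoding="utf-8") as csv_file:
        write_counts(counts, csv_file)
```

In the script itself, main() would be called from the if __name__ == "__main__": guard, which keeps the counting functions importable from a test or a notebook without running the command-line entry point.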

Again, great job!! Let me know if you have any questions or if these suggestions don't make sense.

tejasprabhune commented 4 years ago

I made the changes! Let me know what you think.

mmaz commented 4 years ago

Nice! One thing to consider, though, is whether it would be better to run clean_and_filter on each input sentence (with some refactoring). For example, will keyword_set end up with separate entries for the lowercase and uppercase forms of the same word?
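
To make the concern concrete, here is a small sketch of how counts can split by case when sentences are not normalized first. Plain lowercasing stands in for clean_and_filter here, which is an assumption; the real function may do more.

```python
from collections import Counter

sentences = ["Three apples", "three oranges"]

# Counting raw tokens keeps "Three" and "three" as separate entries.
raw = Counter(word for s in sentences for word in s.split())

# Normalizing each sentence before counting merges them.
# (str.lower is only a stand-in for clean_and_filter.)
normalized = Counter(word for s in sentences for word in s.lower().split())

print(raw["Three"], raw["three"])  # 1 1
print(normalized["three"])         # 2
```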

It might be helpful to also create a unit test to test some corner cases for this script, and also to document some shortcomings that we aren't currently addressing (such as not combining word stems, which is fine for now). For example:

input = """#sentence
Three apples, three oranges 3 pears & one more pear"""

Your unit test might verify we have

{ "three": 2,
  "pears": 1,
  "pear": 1,
  "and": 0,
  ...}
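
One way such a test could look, as a sketch rather than the project's actual test. It assumes the first line of the input is a header to skip, and it uses a hypothetical normalize function in place of clean_and_filter (lowercasing and keeping only runs of letters), so the digit and ampersand handling shown here is an assumption about the desired behavior.

```python
import re
import unittest
from collections import Counter
from typing import List

SAMPLE = """#sentence
Three apples, three oranges 3 pears & one more pear"""


def normalize(sentence: str) -> List[str]:
    """Hypothetical stand-in for clean_and_filter: lowercase, keep runs of letters."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def count_words(text: str) -> Counter:
    """Count normalized words in text whose first line is a header."""
    counts: Counter = Counter()
    for line in text.splitlines()[1:]:  # skip the "#sentence" header line
        counts.update(normalize(line))
    return counts


class WordCountTest(unittest.TestCase):
    def setUp(self) -> None:
        self.counts = count_words(SAMPLE)

    def test_case_is_folded(self) -> None:
        # "Three" and "three" land in the same entry.
        self.assertEqual(self.counts["three"], 2)
        self.assertNotIn("Three", self.counts)

    def test_stems_are_not_combined(self) -> None:
        # Known shortcoming: "pears" and "pear" are counted separately.
        self.assertEqual(self.counts["pears"], 1)
        self.assertEqual(self.counts["pear"], 1)

    def test_symbols_and_digits_are_dropped(self) -> None:
        # "&" is not expanded to "and", and "3" is not mapped to "three".
        self.assertEqual(self.counts["and"], 0)
        self.assertNotIn("&", self.counts)
        self.assertNotIn("3", self.counts)
```

Saved in a file, this can be run with python -m unittest. The second test documents the word-stem shortcoming explicitly, so a later change that merges stems would fail it on purpose and prompt an update.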

harvard-edge / multilingual_kws

word counts #1