From @DamienIrving's comments on #400 - this needs to be done and rippled through subsequent chapters, but @gvwilson doesn't want to hold up merges.
We might want to fill out py-rse/solutions.Rmd (and update collate.py accordingly) for this chapter before merging. A bunch of new concepts about catching errors and logging are introduced during the chapter and we talk a little about how they might be introduced into collate.py, but we don't actually go ahead and put those changes into collate.py and run it. That's left up to the exercises. As such, it's not clear what the final version of collate.py (i.e. the one that is later packaged up so that it can be pip installed) actually looks like.
I'm thinking it needs to look like this:
"""Combine multiple word count CSV-files into a single cumulative count."""

import csv
import argparse
from collections import Counter
import logging

import utilities


def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)


def main(args):
    """Run the command line program."""
    log_level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_level, filename=args.logfile)
    word_counts = Counter()
    logging.info('Processing files...')
    for file_name in args.infiles:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(f'{file_name} must end in `.csv`.')
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    utilities.collection_to_csv(word_counts, num=args.num)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*', help='Input file names')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    parser.add_argument('-v', '--verbose', action="store_true", default=False,
                        help="Change logging threshold from WARNING to DEBUG")
    parser.add_argument('-l', '--logfile', type=str, default='collate.log',
                        help='Name of the log file')
    args = parser.parse_args()
    main(args)
We can then run it as follows:
$ python bin/collate.py results/dracula.csv results/moby_dick.csv -n 10 -v
the,22559
and,12306
of,10446
to,9192
a,7629
in,6745
i,6557
that,5373
it,4464
he,4260
$ cat collate.log
INFO:root:Processing files...
DEBUG:root:Reading in results/dracula.csv...
DEBUG:root:Computing word counts...
DEBUG:root:Reading in results/moby_dick.csv...
DEBUG:root:Computing word counts...
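In the run above, only CSV rows reach the terminal while the log messages land in collate.log. Here is a minimal, self-contained sketch of that separation. It is not part of collate.py: the helper name run_with_logfile, the file name demo.log, and the sample row are made up for illustration, and force=True assumes Python 3.8 or later.

```python
import contextlib
import io
import logging
import os
import tempfile


def run_with_logfile(logfile):
    """Emit one log message and one CSV row; return what reached stdout.

    Hypothetical helper for illustration only (not in collate.py).
    """
    # force=True (Python 3.8+) replaces any handlers configured earlier,
    # so the sketch behaves the same however logging was set up before.
    logging.basicConfig(level=logging.DEBUG, filename=logfile, force=True)
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        logging.debug('Reading in example.csv...')  # goes to the log file
        print('the,22559')                          # goes to stdout
    logging.shutdown()  # flush and close the file handler
    return captured.getvalue()


log_path = os.path.join(tempfile.mkdtemp(), 'demo.log')
stdout_text = run_with_logfile(log_path)
with open(log_path) as reader:
    log_text = reader.read()
```

Here stdout_text holds only the CSV row and log_text holds only the DEBUG line. One thing worth checking before we word the explanation in the chapter: as far as I can tell, logging.basicConfig without filename writes to standard error rather than standard output, so the default would clutter the terminal but would not be captured by a plain > redirect.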
A few comments on that suggested final version of collate.py:
Using the filename parameter in logging.basicConfig is important: if the logging information were sent to standard output, it would end up in our CSV file whenever we ran something like python bin/collate.py results/dracula.csv results/moby_dick.csv > results/collated.csv
One of the exercises asks "can you use any functions from the csv library to help with this?" Does that mean there is a better alternative to if file_name[-4:] != '.csv' that you think should be in the final version of collate.py?
When assertions are introduced in the testing chapter, an assertion is added to collate.py to check that the input file ends in .csv. Since the version above already raises an OSError in that case, we'll have to pick a different example for adding an assertion to our code when we get to editing that chapter.
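On the second comment, one possible alternative to slicing the file name is to compare its suffix using pathlib from the standard library. This is only a sketch of that option, not necessarily what the exercise has in mind: I'm not aware of a function in the csv module that checks file extensions, and the helper names has_csv_suffix and check_csv are hypothetical.

```python
from pathlib import Path


def has_csv_suffix(file_name):
    """Return True if file_name ends in .csv (case-sensitive).

    Hypothetical helper for illustration only (not in collate.py).
    """
    # Path.suffix is the final extension including the dot, or '' if none.
    return Path(file_name).suffix == '.csv'


def check_csv(file_name):
    """Raise the same error as the suggested collate.py for a bad name."""
    if not has_csv_suffix(file_name):
        raise OSError(f'{file_name} must end in `.csv`.')
```

The behaviour matches the slice for ordinary names such as results/dracula.csv. It differs for a file named exactly .csv, which pathlib treats as having no suffix.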