merely-useful / py-rse

Research Software Engineering with Python course material
http://third-bit.com/py-rse/

Add to error handling chapter #412

Closed by gvwilson 4 years ago

gvwilson commented 4 years ago

From @DamienIrving's comments on #400 - this needs to be done and rippled through subsequent chapters, but @gvwilson doesn't want to hold up merges.

We might want to fill out py-rse/solutions.Rmd (and update collate.py accordingly) for this chapter before merging. The chapter introduces several new concepts about catching errors and logging, and we talk a little about how they might be added to collate.py, but we never actually make those changes and run the result; that's left to the exercises. As such, it's not clear what the final version of collate.py (i.e. the one that is later packaged up so that it can be pip installed) actually looks like.

I'm thinking it needs to look like this:

"""Combine multiple word count CSV-files into a single cumulative count."""
import csv
import argparse
from collections import Counter
import logging
import utilities

def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)

def main(args):
    """Run the command line program."""
    log_level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_level, filename=args.logfile)
    word_counts = Counter()
    logging.info('Processing files...')
    for file_name in args.infiles:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(f'{file_name} must end in `.csv`.')
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*', help='Input file names')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    parser.add_argument('-v', '--verbose', action="store_true", default=False,
                        help="Change logging threshold from WARNING to DEBUG")
    parser.add_argument('-l', '--logfile', type=str, default='collate.log',
                        help='Name of the log file')
    args = parser.parse_args()
    main(args)

We can then run it as follows:

$ python bin/collate.py results/dracula.csv results/moby_dick.csv -n 10 -v 
the,22559
and,12306
of,10446
to,9192
a,7629
in,6745
i,6557
that,5373
it,4464
he,4260

$ cat collate.log 
INFO:root:Processing files...
DEBUG:root:Reading in results/dracula.csv...
DEBUG:root:Computing word counts...
DEBUG:root:Reading in results/moby_dick.csv...
DEBUG:root:Computing word counts...
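
The log above only covers the `-v` case. As a rough sketch of what happens without it, the snippet below replays the same three messages at both thresholds. It is not the chapter's code: `run_with_level` and the `collate_sketch_*` logger names are made up for illustration, and an in-memory stream stands in for `collate.log` so nothing touches the root logger or the disk. The format string mirrors the `LEVEL:name:message` layout that `logging.basicConfig` uses by default.

```python
import io
import logging


def run_with_level(verbose):
    """Emit the messages collate.py logs and return the lines that survive.

    Hypothetical helper for illustration only; it is not part of collate.py.
    """
    stream = io.StringIO()
    # A dedicated logger plus a stream handler replaces
    # logging.basicConfig(level=..., filename=...) in this sketch.
    logger = logging.getLogger(f'collate_sketch_{verbose}')
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    logger.propagate = False  # keep messages away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
    logger.addHandler(handler)

    # Same messages, in the same order, as the suggested main().
    logger.info('Processing files...')
    logger.debug('Reading in results/dracula.csv...')
    logger.debug('Computing word counts...')

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


verbose_lines = run_with_level(verbose=True)
quiet_lines = run_with_level(verbose=False)

print(len(verbose_lines))  # INFO and DEBUG messages pass the DEBUG threshold
print(quiet_lines)         # nothing is at WARNING or above, so nothing is kept
```

With the default WARNING threshold, none of the INFO or DEBUG messages are recorded, so a run without `-v` that hits no problems would leave the log empty.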

A few comments on that suggested final version of collate.py: