swcarpentry / python-intermediate

Intermediate lesson for students who have taken SWC or DC novice python lessons.

IDEA for motivating scenario #9

Open daisieh opened 8 years ago

daisieh commented 8 years ago

File salad cleanup!

We could have many example directories from different domains (phylogenetics, genomics, astronomy, ecology, etc.) where you've inherited a terrible directory from a previous student/postdoc/whatever, with a lot of files. If you look at them, you can tell that they clearly fall into several different categories (scripts/notes/data, RAxML/BEAST/phylip, whatever), but the files are terribly named, lack good file extensions, have "OLD" or "old" or similar tacked on as a suffix, etc. How do you clean this up?

We can promise that at the end of the workshop, students will be able to clean these up. Possibly use other people's code to clean them up as well.

So we could then back up and start with a simple fake scenario where there is a large script that cleans up the simple fake directory (or possibly does it wrong). We can then use this script to discuss Python syntax, how to break the script down and abstract it, how to write APIs that would do each bit, and how to write tests that verify that each bit works. Then we can show how the same main function, a stripped-down version of the original script, can run different imported subfunctions to clean up each of the crazy domain-specific directories.
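To make the "break it down, then test each bit" idea concrete, here is a minimal sketch of what the decomposed script might look like. All names (`strip_old_suffix`, `guess_category`, `plan_cleanup`) and the classification rules are hypothetical placeholders, not anything that exists in the lesson yet; a real domain module would supply its own `categorize` function.

```python
import re

# Hypothetical rule: a trailing "old" marker preceded by a separator,
# e.g. "_OLD", ".old", "-old2". A separator is required so that names
# like "threshold" are left alone.
_OLD_SUFFIX = re.compile(r'[._-]old\d*$', re.IGNORECASE)


def strip_old_suffix(name):
    """Remove a trailing 'old' marker from a file name, if present."""
    return _OLD_SUFFIX.sub('', name)


def guess_category(first_line):
    """Guess a category from a file's first line (illustrative rules only)."""
    if first_line.startswith('#!'):
        return 'scripts'
    if first_line.startswith('#NEXUS'):
        return 'data'
    return 'notes'


def plan_cleanup(files, categorize=guess_category):
    """Map each (name, first_line) pair to a proposed 'category/name' path.

    Nothing is moved here: the plan can be inspected or tested first,
    and a domain-specific categorize function can be swapped in.
    """
    plan = {}
    for name, first_line in files:
        plan[name] = categorize(first_line) + '/' + strip_old_suffix(name)
    return plan


# A tiny fake directory listing: (file name, first line of the file).
example = [
    ('run_OLD', '#!/usr/bin/env python'),
    ('tree.old', '#NEXUS'),
    ('todo', 'ask about the alignment'),
]
plan = plan_cleanup(example)
```

Because `plan_cleanup` only builds a plan, each piece can be tested without touching the file system, and the domain-specific directories would differ only in the `categorize` function they pass in.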

gvwilson commented 8 years ago

+1 --- I've got a short example that finds duplicated files, which could also be part of the cleanup?

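    #!/usr/bin/env python

    import sys
    import os
    import hashlib

    # Use the current directory unless a root directory is given.
    if len(sys.argv) == 1:
        root = os.curdir
    elif len(sys.argv) == 2:
        root = sys.argv[1]
    else:
        sys.exit('Usage: dup.py [dir]')

    # Map each content digest to the set of paths with that content.
    found = {}
    for (dirpath, dirnames, filenames) in os.walk(root):
        # Prune .git in place so os.walk never descends into it.
        if '.git' in dirnames:
            dirnames.remove('.git')
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            # Read bytes: hashing needs binary data, not decoded text.
            with open(path, 'rb') as reader:
                data = reader.read()
            digest = hashlib.md5(data).digest()
            if digest not in found:
                found[digest] = set()
            found[digest].add(path)

    # Report every group of files that share a digest.
    for digest in found:
        paths = found[digest]
        if len(paths) > 1:
            print(paths)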
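If the duplicate finder is going to be one of the imported pieces of the cleanup, it might be packaged roughly like this. This is a sketch under assumptions, not the script above: the function names `file_digest` and `find_duplicates` are made up for illustration, files are hashed in chunks rather than read whole, and SHA-256 is swapped in for MD5.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as reader:
        for chunk in iter(lambda: reader.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths, one list per group of identical files."""
    found = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk skips it entirely.
        dirnames[:] = [d for d in dirnames if d != '.git']
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            found.setdefault(file_digest(path), []).append(path)
    return sorted(sorted(paths) for paths in found.values() if len(paths) > 1)


# Small self-check in a throwaway directory with two identical files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('a.txt', 'same'), ('b_old.txt', 'same'), ('c.txt', 'different')]:
        with open(os.path.join(tmp, name), 'w') as writer:
            writer.write(text)
    groups = find_duplicates(tmp)
    duplicate_names = [[os.path.basename(p) for p in group] for group in groups]
```

Returning the groups instead of printing them means a test can check the result directly, and the cleanup's main function can decide what to do with each group.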