rsokl / Learning_Python

Source material for Python Like You Mean it
https://www.pythonlikeyoumeanit.com/

Adding a glob subsection #51

Closed rsokl closed 6 years ago

rsokl commented 6 years ago

@davidmascharka do you have any interest in adding a brief subsection to Working with Files about glob? Maybe entitled: "finding files with glob".

I have a single Path-method example using glob, but it is so essential that it probably deserves its own brief subsection that shows off the patterns a bit more. We can probably forgo discussing the glob module and keep it in the context of a Path method.

I figured that you can do more justice to Unix-style pattern matching than I ;)

If you want, you can just post the markdown here, and I can incorporate it into the notebook. Whatever you want to do (if you want to write this at all)

davidmascharka commented 6 years ago

Yeah, globbing is super useful (although full-on regex matching is better...). I'll put something together in the next few days on it and toss it under here in a comment.

davidmascharka commented 6 years ago

@rsokl this is probably incomplete but it's a draft.

Globbing for Files

There are many cases in which we may want to construct a list of files to iterate over. For example, if we have several data files, it would be useful to create a file list which we can iterate through and process in sequence. One way to do this would be to manually construct such a list of files:

my_files = ['data/file1.txt', 'data/file2.txt', 'data/file3.txt', 'data/file4.txt']

However, this is extraordinarily tedious and prone to error, whether from mistyping a file name or forgetting a file. A much more powerful way to construct such a list of files is by file globbing. A glob is a set of file names matching some pattern. To glob files, we write a pattern containing special wildcard characters, each of which stands in for part of a file name. In our case, * will be the wildcard we use the most: it matches any sequence of characters, including none at all. This is easiest to see with an example. Below, we see some globs and the kinds of file names they will match:

# matches anything that starts with `file` and ends with `.txt` like 
# file1.txt, filefilefile.txt, file.txt, file12345.txt, ...
file*.txt 

# matches any file name
*

# matches all png image files
*.png

# matches anything that contains 'test' as part of its file name
*test*

# matches all .py files whose names contain 'number'
*number*.py
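
To try these patterns without touching the file system, here is a minimal sketch that checks them against a made-up list of names. It uses `fnmatchcase` from the standard library's `fnmatch` module, which implements the same shell-style wildcards; the file names and the `matching` helper are only illustrative and not part of the draft above.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the patterns
names = ["file1.txt", "file.txt", "notes.txt", "plot.png",
         "test_number.py", "my_test.txt"]


def matching(pattern, candidates):
    """Return the candidates that match the glob pattern (case-sensitive)."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


print(matching("file*.txt", names))    # names starting with 'file', ending '.txt'
print(matching("*.png", names))        # png image files
print(matching("*test*", names))       # names containing 'test'
print(matching("*number*.py", names))  # .py files containing 'number'
```

Note that `*` also matches the empty string, which is why `file.txt` matches `file*.txt`.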

Exercise: write a glob pattern that matches

The * wildcard is not the only one available to us. Sometimes it can be useful to match certain subsets of characters. For example, we may only want to match file names that start with a number. With the * wildcard alone, that's not possible. Luckily for us, these common use-cases are also taken care of.

To match a subset of characters, we can use square brackets: [abc]* will match anything that starts with 'a', 'b', or 'c' and nothing else. We can also use a '-' inside our brackets to match a range of characters. For example:

# matches any .txt file whose name starts with a digit
[0-9]*.txt

# matches any file that has a lowercase vowel in its name
*[aeiou]*

# matches any file that starts with a lowercase letter
[a-z]*
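
The bracket patterns can be checked the same way. This is a rough sketch using `fnmatchcase` from the standard library's `fnmatch` module on invented names; the names and the `matching` helper are assumptions for illustration only. `fnmatchcase` is always case-sensitive, so `Banana.csv` is not matched by `[a-z]*`.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the patterns
names = ["1_data.txt", "data1.txt", "apple.csv", "Banana.csv",
         "xyz", "9lives.txt"]


def matching(pattern, candidates):
    """Return the candidates that match the glob pattern (case-sensitive)."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


print(matching("[0-9]*.txt", names))  # .txt files starting with a digit
print(matching("*[aeiou]*", names))   # names containing a lowercase vowel
print(matching("[a-z]*", names))      # names starting with a lowercase letter
```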

Exercise: write a glob pattern that matches

The pathlib module provides convenient functionality for globbing files. Once we have a Path object, we can simply call glob() on it and pass in a glob string:

from pathlib import Path

root_dir = Path('.')
files = root_dir.glob('test*.txt')
# files is a generator that yields every path in root_dir whose name
# starts with 'test' and ends with '.txt'

for file in files:
    with open(file, 'r') as f:
        contents = f.read()  # do some processing
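
For a version that can be run anywhere, here is a small sketch that creates a few throwaway files in a temporary directory and globs them. The file names and the `find_test_files` helper are made up for the demonstration. Because `glob()` returns a generator with no guaranteed order, the results are sorted before use.

```python
import tempfile
from pathlib import Path


def find_test_files(root):
    """Return the sorted names of files in root matching 'test*.txt'."""
    return sorted(path.name for path in Path(root).glob("test*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files, created only for this demonstration
    for name in ["test1.txt", "test_b.txt", "other.txt", "test.png"]:
        (root / name).write_text("example contents")

    found = find_test_files(root)
    print(found)  # only the names that start with 'test' and end with '.txt'
```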

For more details on globbing, see the documentation.

rsokl commented 6 years ago

This rocks! I will incorporate this immediately

rsokl commented 6 years ago

Added to version 0.14