Closed: gerritholl closed this issue 4 years ago.
Nice issue! Thanks for filing this. Some initial debugging to try to understand the icicle plot: there are 20 ABI file patterns listed in the YAML. The icicles are a little awkward to read because of generators, but the main starting point is here:

Assuming `self.sorted_filetype_items` generates 20 items and `self.filename_items_for_filetype` calls `globify` once per pattern per filename (it is actually less than that), the worst case for this loop should be calling `globify` 20 * 125 = 2500 times. Right? So where are these 50000 calls to `globify` coming from? It isn't a recursive function, and `globify` is only called twice in `yaml_reader.py`:

And:
@gerritholl I'd be curious what kind of difference you see if you move the `globify` in this function outside the for loop:
I'll dig into checking why it's called so often later. I wonder if there is a tracing tool that can record the call stack every time `globify` (or any other function) is called and then show statistics on that. In principle it should be possible, so it has probably been done, and I seem to recall having seen such a thing, but it was so long ago that I don't even remember for sure whether it was Python. I've done a "monte carlo" type investigation of such matters before by setting a conditional breakpoint with the condition `random.random() < x`.
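Such a stack-recording tool is easy to improvise. The sketch below is hypothetical code, not part of satpy: it wraps a function so that every call records the current call stack in a `Counter`, which can then be summarised to see where the calls come from.

```python
import traceback
from collections import Counter
from functools import wraps

call_sites = Counter()


def record_stacks(func):
    """Wrap *func* so every call records the current call stack."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Use the formatted stack (minus this wrapper frame) as a key.
        stack = "".join(traceback.format_stack()[:-1])
        call_sites[stack] += 1
        return func(*args, **kwargs)
    return wrapper


@record_stacks
def globify(fmt, keyvals=None):
    # Hypothetical stand-in for the real globify under investigation.
    return fmt


for _ in range(3):
    globify("{foo}")

# call_sites.most_common() now shows which call stacks hit globify most often.
for stack, count in call_sites.most_common():
    print(count, "calls from:")
    print(stack)
```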
Could even just add a `print` statement inside `globify` that uses a global index, or do `print("globify")` and, when you run the code from the command line, do `python my_script.py | grep globify | wc -l`. But yeah, doing it the right way would be nice :wink:
Using the ancient `trace` module that comes with Python, `python -m trace -c -C . /tmp/mwe37.py` reveals that
```
    1: def globify(fmt, keyvals=None):
           """Generate a string usable with glob.glob() from format string
           *fmt* and *keyvals* dictionary.
           """
50020:     if keyvals is None:
50020:         keyvals = {}
50020:     return globify_formatter.format(fmt, **keyvals)
```
we can tell that `globify` is called in two places in `yaml_reader`:

```
50000: if fnmatch(get_filebase(filename, pattern), globify(pattern)):
   20: globified = globify(pattern)
```
A closer look at `match_filenames` reveals the loop is executed 50000 times:

```
    1: def match_filenames(filenames, pattern):
           """Get the filenames matching *pattern*."""
   20:     matching = []
50020:     for filename in filenames:
50000:         if fnmatch(get_filebase(filename, pattern), globify(pattern)):
 2000:             matching.append(filename)
   20:     return matching
```
Inspecting this with the debugger confirms that `len(filenames) == 2500`; that's 125 filenames × 20 patterns. This function is called 20 times. So something is being inefficient: instead of 20 * 125 = 2500 times, `globify` is called 20 * 20 * 125 = 50000 times.
:confused: So why would filenames be that long?

Ok thanks. If you aren't already expecting to work on this, I can take a look next week.
Figured part of it out: if the various patterns, as glob patterns, match the same files, there is nothing to stop duplicate filenames from being present until later on, when we use sets (which I know we do in some places, but I'm not even sure whether we do in this method).

Edit: Thinking about this more, we should make a set of the glob patterns, then a set of the results.
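A minimal sketch of that idea (the function and parameter names here are illustrative, not satpy's actual API): deduplicate the globified patterns before globbing, then deduplicate the results afterwards.

```python
from glob import glob


def find_matching_files(patterns, globify, glob_func=glob):
    """Glob each unique pattern once and return unique filenames.

    *globify* turns a trollsift-style format string into a glob pattern;
    *glob_func* is injectable so this can be tested without disk I/O.
    Both are stand-ins for the real satpy machinery.
    """
    unique_globs = {globify(p) for p in patterns}  # set of patterns first
    matched = set()                                # then a set of results
    for g in unique_globs:
        matched.update(glob_func(g))
    return sorted(matched)
```

With this shape, two patterns that globify to the same glob string only hit the filesystem once, and a file matched by several patterns appears only once in the output.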
Starting to work on this. I'm using a directory with 5763 ABI files in it (mesoscale) and running:

```python
import sys
from datetime import datetime

from satpy.readers import find_files_and_readers


def main():
    base_dir = '/data/satellite/abi/20180911_florence'
    reader = 'abi_l1b'
    st = datetime(2018, 9, 11, 0, 0, 0)
    return find_files_and_readers(
        base_dir=base_dir,
        reader=reader,
        start_time=st,
    )


if __name__ == "__main__":
    sys.exit(main())
```
In PyCharm's profiler, which calls cProfile, `globify` is called 115220 times and takes up ~11.5 s of the total ~16.5 s.
Moving the `globify` call outside of the loop cut off 75% of my run time. Here's the new call graph (`globify` isn't even listed as a main actor):

I'll make a PR and then work on additional optimizations.
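The fix amounts to hoisting `globify(pattern)` out of the loop, since the pattern does not change between iterations. A sketch of the reworked `match_filenames` under assumptions: `get_filebase` is stubbed here for illustration, and the injectable `globify` default is only a placeholder, not satpy's real signature.

```python
import os
from fnmatch import fnmatch


def get_filebase(filename, pattern):
    # Simplified stand-in for satpy's helper, which strips the directory
    # part so the filename can be matched against *pattern*.
    return os.path.basename(filename)


def match_filenames(filenames, pattern, globify=lambda p: p):
    """Get the filenames matching *pattern*."""
    matching = []
    glob_pattern = globify(pattern)  # hoisted: computed once, not per file
    for filename in filenames:
        if fnmatch(get_filebase(filename, pattern), glob_pattern):
            matching.append(filename)
    return matching
```

This keeps the behaviour identical while reducing the `globify` call count from once per (pattern, filename) pair to once per pattern.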
Describe the bug

The function `satpy.readers.find_files_and_readers` is slow. This is particularly clear for instruments with many files and many file patterns. For ABI L1b, an hour of M6 full disk data consists of 96 files; for M6 CONUS data this is even 196 files. There are 16 file patterns defined in the YAML file. It appears that `find_files_and_readers` scales poorly with both the number of files and the number of patterns. With 125 files, `find_files_and_readers` takes nearly 3 seconds, even if all actual disk I/O is filtered out.

To Reproduce
This reproduction is based on current master, but with the branch for PR #1169 merged, so that I can implement a dummy filesystem class to exclude I/O from the performance measurement. In this example, `glob` always returns a list of the same 125 files instantly:

Running this with the Python profiler gives us a profile file that we can visualise with snakeviz.
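The elided reproduction script is roughly along these lines. This is a sketch under assumptions: the fake filenames and the `mock.patch` approach are mine, whereas the original MWE used the PR #1169 filesystem hooks; whether patching `glob.glob` actually intercepts satpy's globbing depends on how satpy imports it.

```python
import cProfile
from unittest import mock

# 125 fake ABI-L1b-like filenames, returned "instantly" without disk I/O.
# The name format is only illustrative.
FAKE_FILES = [
    f"OR_ABI-L1b-RadF-M6C{c:02d}_G16_s2019{i:03d}0000000"
    f"_e2019{i:03d}0010000_c2019{i:03d}0011000.nc"
    for c in range(1, 6)
    for i in range(25)
]


def profile_find_files():
    # Imported lazily so the patch is in place before any globbing happens.
    from satpy.readers import find_files_and_readers
    profiler = cProfile.Profile()
    with mock.patch("glob.glob", return_value=list(FAKE_FILES)):
        profiler.enable()
        try:
            find_files_and_readers(reader="abi_l1b")
        finally:
            profiler.disable()
    # Visualise afterwards with: snakeviz /tmp/mwe37.prof
    profiler.dump_stats("/tmp/mwe37.prof")
```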
Expected behavior

I would expect this to run quickly. Sure, it needs to test 125 files up to 16 times, but a single call to `fnmatch` only lasts 689 ns ± 45 ns:

Even when multiplied by 125 * 16, this makes just 1.378 milliseconds, three orders of magnitude less than `find_files_and_readers`, so the overhead is somewhere other than in matching the filenames. I think it should be reasonable to expect `find_files_and_readers` to run at least 10 times faster than it does now, and probably in less than 100 ms for this particular case.

Actual results
When running without the profiler, the function takes 2.4 seconds. When running with the profiler, it takes about 6.1-6.3 seconds. The profiling results (see above) reveal that, cumulatively, 5.12 seconds are spent in `globify`, which is apparently called 50020 times. The full profile: mwe37b.prof.gz
Screenshots

Screenshot of the "Icicle" type visualisation in snakeviz, revealing 6.13 seconds for `find_files_and_readers`, including a cumulative 5.12 seconds in `globify`:

When sorting the table of statistics per function by cumulative time, we can see the number of calls for the most time-consuming parts of the code:
Apparently, there are 20 calls to `match_filenames`; that's OK, but then there are 50020 calls to `globify`, which seems excessive.

Environment Info:

v0.21.0-14-g65cd83f9 with both #1165 and #1169 merged.

Additional context
This may not seem like a big deal compared to the I/O involved with reading or downloading the data, or to the total processing time. However, in fogtools it is considerably slowing down unit tests that simulate downloading of ABI data, including file selection for different parameters, which is where I notice this most.