pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/

find_files_and_readers is slow #1172

Closed gerritholl closed 4 years ago

gerritholl commented 4 years ago

Describe the bug

The function satpy.readers.find_files_and_readers is slow. This is particularly noticeable for instruments with many files and many file patterns. For ABI L1b, an hour of M6 full disk data consists of 96 files; for M6 CONUS data it is even 196 files. There are 16 file patterns defined in the yaml file. find_files_and_readers appears to scale poorly with both the number of files and the number of patterns. With 125 files, find_files_and_readers takes nearly 3 seconds, even when all actual disk I/O is excluded.

To Reproduce

This reproduction is based on current master but with the branch for PR #1169 merged, so that I can implement a dummy filesystem class to exclude I/O from the performance measurement. In this example, glob always returns a list of the same 125 files instantly:

import datetime
import timeit
import satpy.readers
pat = "noaa-goes16/ABI-L1b-Rad{m:s}/2020/{d:>03d}/{h:>02d}/OR_ABI-L1b-RadC-M3C{ch:>02d}_G16_s2020{d:>03d}{h:>02d}00000_e2020{d:>03d}{h:>02d}10000_c2020{d:>03d}{h:>02d}15000.nc"                                                                                                        
ref = [pat.format(m="F", ch=ch, d=d, h=h)
        for ch in range(5)
        for d in range(5)
        for h in range(5)]
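# -> 125 distinct dummy file paths (5 channels x 5 days x 5 hours)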

class DummyFS:
    def glob(self, _):
        return ref

satpy.readers.find_files_and_readers(
        base_dir="/it/doesnt/matter",
        fs=DummyFS(),
        reader="abi_l1b",
        start_time=datetime.datetime(2020, 1, 1, 1, 30))

Running this with the Python profiler:

python -m cProfile -o mwe37b.prof mwe37.py

gives us a profile file that we can visualise with snakeviz.
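For reference, visualising the profile is then just:

snakeviz mwe37b.prof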

Expected behavior

I would expect this to run quickly. Sure, it needs to test 125 files up to 16 times, but a single call to fnmatch takes only 689 ns ± 45 ns:

In [25]: %timeit fnmatch.fnmatch('noaa-goes16/ABI-L1b-RadF/2020/000/00/OR_ABI-L1b-RadC-M3C00_G16_s20200000000000_e20200000010000_c20200000015000.nc', '??_???-L1b-???*-??C01_???_s?????????????*_e?????????????*_c?????????????*.nc*')
689 ns ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Even multiplied by 125*16 calls, that amounts to just 1.378 milliseconds, three orders of magnitude less than what find_files_and_readers takes, so the overhead lies somewhere other than in matching the filenames. It seems reasonable to expect find_files_and_readers to run at least 10 times faster than it does now, and probably in less than 100 ms for this particular case.

Actual results

When running without the profiler, the function takes 2.4 seconds. When running with the profiler, it takes about 6.1-6.3 seconds. The profiling results (see above) reveal that, cumulatively, 5.12 seconds are spent in globify, which is apparently called 50020 times. The full profile:

mwe37b.prof.gz

Screenshots

Screenshot of the "Icicle" type visualisation in snakeviz, revealing 6.13 seconds for find_files_and_readers including a cumulative 5.12 seconds in globify:

[screenshot: snakeviz icicle view]

When sorting the table of statistics per function by cumulative time, we can see the number of calls for the most time-consuming parts of the code:

[screenshot: per-function statistics table sorted by cumulative time]

Apparently there are 20 calls to match_filenames, which is fine, but also 50020 calls to globify, which seems excessive.


Additional context

This may not seem like a big deal compared to the I/O involved in reading or downloading the data, or compared to the total processing time. However, in fogtools it considerably slows down unit tests that simulate downloading ABI data, including file selection for different parameters, which is where I notice it most.

djhoese commented 4 years ago

Nice issue! Thanks for filing this. Just some initial debugging to try to understand the icicle plot: there are 20 ABI file patterns listed in the YAML. The icicles are a little awkward to read because of the generators involved, but the main starting point is here:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L497-L506

Assuming self.sorted_filetype_items generates 20 items and self.filename_items_for_filetype calls globify once per filename for each pattern (it is actually less than that), the worst case for this loop should be calling globify 20 * 125 = 2500 times. Right? So where are these 50000 calls to globify coming from? It isn't a recursive function, and globify is only called in two places in yaml_reader.py:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L81

And:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L191

djhoese commented 4 years ago

@gerritholl I'd be curious what kind of difference you see if you move the globify call in this function outside the for loop:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L80-L82
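For illustration, a minimal sketch of that change, assuming the fnmatch, get_filebase and globify names from yaml_reader.py are in scope (an untested suggestion, not the final patch):

def match_filenames(filenames, pattern):
    """Get the filenames matching *pattern*."""
    matching = []
    glob_pattern = globify(pattern)  # hoisted out of the loop: one call per pattern
    for filename in filenames:
        if fnmatch(get_filebase(filename, pattern), glob_pattern):
            matching.append(filename)
    return matching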

gerritholl commented 4 years ago

I'll dig into why it's called so often later. I wonder if there is a tracing tool that can record the call stack every time globify (or any other function) is called and then show statistics on that. In principle it should be possible, so it has probably been done; I seem to recall having seen such a thing, but it was so long ago that I don't even remember for sure whether it was Python. I've done a "Monte Carlo" type investigation of such matters before by setting a conditional breakpoint with the condition random.random() < x.
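For illustration, one crude way to collect such statistics is to wrap globify in a shim that counts each call site; count_callers and the monkeypatching lines below are made up for this sketch:

import collections
import traceback

call_sites = collections.Counter()

def count_callers(func):
    """Wrap *func* so every call records the file and line of its immediate caller."""
    def wrapper(*args, **kwargs):
        caller = traceback.extract_stack(limit=2)[0]  # the frame that called the wrapper
        call_sites[(caller.filename, caller.lineno)] += 1
        return func(*args, **kwargs)
    return wrapper

# hypothetical usage, before running the reproduction script:
#   import satpy.readers.yaml_reader as yr
#   yr.globify = count_callers(yr.globify)
# afterwards, call_sites.most_common() shows where the calls come from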

djhoese commented 4 years ago

You could even just add a print statement inside globify that uses a global counter, or do print("globify") and then, when you run the code from the command line, count with python my_script.py | grep globify | wc -l. But yeah, doing it the right way would be nice :wink:

gerritholl commented 4 years ago

Using the ancient trace module that comes with Python,

python -m trace -c -C . /tmp/mwe37.py

reveals that

    1: def globify(fmt, keyvals=None):
           """Generate a string usable with glob.glob() from format string
           *fmt* and *keyvals* dictionary.
           """
50020:     if keyvals is None:
50020:         keyvals = {}
50020:     return globify_formatter.format(fmt, **keyvals)

From the same output, we can tell that globify is called from two places in yaml_reader:

50000:         if fnmatch(get_filebase(filename, pattern), globify(pattern)):
   20:             globified = globify(pattern)

A closer look at match_filenames reveals the loop is executed 50000 times:

    1: def match_filenames(filenames, pattern):
           """Get the filenames matching *pattern*."""
   20:     matching = []

50020:     for filename in filenames:
50000:         if fnmatch(get_filebase(filename, pattern), globify(pattern)):
 2000:             matching.append(filename)

   20:     return matching

Inspecting this with the debugger confirms that len(filenames) == 2500, that's 125 filenames * 20 patterns. This function is called 20 times. So something is being inefficient, as instead of `20*125 = 2500` times, globify is called `20*20*125 = 50000` times.

djhoese commented 4 years ago

:confused: So why would filenames be that long? OK, thanks. If you aren't already planning to work on this, I can take a look next week.

djhoese commented 4 years ago

Figured part of it out:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L190-L193

If the various patterns, once converted to glob patterns, match the same files, there is nothing to stop duplicate filenames from being present until later on, when we use sets (which I know we do in some places, but I'm not even sure if we do in this method).

Edit: Thinking about this more, we should make a set of the glob patterns, then a set of the results.
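A minimal sketch of that idea; the file_patterns and base_dir names below are illustrative stand-ins, not the actual yaml_reader variables:

import glob
import os
from trollsift.parser import globify

file_patterns = ["{platform}_{start_time:%Y%m%d%H%M}.nc"]  # stand-in for the reader's patterns
base_dir = "/path/to/data"

# deduplicate patterns that globify to the same glob expression...
glob_patterns = {globify(pattern) for pattern in file_patterns}
# ...and deduplicate files matched by more than one pattern
filenames = set()
for glob_pattern in glob_patterns:
    filenames.update(glob.iglob(os.path.join(base_dir, glob_pattern)))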

djhoese commented 4 years ago

Starting to work on this. I'm using a directory with 5763 ABI files in it (mesoscale) and running:

import sys
from datetime import datetime
from satpy.readers import find_files_and_readers

def main():
    base_dir = '/data/satellite/abi/20180911_florence'
    reader = 'abi_l1b'
    st = datetime(2018, 9, 11, 0, 0, 0)
    return find_files_and_readers(
        base_dir=base_dir,
        reader=reader,
        start_time=st,
    )

if __name__ == "__main__":
    sys.exit(main())

I ran this in PyCharm's profiler, which calls cProfile.

[screenshot: profiler output before any changes]

globify is called 115220 times and takes up ~11.5 seconds of the total ~16.5 seconds.

djhoese commented 4 years ago

Moving the globify call outside of the loop in:

https://github.com/pytroll/satpy/blob/65cd83f9a6bf301e131a1191a622b52609d42ac8/satpy/readers/yaml_reader.py#L80-L82

cut my run time by 75%. Here's the new call graph (globify isn't even listed as a main actor anymore):

[screenshot: call graph after the change]

I'll make a PR and then work on additional optimizations.