prjemian / spec2nexus

Converts SPEC data files and scans into NeXus HDF5 files
https://prjemian.github.io/spec2nexus/

spec: optimize time to open data file and break into scans #229

Closed prjemian closed 2 years ago

prjemian commented 3 years ago

From https://github.com/APS-USAXS/livedata/issues/55#issuecomment-938326458, it is obvious that the task of reading a SPEC data file scales with the number of scans in the file, more so than the sheer size of the file.

UPDATE: Per finding below, file size is the major factor.

Optimize the time it takes for spec2nexus.spec.SpecDataFile(specFile) to (initially) open a data file and break it into scans. The initial read must make these attributes available so a caller can decide whether any scan needs a full parse:

| attribute | meaning |
| --------- | ------- |
| `date`    | the start time of the scan |
| `scanNum` | the SPEC scan number |
| `scanCmd` | the SPEC command that started the scan |
| `raw`     | the complete text of the scan |

Thanks, @jilavsky, for providing a stunning example of a single data file with 4,800+ scans.
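To illustrate the idea (this is a hypothetical sketch, not the spec2nexus API): a cheap header-only pass can populate `date`, `scanNum`, and `scanCmd` while keeping `raw` unparsed, deferring the expensive full parse until the scan data is actually requested.

```python
class LazyScan:
    """Hypothetical sketch: parse scan headers eagerly, data lazily."""

    def __init__(self, raw):
        self.raw = raw  # complete text of the scan block
        self._parsed = None
        # Cheap header-only pass: just the #S and #D control lines.
        for line in raw.splitlines():
            if line.startswith("#S "):
                parts = line.split(None, 2)
                self.scanNum = int(parts[1])   # SPEC scan number
                self.scanCmd = parts[2]        # command that started the scan
            elif line.startswith("#D "):
                self.date = line[3:].strip()   # start time of the scan

    @property
    def data(self):
        # Full parse happens only on first access, then is cached.
        if self._parsed is None:
            self._parsed = self._full_parse()
        return self._parsed

    def _full_parse(self):
        # Placeholder for the expensive work (data columns, MCA spectra, ...).
        return [line for line in self.raw.splitlines() if not line.startswith("#")]


scan = LazyScan("#S 1  ascan th 0 1 10 0.2\n#D Mon Oct 11 2021\n1 2 3\n")
print(scan.scanNum, scan.scanCmd)  # header available without a full parse
```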

prjemian commented 2 years ago

Examining all the SPEC data files (ca. 3 dozen) in the repository, test the execution time of spec.SpecDataFile(file) and compare it against the number of scans in the file:

[figure: execution time vs. number of scans]

A correlation exists, but the scatter in the chart shows that other factors also influence the execution time.

prjemian commented 2 years ago

Among these additional factors, the number of lines in the file:

[figure: execution time vs. number of lines in the file]

prjemian commented 2 years ago

Among these additional factors, file size shows the strongest correlation:

[figure: execution time vs. file size]
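The strength of each correlation in the charts above can be quantified with a Pearson coefficient. A minimal stdlib-only sketch (the sample values below are illustrative, not the measured data):

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative (MB, seconds) pairs only, not the measured data:
sizes = [0.01, 0.1, 0.35, 1.2, 3.5, 9.3]
times = [0.001, 0.005, 0.007, 0.03, 0.14, 0.18]
print(f"r(size, time) = {pearson(sizes, times):.3f}")
```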

prjemian commented 2 years ago

The three methods to parse the file into scans:

| method | comments |
| ------ | -------- |
| state machine | the existing method; accumulate each line, split when a new block starts |
| str compare | find the line offset of each block, then slice the full buffer (slightly faster than the state machine since it does not call `list.append()` as much) |
| regexp | use a regular expression parser to find the offsets |

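A minimal sketch of the "str compare" idea: locate the offset of every line beginning with `#S ` and slice the full buffer once per scan, instead of appending lines one at a time. This is an illustration, not the spec2nexus implementation; for simplicity it ignores any file header before the first `#S` line.

```python
def split_scans(buf):
    """Split a SPEC file buffer into scan blocks by slicing at '#S ' offsets."""
    offsets = []
    if buf.startswith("#S "):
        offsets.append(0)
    pos = buf.find("\n#S ")
    while pos >= 0:
        offsets.append(pos + 1)        # +1: skip the newline itself
        pos = buf.find("\n#S ", pos + 1)
    offsets.append(len(buf))           # sentinel for the final slice
    # One big slice per scan; no per-line list.append() calls.
    return [buf[a:b] for a, b in zip(offsets, offsets[1:])]


demo = "#F demo\n#S 1  ascan\n1 2\n#S 2  mesh\n3 4\n"
blocks = split_scans(demo)  # two scan blocks; the "#F" header is skipped here
```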
prjemian commented 2 years ago

Among these methods, the file loader and parser is already near optimal. The str compare method will be kept; the other two are discarded.

prjemian commented 2 years ago

For completeness, here are the Python code and the measured data:

"""
Development for issue #229: optimize the load speed
"""

import pathlib
import spec2nexus
from spec2nexus import spec
import time

def main():
    print(f"{spec2nexus.__version__ = }")
    examples = pathlib.Path(spec.__file__).parent / "data"

    def get_test_files():
        path = examples.parent / "tests" / "data"
        yield from examples.iterdir()
        yield from path.iterdir()

    for example_file in get_test_files():
        if not spec.is_spec_file(str(example_file)):
            continue

        with open(example_file, "r") as fp:
            buf = fp.read()
            size = len(buf) / 1024. / 1024.
            lines = len(buf.splitlines())

        t0 = time.time()
        specData = spec.SpecDataFile(str(example_file))
        dt = time.time() - t0
        print(
            f"{size:.5f}"
            f"  {lines}"
            f"  {len(specData.scans)}"
            f"  {dt:.5f}"
            f"  {example_file.name}"
        )

if __name__ == "__main__":
    t0 = time.time()
    main()
    print(f"{(time.time() - t0) = :.3f}s")

| MB | lines | scans | state machine (s) | regexp (s) | str compare (s) | SPEC data file |
| --: | --: | --: | --: | --: | --: | :-- |
| 0.00079 | 19 | 1 | 0.0002 | 0.00034 | 0.00022 | issue216_scan1.spec |
| 0.00175 | 40 | 3 | 0.0004 | 0.00055 | 0.00037 | refresh1.txt |
| 0.00208 | 32 | 2 | 0.0003 | 0.00046 | 0.00024 | refresh2.txt |
| 0.00303 | 38 | 1 | 0.0002 | 0.00063 | 0.00019 | refresh3.txt |
| 0.00307 | 69 | 1 | 0.0003 | 0.00055 | 0.00024 | n_m.txt |
| 0.00371 | 64 | 1 | 0.0004 | 0.00068 | 0.00039 | issue196_data.txt |
| 0.00373 | 64 | 1 | 0.0003 | 0.00069 | 0.00039 | issue196_data2.txt |
| 0.00531 | 94 | 1 | 0.0014 | 0.00171 | 0.00129 | issue64_data.txt |
| 0.00851 | 160 | 1 | 0.0022 | 0.00282 | 0.00211 | test_3_error.spec |
| 0.00879 | 119 | 2 | 0.0006 | 0.00142 | 0.00067 | user6idd.dat |
| 0.01261 | 164 | 1 | 0.0014 | 0.00253 | 0.0014 | issue82_data.txt |
| 0.02111 | 357 | 1 | 0.0024 | 0.00454 | 0.00237 | test_3.spec |
| 0.02127 | 358 | 1 | 0.0023 | 0.00422 | 0.00241 | test_4.spec |
| 0.02996 | 378 | 7 | 0.0011 | 0.00386 | 0.00096 | usaxs-bluesky-specwritercallback.dat |
| 0.09876 | 1754 | 39 | 0.005 | 0.01391 | 0.00518 | 05_02_test.dat |
| 0.1485 | 2162 | 20 | 0.0048 | 0.01838 | 0.0042 | APS_spec_data.dat |
| 0.20014 | 3251 | 50 | 0.0084 | 0.02806 | 0.00865 | 02_03_setup.dat |
| 0.23882 | 3381 | 39 | 0.0081 | 0.03028 | 0.00807 | issue119_data.txt |
| 0.31688 | 3388 | 41 | 0.0078 | 0.03793 | 0.00797 | issue161_spock_spec_file |
| 0.3484 | 6238 | 3 | 0.0065 | 0.0405 | 0.00623 | issue109_data.txt |
| 0.41327 | 5897 | 62 | 0.0142 | 0.05246 | 0.01445 | 03_06_JanTest.dat |
| 1.17703 | 11968 | 17 | 0.1192 | 0.23057 | 0.11814 | 33bm_spec.dat |
| 1.19429 | 6420 | 74 | 0.0273 | 0.13604 | 0.02776 | CdOsO |
| 1.24014 | 7000 | 102 | 0.0305 | 0.13988 | 0.0311 | CdSe |
| 1.2558 | 33786 | 106 | 0.0472 | 0.15173 | 0.04648 | 33id_spec.dat |
| 1.47743 | 15457 | 114 | 0.0421 | 0.17049 | 0.04254 | JL124_1.spc |
| 1.59216 | 14092 | 171 | 0.0531 | 0.18463 | 0.0515 | spec_from_spock.spc |
| 2.80459 | 18115 | 37 | 0.0461 | 0.29338 | 0.04631 | YSZ011_ALDITO_Fe2O3_planar_fired_1.spc |
| 3.46591 | 54520 | 262 | 0.1376 | 0.40644 | 0.13637 | lmn40.spe |
| 6.77357 | 13867 | 2 | 0.0948 | 0.70005 | 0.09359 | mca_spectra_example.dat |
| 9.28389 | 106743 | 53 | 0.1845 | 0.97799 | 0.18753 | startup_1.spec |
| 12.91098 | 189522 | 878 | 1.3159 | 1.74028 | 1.24546 | xpcs_plugin_sample.spec |

prjemian commented 2 years ago

After adding the large data file cited in the original statement of the problem, the graphs show a stronger correlation with the number of scans:

[figures: execution time vs. number of scans, number of lines, and file size, including the large file]

Also, a trend emerges for files of size s ~10 MB and larger, where the time to load t grows much faster. Below ~5 MB, the power-law exponent is ~1, so t ~ s^1. Above this size, the exponent increases to >5: t ~ s^5 or steeper.
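The exponent claim can be checked from any two points on a log-log plot: if t ~ s^k, then k is the slope log(t2/t1) / log(s2/s1). A small sketch with illustrative (size, time) pairs, not the measured data:

```python
import math


def loglog_slope(s1, t1, s2, t2):
    # Slope of the log-log line through (s1, t1) and (s2, t2):
    # if t ~ s**k, then k = log(t2/t1) / log(s2/s1).
    return math.log(t2 / t1) / math.log(s2 / s1)


# Illustrative (file size in MB, load time in s) pairs only:
print(f"small files: k = {loglog_slope(0.35, 0.007, 3.5, 0.07):.2f}")
print(f"large files: k = {loglog_slope(9.0, 0.2, 13.0, 1.3):.2f}")
```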

@jilavsky: This is important for some of your users. Advise them to keep SPEC file sizes no larger than ~5 MB, or they will experience slow load times.