prjemian / spec2nexus

Converts SPEC data files and scans into NeXus HDF5 files
https://prjemian.github.io/spec2nexus/

spec: optimize time to open data file and break into scans #229

Closed prjemian closed 2 years ago

prjemian commented 3 years ago

From https://github.com/APS-USAXS/livedata/issues/55#issuecomment-938326458, it is obvious that the task of reading a SPEC data file scales with the number of scans in the file, more so than the sheer size of the file.

UPDATE: Per finding below, file size is the major factor.

Optimize the time it takes for spec2nexus.spec.SpecDataFile(specFile) to (initially) open a data file and break it into scans. The initial read must make these attributes available so a caller can decide whether any scan needs a full parse:

| attribute | meaning |
| --------- | ------- |
| `date`    | the start time of the scan |
| `scanNum` | the SPEC scan number |
| `scanCmd` | the SPEC command that started the scan |
| `raw`     | the complete text of the scan |

Thanks, @jilavsky, for providing a stunning example of a single data file with 4,800+ scans.
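To illustrate the idea (this is a hypothetical sketch, not the spec2nexus API): a cheap header-only pass can populate `date`, `scanNum`, and `scanCmd` while keeping `raw` unparsed, deferring the expensive full parse until the scan data is actually requested.

```python
class LazyScan:
    """Hypothetical sketch: parse scan headers eagerly, data lazily."""

    def __init__(self, raw):
        self.raw = raw  # complete text of the scan block
        self._parsed = None
        # Cheap header-only pass: just the #S and #D control lines.
        for line in raw.splitlines():
            if line.startswith("#S "):
                parts = line.split(None, 2)
                self.scanNum = int(parts[1])   # SPEC scan number
                self.scanCmd = parts[2]        # command that started the scan
            elif line.startswith("#D "):
                self.date = line[3:].strip()   # start time of the scan

    @property
    def data(self):
        # Full parse happens only on first access, then is cached.
        if self._parsed is None:
            self._parsed = self._full_parse()
        return self._parsed

    def _full_parse(self):
        # Placeholder for the expensive work (data columns, MCA spectra, ...).
        return [line for line in self.raw.splitlines() if not line.startswith("#")]


scan = LazyScan("#S 1  ascan th 0 1 10 0.2\n#D Mon Oct 11 2021\n1 2 3\n")
print(scan.scanNum, scan.scanCmd)  # header available without a full parse
```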

prjemian commented 2 years ago

Examining all the SPEC data files (ca. 3 dozen) in the repository, test the execution time of spec.SpecDataFile(file) and compare it against the number of scans in the file:

[figure: execution time vs. number of scans]

A correlation exists, but the scatter in the chart shows that other factors also influence the execution time.

prjemian commented 2 years ago

Among these additional factors, the number of lines in the file:

[figure: execution time vs. number of lines in the file]

prjemian commented 2 years ago

Among these additional factors, file size shows the strongest correlation:

[figure: execution time vs. file size]
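The strength of each correlation in the charts above can be quantified with a Pearson coefficient. A minimal stdlib-only sketch (the sample values below are illustrative, not the measured data):

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative (MB, seconds) pairs only, not the measured data:
sizes = [0.01, 0.1, 0.35, 1.2, 3.5, 9.3]
times = [0.001, 0.005, 0.007, 0.03, 0.14, 0.18]
print(f"r(size, time) = {pearson(sizes, times):.3f}")
```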

prjemian commented 2 years ago

The three methods to parse the file into scans:

| method | comments |
| ------ | -------- |
| state machine | the existing method; accumulate each line, split when a new block starts |
| str compare | find the line offset of each block, then slice the full buffer (slightly faster than the state machine since it does not call `list.append()` as much) |
| regexp | use a regular expression parser to find the offsets |

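A minimal sketch of the "str compare" idea: locate the offset of every line beginning with `#S ` and slice the full buffer once per scan, instead of appending lines one at a time. This is an illustration, not the spec2nexus implementation; for simplicity it ignores any file header before the first `#S` line.

```python
def split_scans(buf):
    """Split a SPEC file buffer into scan blocks by slicing at '#S ' offsets."""
    offsets = []
    if buf.startswith("#S "):
        offsets.append(0)
    pos = buf.find("\n#S ")
    while pos >= 0:
        offsets.append(pos + 1)        # +1: skip the newline itself
        pos = buf.find("\n#S ", pos + 1)
    offsets.append(len(buf))           # sentinel for the final slice
    # One big slice per scan; no per-line list.append() calls.
    return [buf[a:b] for a, b in zip(offsets, offsets[1:])]


demo = "#F demo\n#S 1  ascan\n1 2\n#S 2  mesh\n3 4\n"
blocks = split_scans(demo)  # two scan blocks; the "#F" header is skipped here
```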
prjemian commented 2 years ago

Among these methods, the file loader and parser is already near optimal. The str compare method will be kept; the other two are discarded.

prjemian commented 2 years ago

For completeness, here are the Python code and the measured data:

"""
Development for issue #229: optimize the load speed
"""

import pathlib
import spec2nexus
from spec2nexus import spec
import time

def main():
    print(f"{spec2nexus.__version__ = }")
    examples = pathlib.Path(spec.__file__).parent / "data"

    def get_test_files():
        path = examples.parent / "tests" / "data"
        yield from examples.iterdir()
        yield from path.iterdir()

    for example_file in get_test_files():
        if not spec.is_spec_file(str(example_file)):
            continue

        with open(example_file, "r") as fp:
            buf = fp.read()
            size = len(buf) / 1024. / 1024.
            lines = len(buf.splitlines())

        t0 = time.time()
        specData = spec.SpecDataFile(str(example_file))
        dt = time.time() - t0
        print(
            f"{size:.5f}"
            f"  {lines}"
            f"  {len(specData.scans)}"
            f"  {dt:.5f}"
            f"  {example_file.name}"
        )

if __name__ == "__main__":
    t0 = time.time()
    main()
    print(f"{(time.time() - t0) = :.3f}s")

| MB | lines | scans | state machine (s) | regexp (s) | str compare (s) | SPEC data file |
| --: | --: | --: | --: | --: | --: | :-- |
| 0.00079 | 19 | 1 | 0.0002 | 0.00034 | 0.00022 | issue216_scan1.spec |
| 0.00175 | 40 | 3 | 0.0004 | 0.00055 | 0.00037 | refresh1.txt |
| 0.00208 | 32 | 2 | 0.0003 | 0.00046 | 0.00024 | refresh2.txt |
| 0.00303 | 38 | 1 | 0.0002 | 0.00063 | 0.00019 | refresh3.txt |
| 0.00307 | 69 | 1 | 0.0003 | 0.00055 | 0.00024 | n_m.txt |
| 0.00371 | 64 | 1 | 0.0004 | 0.00068 | 0.00039 | issue196_data.txt |
| 0.00373 | 64 | 1 | 0.0003 | 0.00069 | 0.00039 | issue196_data2.txt |
| 0.00531 | 94 | 1 | 0.0014 | 0.00171 | 0.00129 | issue64_data.txt |
| 0.00851 | 160 | 1 | 0.0022 | 0.00282 | 0.00211 | test_3_error.spec |
| 0.00879 | 119 | 2 | 0.0006 | 0.00142 | 0.00067 | user6idd.dat |
| 0.01261 | 164 | 1 | 0.0014 | 0.00253 | 0.0014 | issue82_data.txt |
| 0.02111 | 357 | 1 | 0.0024 | 0.00454 | 0.00237 | test_3.spec |
| 0.02127 | 358 | 1 | 0.0023 | 0.00422 | 0.00241 | test_4.spec |
| 0.02996 | 378 | 7 | 0.0011 | 0.00386 | 0.00096 | usaxs-bluesky-specwritercallback.dat |
| 0.09876 | 1754 | 39 | 0.005 | 0.01391 | 0.00518 | 05_02_test.dat |
| 0.1485 | 2162 | 20 | 0.0048 | 0.01838 | 0.0042 | APS_spec_data.dat |
| 0.20014 | 3251 | 50 | 0.0084 | 0.02806 | 0.00865 | 02_03_setup.dat |
| 0.23882 | 3381 | 39 | 0.0081 | 0.03028 | 0.00807 | issue119_data.txt |
| 0.31688 | 3388 | 41 | 0.0078 | 0.03793 | 0.00797 | issue161_spock_spec_file |
| 0.3484 | 6238 | 3 | 0.0065 | 0.0405 | 0.00623 | issue109_data.txt |
| 0.41327 | 5897 | 62 | 0.0142 | 0.05246 | 0.01445 | 03_06_JanTest.dat |
| 1.17703 | 11968 | 17 | 0.1192 | 0.23057 | 0.11814 | 33bm_spec.dat |
| 1.19429 | 6420 | 74 | 0.0273 | 0.13604 | 0.02776 | CdOsO |
| 1.24014 | 7000 | 102 | 0.0305 | 0.13988 | 0.0311 | CdSe |
| 1.2558 | 33786 | 106 | 0.0472 | 0.15173 | 0.04648 | 33id_spec.dat |
| 1.47743 | 15457 | 114 | 0.0421 | 0.17049 | 0.04254 | JL124_1.spc |
| 1.59216 | 14092 | 171 | 0.0531 | 0.18463 | 0.0515 | spec_from_spock.spc |
| 2.80459 | 18115 | 37 | 0.0461 | 0.29338 | 0.04631 | YSZ011_ALDITO_Fe2O3_planar_fired_1.spc |
| 3.46591 | 54520 | 262 | 0.1376 | 0.40644 | 0.13637 | lmn40.spe |
| 6.77357 | 13867 | 2 | 0.0948 | 0.70005 | 0.09359 | mca_spectra_example.dat |
| 9.28389 | 106743 | 53 | 0.1845 | 0.97799 | 0.18753 | startup_1.spec |
| 12.91098 | 189522 | 878 | 1.3159 | 1.74028 | 1.24546 | xpcs_plugin_sample.spec |

prjemian commented 2 years ago

After adding the large data file cited in the original statement of the problem, the graphs show a stronger correlation with the number of scans:

[figures: execution time vs. number of scans, number of lines, and file size, including the large file]

Also, a trend emerges for files of size s ~10 MB and larger, where the time to load t grows much faster. Below ~5 MB, the power-law exponent is ~1, so t ~ s^1. Above this size, the exponent increases to >5: t ~ s^5 or steeper.
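The exponent claim can be checked from any two points on a log-log plot: if t ~ s^k, then k is the slope log(t2/t1) / log(s2/s1). A small sketch with illustrative (size, time) pairs, not the measured data:

```python
import math


def loglog_slope(s1, t1, s2, t2):
    # Slope of the log-log line through (s1, t1) and (s2, t2):
    # if t ~ s**k, then k = log(t2/t1) / log(s2/s1).
    return math.log(t2 / t1) / math.log(s2 / s1)


# Illustrative (file size in MB, load time in s) pairs only:
print(f"small files: k = {loglog_slope(0.35, 0.007, 3.5, 0.07):.2f}")
print(f"large files: k = {loglog_slope(9.0, 0.2, 13.0, 1.3):.2f}")
```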

@jilavsky: This is important for some of your users. Advise them to keep SPEC file sizes no larger than ~5 MB, or they will experience slow load times.