prjemian closed this issue 2 years ago
Examined all the SPEC data files (about three dozen) in the repository to test the execution time of spec.SpecDataFile(file), compared against the number of scans in each file:
A correlation exists, but the chart shows that other factors clearly influence the execution time as well.
One of these additional factors is the number of lines in the file:
Of these additional factors, file size shows the strongest correlation:
Three methods were compared for parsing the file into scans:
method | comments |
---|---|
state machine | the existing method: accumulate lines one at a time and split when a new block starts |
str compare | find line offset for each block, then slice full buffer (slightly faster than state machine since it does not call list.append() as much) |
regexp | use regular expression parser to find offsets |
The file loader and parser is near optimal among these methods. Will keep the str compare method and discard the other two.
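To illustrate the str compare and regexp rows of the table above, here is a minimal sketch of both approaches. It is not the spec2nexus implementation: the function names are hypothetical, and it assumes each scan block begins with a line starting with `#S ` and ignores the file header that precedes the first scan.

```python
import re

# Assumption: every scan block starts with a line beginning "#S ".
SCAN_MARKER = "#S "


def split_by_str_compare(buf):
    """Find the offset of each scan block, then slice the full buffer."""
    offsets = []
    pos = 0
    for line in buf.splitlines(keepends=True):
        if line.startswith(SCAN_MARKER):
            offsets.append(pos)
        pos += len(line)  # keepends=True, so lengths add up to len(buf)
    offsets.append(len(buf))  # sentinel marking the end of the last block
    return [buf[a:b] for a, b in zip(offsets, offsets[1:])]


def split_by_regexp(buf):
    """Same slicing, with a regular expression locating the offsets."""
    offsets = [m.start() for m in re.finditer(r"^#S ", buf, flags=re.MULTILINE)]
    offsets.append(len(buf))
    return [buf[a:b] for a, b in zip(offsets, offsets[1:])]


if __name__ == "__main__":
    sample = "#F demo\n#E 1\n\n#S 1 ascan\n1 2\n\n#S 2 ascan\n3 4\n"
    for block in split_by_str_compare(sample):
        print(repr(block))
```

Both functions build one list of offsets and slice the buffer once per scan, which avoids appending every line to a per-block list as the state machine does.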
For completeness, here is the Python code and the measured data:
"""
Development for issue #229: optimize the load speed
"""

import pathlib
import spec2nexus
from spec2nexus import spec
import time


def main():
    print(f"{spec2nexus.__version__ = }")
    examples = pathlib.Path(spec.__file__).parent / "data"

    def get_test_files():
        path = examples.parent / "tests" / "data"
        yield from examples.iterdir()
        yield from path.iterdir()

    for example_file in get_test_files():
        if not spec.is_spec_file(str(example_file)):
            continue
        with open(example_file, "r") as fp:
            buf = fp.read()
        size = len(buf) / 1024. / 1024.
        lines = len(buf.splitlines())
        t0 = time.time()
        specData = spec.SpecDataFile(str(example_file))
        dt = time.time() - t0
        print(
            f"{size:.5f}"
            f" {lines}"
            f" {len(specData.scans)}"
            f" {dt:.5f}"
            f" {example_file.name}"
        )


if __name__ == "__main__":
    t0 = time.time()
    main()
    print(f"{(time.time() - t0) = :.3f}s")
MB | lines | scans | state machine (s) | regexp (s) | str compare (s) | SPEC data file |
---|---|---|---|---|---|---|
0.00079 | 19 | 1 | 0.0002 | 0.00034 | 0.00022 | issue216_scan1.spec |
0.00175 | 40 | 3 | 0.0004 | 0.00055 | 0.00037 | refresh1.txt |
0.00208 | 32 | 2 | 0.0003 | 0.00046 | 0.00024 | refresh2.txt |
0.00303 | 38 | 1 | 0.0002 | 0.00063 | 0.00019 | refresh3.txt |
0.00307 | 69 | 1 | 0.0003 | 0.00055 | 0.00024 | n_m.txt |
0.00371 | 64 | 1 | 0.0004 | 0.00068 | 0.00039 | issue196_data.txt |
0.00373 | 64 | 1 | 0.0003 | 0.00069 | 0.00039 | issue196_data2.txt |
0.00531 | 94 | 1 | 0.0014 | 0.00171 | 0.00129 | issue64_data.txt |
0.00851 | 160 | 1 | 0.0022 | 0.00282 | 0.00211 | test_3_error.spec |
0.00879 | 119 | 2 | 0.0006 | 0.00142 | 0.00067 | user6idd.dat |
0.01261 | 164 | 1 | 0.0014 | 0.00253 | 0.0014 | issue82_data.txt |
0.02111 | 357 | 1 | 0.0024 | 0.00454 | 0.00237 | test_3.spec |
0.02127 | 358 | 1 | 0.0023 | 0.00422 | 0.00241 | test_4.spec |
0.02996 | 378 | 7 | 0.0011 | 0.00386 | 0.00096 | usaxs-bluesky-specwritercallback.dat |
0.09876 | 1754 | 39 | 0.005 | 0.01391 | 0.00518 | 05_02_test.dat |
0.1485 | 2162 | 20 | 0.0048 | 0.01838 | 0.0042 | APS_spec_data.dat |
0.20014 | 3251 | 50 | 0.0084 | 0.02806 | 0.00865 | 02_03_setup.dat |
0.23882 | 3381 | 39 | 0.0081 | 0.03028 | 0.00807 | issue119_data.txt |
0.31688 | 3388 | 41 | 0.0078 | 0.03793 | 0.00797 | issue161_spock_spec_file |
0.3484 | 6238 | 3 | 0.0065 | 0.0405 | 0.00623 | issue109_data.txt |
0.41327 | 5897 | 62 | 0.0142 | 0.05246 | 0.01445 | 03_06_JanTest.dat |
1.17703 | 11968 | 17 | 0.1192 | 0.23057 | 0.11814 | 33bm_spec.dat |
1.19429 | 6420 | 74 | 0.0273 | 0.13604 | 0.02776 | CdOsO |
1.24014 | 7000 | 102 | 0.0305 | 0.13988 | 0.0311 | CdSe |
1.2558 | 33786 | 106 | 0.0472 | 0.15173 | 0.04648 | 33id_spec.dat |
1.47743 | 15457 | 114 | 0.0421 | 0.17049 | 0.04254 | JL124_1.spc |
1.59216 | 14092 | 171 | 0.0531 | 0.18463 | 0.0515 | spec_from_spock.spc |
2.80459 | 18115 | 37 | 0.0461 | 0.29338 | 0.04631 | YSZ011_ALDITO_Fe2O3_planar_fired_1.spc |
3.46591 | 54520 | 262 | 0.1376 | 0.40644 | 0.13637 | lmn40.spe |
6.77357 | 13867 | 2 | 0.0948 | 0.70005 | 0.09359 | mca_spectra_example.dat |
9.28389 | 106743 | 53 | 0.1845 | 0.97799 | 0.18753 | startup_1.spec |
12.91098 | 189522 | 878 | 1.3159 | 1.74028 | 1.24546 | xpcs_plugin_sample.spec |
Adding the large data file cited in the original statement of the problem, the graphs show more correlation with number of scans:
Also, a trend emerges for files of size s ~10MB and larger, where the time to load t grows much faster than linearly. Below ~5MB, the power-law exponent is ~1 and t ~ s^1. Above this size, the exponent increases to >5: t ~ s^5 or a higher exponent.
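The exponent in a relation of the form t ~ s^k can be estimated with a straight-line fit in log-log space. The sketch below is one way to do that in plain Python. The helper name `fit_power_law` is hypothetical, and the sample points are the size and str compare columns of four rows copied from the table above (all below ~5MB). This is not the procedure used to produce the charts.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(a) + k * log(s); returns (a, k)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    k = sxy / sxx               # slope in log-log space = power-law exponent
    a = math.exp(my - k * mx)   # prefactor
    return a, k


if __name__ == "__main__":
    # (MB, seconds) taken from the "str compare" column of the table
    sizes = [0.09876, 0.41327, 1.2558, 3.46591]
    times = [0.00518, 0.01445, 0.04648, 0.13637]
    a, k = fit_power_law(sizes, times)
    print(f"t ~ {a:.4f} * s^{k:.2f}")
```

Fitting the small-file and large-file ranges separately gives one exponent per regime, which is the comparison described above.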
@jilavsky: This is important for some of your users. Try to keep the SPEC file sizes no larger than ~5MB or they will experience slow load times.
From https://github.com/APS-USAXS/livedata/issues/55#issuecomment-938326458, it is obvious that the task of reading a SPEC data file scales with the number of scans in the file, more so than the sheer size of the file.
UPDATE: Per finding below, file size is the major factor.
Optimize the time it takes for spec2nexus.spec.SpecDataFile(specFile) to (initially) open a data file and break it into scans. The initial read of a scan will need these attributes available to decide if any of the scans needs a full parse:

- date
- scanNum
- scanCmd
- raw
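A minimal sketch of the idea above: read only the header attributes when the file is first split into scans, and defer the full parse until it is requested. The `LazyScan` class and its `full_parse` method are hypothetical names, not the spec2nexus API. The sketch assumes that the `#S` line carries the scan number and command, that a `#D` line carries the date, and that both appear before the first data row.

```python
from dataclasses import dataclass, field


@dataclass
class LazyScan:
    """Holds one scan's raw text; only the header is read up front."""

    raw: str
    scanNum: str = ""
    scanCmd: str = ""
    date: str = ""
    data_lines: list = field(default_factory=list)
    _parsed: bool = False

    def __post_init__(self):
        for line in self.raw.splitlines():
            if line.startswith("#S "):
                parts = line.split(None, 2)  # "#S", number, command
                if len(parts) > 1:
                    self.scanNum = parts[1]
                if len(parts) > 2:
                    self.scanCmd = parts[2]
            elif line.startswith("#D "):
                self.date = line[3:].strip()
            elif line and not line.startswith("#"):
                break  # assumption: the header ends at the first data row

    def full_parse(self):
        """Collect the data rows on first use, then reuse the result."""
        if not self._parsed:
            self.data_lines = [
                ln for ln in self.raw.splitlines()
                if ln and not ln.startswith("#")
            ]
            self._parsed = True
        return self.data_lines


if __name__ == "__main__":
    text = "#S 3 ascan th 0 1 10 1\n#D Mon Jan 1 00:00:00 2021\n#L th det\n0 1\n1 2\n"
    scan = LazyScan(text)
    print(scan.scanNum, scan.scanCmd, scan.date)
    print(scan.full_parse())
```

With this layout, opening a file costs one header scan per block, and the per-row work is only paid for the scans that are actually used.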
Thanks, @jilavsky, for providing a stunning example of a single data file with 4,800+ scans.