argerlt closed this 1 year ago
This is a nice speedup! Do you see the same improvement with all supported file types in orix? If so then I think it is worth considering.
Another option would be to not have pandas as an explicit dependency, but use pandas if installed. I think this would be doable as it seems it would only affect the io module.
@hakonanes what do you think?
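The optional-dependency idea above could look roughly like the sketch below. It is only an illustration, not orix's actual code: `read_numeric_table` and its `use_pandas` argument are hypothetical names, and the file is assumed to be whitespace-delimited numeric data with `#` comment lines.

```python
import numpy as np

try:  # pandas is optional: use it only if it happens to be installed
    import pandas as pd
except ImportError:
    pd = None


def read_numeric_table(filename, use_pandas=None):
    """Read a whitespace-delimited numeric file with '#' comment lines.

    Hypothetical helper, not part of orix. If `use_pandas` is None, pandas
    is used when it is installed and NumPy otherwise.
    """
    if use_pandas is None:
        use_pandas = pd is not None
    if use_pandas:
        if pd is None:
            raise ImportError("pandas was requested but is not installed")
        # Raw string so that the regex \s+ is not treated as a string escape
        frame = pd.read_csv(filename, comment="#", header=None, sep=r"\s+")
        return frame.to_numpy()
    # Fallback: np.loadtxt skips '#' comment lines by default
    return np.loadtxt(filename)
```

With this shape, only the io module would need to know about pandas, and users without pandas would get the NumPy path unchanged.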
Thank you for looking into speeding up reading, @argerlt.
Whether NumPy or pandas is faster seems to me to depend on file size and/or machine architecture, since I find NumPy to be faster. Reading the AF96 dataset file Field of view 1_EBSD data_Raw.ang with %timeit (seven runs of five loops each) gives me the following results:
# file_data = np.loadtxt(filename)
>>> np.__version__
'1.23.5'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.78 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
# file_data = pd.read_csv(filename, comment="#", header=None, sep=r"\s+").to_numpy()
>>> pd.__version__
'1.5.2'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.91 s ± 67 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
I suggest we continue using NumPy until more people report a speed-up > 1.7x on their machines.
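For anyone who wants to report timings from their own machine without IPython, here is a minimal sketch using the standard-library `timeit` module. The helper names `time_reader` and `make_test_file` are made up for illustration, and the synthetic file is only a stand-in for a real .ang file, so absolute numbers will not match the ones above.

```python
import os
import tempfile
import timeit

import numpy as np


def time_reader(reader, filename, repeat=7, number=5):
    """Return per-call times in seconds, one per run (like %timeit -r 7 -n 5)."""
    totals = timeit.repeat(lambda: reader(filename), repeat=repeat, number=number)
    return [total / number for total in totals]


def make_test_file(n_rows=5000, n_cols=10):
    """Write a synthetic whitespace-delimited file with one '#' header line."""
    rng = np.random.default_rng(0)
    handle, path = tempfile.mkstemp(suffix=".txt")
    os.close(handle)
    np.savetxt(path, rng.random((n_rows, n_cols)), header="synthetic data")
    return path


if __name__ == "__main__":
    path = make_test_file()
    times = time_reader(np.loadtxt, path)
    print(f"np.loadtxt: {np.mean(times):.4f} s ± {np.std(times):.4f} s per loop")
    os.remove(path)
```

Passing a different callable as `reader` (for example a pandas-based one) gives a like-for-like comparison on the same file.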
Quick update: Looking at your speeds, it surprised me just how much faster your numpy was compared to mine. Then I found this bullet point in the changelog for numpy 1.23.0:
... The highlights are:
- Implementation of loadtxt in C, greatly improving its performance.
- Exposing DLPack at the Python level for easy data exchange.
- Changes to the promotion and comparisons of structured dtypes.
- Improvements to f2py.
This completely negates the need for pandas. The only thing that separated pandas' reader from numpy's was its compiled (Cythonic) reader, and numpy's loadtxt is now implemented in C as well. Side note, I've been using this speedup trick since 2017, and it's exciting to see that it's finally obsolete.
I'm closing this, and just adding the comment that setting the required numpy package to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.
There it is, thank you for searching NumPy's changelog for the cause of the discrepancy.
> adding the comment that setting the required numpy package to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.
Good point. I myself regularly update my environment's packages to use the latest versions.
Side note: Currently we don't have a lower bound on the NumPy version, but require Matplotlib >= 3.3, which requires NumPy >= 1.19. Requiring >= 1.23 would be too restrictive at the moment, I think, since this is the current minor release.
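If a hard lower bound of 1.23 is too restrictive, one alternative would be a soft runtime check that only tells users when they are on the slower path. The sketch below is an assumption about how that could be done, not something orix implements; `has_fast_loadtxt` is a hypothetical helper, and it only inspects the major and minor parts of the version string.

```python
import warnings

import numpy as np


def has_fast_loadtxt(version=np.__version__):
    """Return True if `version` is NumPy 1.23 or newer (loadtxt implemented in C)."""
    # Only the first two components are needed, so suffixes such as
    # "rc1" or ".dev0" in the third component do not matter here.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 23)


if not has_fast_loadtxt():
    warnings.warn(
        "NumPy < 1.23 detected: np.loadtxt uses the slower Python "
        "implementation, so reading large text files may take longer."
    )
```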
This could probably be tacked on to #416, but changing lines 88 and 89 of io\plugins\ang.py from the NumPy-based reader to a pandas-based one roughly doubles the read speed of io.load. Here is a code snippet using the AF96 datasets to show what I mean.
It is really a question of "is adding pandas to orix's dependencies worth the 2x speedup?"