pyxem / orix

Analysing crystal orientations and symmetry in Python
https://orix.readthedocs.io
GNU General Public License v3.0

Speeding up .ang file reader with pandas #417

Closed argerlt closed 1 year ago

argerlt commented 1 year ago

This could probably be tacked on to #416, but changing lines 88 and 89 of io/plugins/ang.py from:

    # Read all file data
    file_data = np.loadtxt(filename)

to instead:

    # Read all file data
    import pandas as pd
    file_data = pd.read_csv(filename, comment="#", header=None, sep=r"\s+").to_numpy()

roughly doubles the read speed of io.load.

Here is a code snippet using the AF96 datasets to show what I mean.

    import time

    import numpy as np
    import pandas as pd

    tic = time.time()
    # The NumPy way
    A = np.loadtxt("4D-XIII-A_cleaned.ang")
    A_toc = time.time() - tic

    tic = time.time()
    # The pandas way, converting the DataFrame to a NumPy array afterwards
    B = pd.read_csv("4D-XIII-A_cleaned.ang", comment="#", header=None, sep=r"\s+")
    C = B.to_numpy()
    B_toc = time.time() - tic

    print(A_toc)
    print(B_toc)

>>> 13.67614459991455
>>> 8.081970691680908

Really, it boils down to the question: is adding pandas to orix's dependencies worth the roughly 2x speedup?

harripj commented 1 year ago

This is a nice speedup! Do you see the same improvement with all supported file types in orix? If so then I think it is worth considering.

Another option would be not to have pandas as an explicit dependency, but to use it if installed. I think this would be doable, as it seems it would only affect the io module.

@hakonanes what do you think?

hakonanes commented 1 year ago

Thank you for looking into speeding up reading, @argerlt.

Whether NumPy or pandas is faster seems to depend on file size and/or machine architecture, since on my machine NumPy is faster. Reading the AF96 dataset file Field of view 1_EBSD data_Raw.ang with %timeit (5 loops in each of 7 runs) gives me the following results:

# file_data = np.loadtxt(filename)
>>> np.__version__
'1.23.5'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.78 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

# file_data = pd.read_csv(filename, comment="#", header=None, sep="\s+").to_numpy()
>>> pd.__version__
'1.5.2'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.91 s ± 67 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

I suggest we continue using NumPy until more people report a speed-up greater than 1.7x on their machines.

argerlt commented 1 year ago

Quick update: looking at your timings, it surprised me just how much faster your NumPy runs were compared to mine. Then I found this bullet point in the changelog for NumPy 1.23.0:

... The highlights are:

  • Implementation of loadtxt in C, greatly improving its performance.
  • Exposing DLPack at the Python level for easy data exchange.
  • Changes to the promotion and comparisons of structured dtypes.
  • Improvements to f2py.

This completely negates the need for pandas. The only thing that separated pandas' reader from NumPy's was its use of a compiled (Cython) parser, which NumPy now matches with its C implementation. Side note: I've been using this speedup trick since 2017, and it's exciting to see it finally become obsolete.

I'm closing this, and just adding the comment that requiring numpy >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.
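For anyone who wants to check whether their environment already benefits, here is a small sketch (the variable names are my own) that tests at runtime whether the installed NumPy includes the C-based loadtxt:

```python
import numpy as np

# NumPy 1.23.0 reimplemented np.loadtxt in C; earlier versions use the
# much slower pure-Python parser.
major, minor = (int(part) for part in np.__version__.split(".")[:2])
has_fast_loadtxt = (major, minor) >= (1, 23)

print(f"NumPy {np.__version__}: fast loadtxt = {has_fast_loadtxt}")
```

In practice one would simply declare the lower bound (e.g. `numpy >= 1.23`) in the package metadata rather than branch at runtime.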

hakonanes commented 1 year ago

There it is, thank you for searching NumPy's changelog for the cause of the discrepancy.

> adding the comment that setting the required numpy package to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.

Good point. I myself regularly update my environment's packages to use the latest versions.

Side note: Currently we don't have a lower bound on the NumPy version, but require Matplotlib >= 3.3, which requires NumPy >= 1.19. Requiring >= 1.23 would be too restrictive at the moment, I think, since this is the current minor release.