It is extremely slow during loading files

habtie-phys commented 2 years ago

I like your package very much. In particular, it can read string data types stored in mat files without any problem, unlike the h5py package. But it is extremely slow during loading. For example, h5py takes only 0.6 s to load a mat file, while mat73 takes about 5 s to load a similar mat file. Could you fix this issue?

skjerns commented 2 years ago

Can you supply a sample file that I can use to benchmark and the code you are using to load it in h5py and mat73?

I can see what can be done but I'm not optimistic I can optimize the code a lot.
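A benchmark of the kind requested here could be sketched as follows. This is a minimal, stdlib-only sketch: `time_loader` and `slowdown` are hypothetical helper names, not part of mat73 or h5py, and the loader is passed in as a plain callable so the same harness can time either library.

```python
import time
from statistics import median


def time_loader(loader, path, repeats=3):
    """Return the median wall-clock time in seconds of loader(path)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        timings.append(time.perf_counter() - start)
    return median(timings)


def slowdown(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


# Figures quoted in this thread: h5py 0.6 s, mat73 about 5 s.
print(f"{slowdown(0.6, 5.0):.1f}x")
```

With both packages installed, `time_loader(mat73.loadmat, "data.mat")` would time mat73 (the file name is a placeholder). For h5py, the loader should be a function that opens the file and actually reads every dataset, because `h5py.File` on its own only opens the file and reads data lazily, which would make the comparison unfair.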

habtie-phys commented 2 years ago

Sorry for the late reply. I have attached a test notebook. I measured the times by running it in the VS Code notebook extension. Attachment: loadmat.zip

skjerns commented 2 years ago

Thanks! That is indeed slow. I ran it through the profiler, and the way I check whether something is an h5py reference (e.g. cell objects) was suboptimal. It should be ~50% faster now. h5py is much faster because it does not do many conversions and type checks in the background. Most of the overhead in mat73 is just converting data to NumPy arrays, lists, and dicts.

Wrote profile results to c:/users/simon/.spyder-py3/lineprofiler.results
Timer unit: 1e-06 s

Total time: 2.72264 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: mat2dict at line 47

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    47                                               @profile
    48                                               def mat2dict(self, hdf5, only_load=None):
    49         1         77.3     77.3      0.0          if '#refs#' in hdf5: 
    50         1         44.6     44.6      0.0              self.refs = hdf5['#refs#']
    51         1          0.9      0.9      0.0          d = self._dict_class()
    52         3         30.0     10.0      0.0          for var in hdf5:
    53         2          1.2      0.6      0.0              if var in ['#refs#','#subsystem#']:
    54         1          0.4      0.4      0.0                  continue
    55         1         21.8     21.8      0.0              ext = os.path.splitext(hdf5.filename)[1].lower()
    56         1          0.6      0.6      0.0              if ext.lower()=='.mat':
    57                                                           # if hdf5
    58         1    2722466.9 2722466.9    100.0                  d[var] = self.unpack_mat(hdf5[var])
    59                                                       elif ext=='.h5' or ext=='.hdf5':
    60                                                           err = 'Can only load .mat. Please use package hdfdict instead'\
    61                                                                 '\npip install hdfdict\n' \
    62                                                                 'https://github.com/SiggiGue/hdfdict'
    63                                                           raise NotImplementedError(err)
    64                                                       else:
    65                                                           raise ValueError('can only unpack .mat')
    66         1          0.4      0.4      0.0          return d

Total time: 2.54645 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: unpack_mat at line 68

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    68                                               @profile
    69                                               def unpack_mat(self, hdf5, depth=0):
    70                                                   """
    71                                                   unpack a h5py entry: if it's a group expand,
    72                                                   if it's a dataset convert
    73                                                   
    74                                                   for safety reasons, the depth cannot be larger than 99
    75                                                   """
    76      6019       6274.0      1.0      0.2          if depth==99:
    77                                                       raise RecursionError("Maximum number of 99 recursions reached.")
    78      6019      30971.6      5.1      1.2          if isinstance(hdf5, (h5py._hl.group.Group)):
    79      1191       1480.6      1.2      0.1              d = self._dict_class()
    80                                           
    81      6014      41779.8      6.9      1.6              for key in hdf5:
    82      4823     504432.7    104.6     19.8                  matlab_class = hdf5[key].attrs.get('MATLAB_class')
    83      4823     220669.1     45.8      8.7                  elem   = hdf5[key]
    84      4823      11982.9      2.5      0.5                  unpacked = self.unpack_mat(elem, depth=depth+1)
    85      4823       5594.8      1.2      0.2                  if matlab_class==b'struct' and len(elem)>1 and \
    86                                                           isinstance(unpacked, dict):
    87                                                               values = unpacked.values()
    88                                                               # we can only pack them together in MATLAB style if
    89                                                               # all subitems are the same lengths.
    90                                                               # MATLAB is a bit confusing here, and I hope this is
    91                                                               # correct. see https://github.com/skjerns/mat7.3/issues/6
    92                                                               allist = all([isinstance(item, list) for item in values])
    93                                                               if allist:
    94                                                                   same_len = len(set([len(item) for item in values]))==1
    95                                                               else:
    96                                                                   same_len = False
    97                                           
    98                                                               # convert struct to its proper form as in MATLAB
    99                                                               # i.e. struct[0]['key'] will access the elements
   100                                                               # we only recreate the MATLAB style struct
   101                                                               # if all the subelements have the same length
   102                                                               # and are of type list
   103                                                               if allist and same_len:
   104                                                                   items = list(zip(*[v for v in values]))
   105                                           
   106                                                                   keys = unpacked.keys()
   107                                                                   struct = [{k:v for v,k in zip(row, keys)} for row in items]
   108                                                                   struct = [self._dict_class(d) for d in struct]
   109                                                                   unpacked = struct
   110      4823       5442.7      1.1      0.2                  d[key] = unpacked
   111                                           
   112      1191       1156.7      1.0      0.0              return d
   113      4828       6112.2      1.3      0.2          elif isinstance(hdf5, h5py._hl.dataset.Dataset):
   114      4828    1710549.7    354.3     67.2              return self.convert_mat(hdf5, depth)
   115                                                   else:
   116                                                       raise Exception(f'Unknown hdf5 type: {key}:{type(hdf5)}')

Total time: 0.62705 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: _has_refs at line 118

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   118                                               @profile
   119                                               def _has_refs(self, dataset):
   120      4828      30680.6      6.4      4.9          if len(dataset.shape)<2: return False
   121      4816     593462.3    123.2     94.6          if isinstance(dataset[0][0], h5py.h5r.Reference):  
   122         1          0.6      0.6      0.0              return True
   123      4815       2906.1      0.6      0.5          return False

Total time: 1.90248 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: convert_mat at line 125

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   125                                               @profile
   126                                               def convert_mat(self, dataset, depth):
   127                                                   """
   128                                                   Converts h5py.dataset into python native datatypes
   129                                                   according to the matlab class annotation
   130                                                   """      
   131                                                   # all MATLAB variables have the attribute MATLAB_class
   132                                                   # if this is not present, it is not convertible
   133      4828      60618.9     12.6      3.2          if not 'MATLAB_class' in dataset.attrs and not self._has_refs(dataset):
   134                                                       if self.verbose:
   135                                                           message = 'ERROR: not a MATLAB datatype: ' + \
   136                                                                     '{}, ({})'.format(dataset, dataset.dtype)
   137                                                           logging.error(message)
   138                                                       return None
   139                                           
   140      4828     651002.7    134.8     34.2          if self._has_refs(dataset):
   141         1          1.2      1.2      0.0              mtype='cell'
   142      4827      80533.7     16.7      4.2          elif 'MATLAB_empty' in dataset.attrs.keys() and \
   143        12        435.8     36.3      0.0              dataset.attrs['MATLAB_class'].decode()in ['cell', 'struct']:
   144                                                       mtype = 'empty'
   145                                                   else:
   146      4827     178374.9     37.0      9.4              mtype = dataset.attrs['MATLAB_class'].decode()
   147                                           
   148                                           
   149      4828       6322.1      1.3      0.3          if mtype=='cell':
   150         1          1.1      1.1      0.0              cell = []
   151       240      28128.0    117.2      1.5              for ref in dataset:
   152       239        345.6      1.4      0.0                  row = []
   153                                                           # some weird style MATLAB have no refs, but direct floats or int
   154       239       1743.3      7.3      0.1                  if isinstance(ref, Iterable):
   155      1434       2027.3      1.4      0.1                      for r in ref:
   156      1195     273473.3    228.8     14.4                          entry = self.unpack_mat(self.refs.get(r), depth+1)
   157      1195       1798.3      1.5      0.1                          row.append(entry)
   158                                                           else:
   159                                                               row = [ref]
   160       239        298.2      1.2      0.0                  cell.append(row)
   161         1         67.0     67.0      0.0              cell = list(map(list, zip(*cell))) # transpose cell
   162         1          1.4      1.4      0.0              if len(cell)==1: # singular cells are interpreted as int/float
   163                                                           cell = cell[0]
   164         1          1.0      1.0      0.0              return cell
   165                                           
   166      4827       5042.0      1.0      0.3          elif mtype=='empty':
   167                                                       dims = [x for x in dataset]
   168                                                       return empty(*dims)
   169                                           
   170      4827       5072.8      1.1      0.3          elif mtype=='char': 
   171      2387     223625.0     93.7     11.8              string_array = np.array(dataset).ravel()
   172      2387      87026.0     36.5      4.6              string_array = ''.join([chr(x) for x in string_array])
   173      2387       3942.6      1.7      0.2              string_array = string_array.replace('\x00', '')
   174      2387       2491.1      1.0      0.1              return string_array
   175                                           
   176      2440       2524.9      1.0      0.1          elif mtype=='bool':
   177                                                       return bool(dataset)
   178                                           
   179      2440       2556.8      1.0      0.1          elif mtype=='logical': 
   180         2        221.8    110.9      0.0              arr = np.array(dataset, dtype=bool).T.squeeze()
   181         2          7.5      3.8      0.0              if arr.size==1: arr=bool(arr)
   182         2          2.0      1.0      0.0              return arr
   183                                           
   184      2438       2561.8      1.1      0.1          elif mtype=='canonical empty': 
   185         5          5.0      1.0      0.0              return None
   186                                           
   187                                                   # complex numbers need to be filtered out separately
   188      2433      40604.8     16.7      2.1          elif 'imag' in str(dataset.dtype):
   189                                                       if dataset.attrs['MATLAB_class']==b'single':
   190                                                           dtype = np.complex64 
   191                                                       else:
   192                                                           dtype = np.complex128
   193                                                       arr = np.array(dataset)
   194                                                       arr = (arr['real'] + arr['imag']*1j).astype(dtype)
   195                                                       return arr.T.squeeze()
   196                                           
   197                                                   # if it is none of the above, we can convert to numpy array
   198      2433       2723.7      1.1      0.1          elif mtype in ('double', 'single', 'int8', 'int16', 'int32', 'int64', 
   199      2433       2837.3      1.2      0.1                         'uint8', 'uint16', 'uint32', 'uint64'):
   200      2433     230346.9     94.7     12.1              arr = np.array(dataset, dtype=dataset.dtype)
   201      2433       5716.7      2.3      0.3              return arr.T.squeeze()
   202                                                   elif mtype=='missing':
   203                                                       arr = None
   204                                                   else:
   205                                                       if self.verbose:
   206                                                           message = 'ERROR: MATLAB type not supported: ' + \
   207                                                                     '{}, ({})'.format(mtype, dataset.dtype)
   208                                                           logging.error(message)
   209                                                       return None
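The tables above appear to come from a line profiler. For anyone who wants a coarser, function-level view without extra dependencies, the standard library's `cProfile` can be used instead. The sketch below swaps in that technique; `profile_call` and `busy` are hypothetical names, and `busy` is only a stand-in workload for the real loading call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def busy(n):
    # Stand-in workload; replace with the real loading call.
    return sum(i * i for i in range(n))


result, report = profile_call(busy, 1000)
print(report)
```

Sorting by cumulative time puts functions such as `unpack_mat` and `convert_mat` near the top of the report when the real loader is profiled, which gives a quick way to confirm where the time goes before drilling down line by line.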

habtie-phys commented 2 years ago

Thank you for the detailed explanation. I will check out the updated version.

skjerns commented 2 years ago

Perfect. I made some further improvements to the MATLAB type decision logic, so it should be considerably faster now. I think I have also found a few more points for improvement, but it is already 60-70% faster.
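One possible candidate for such further improvement, visible in the profile above, is the `char` branch of `convert_mat`, which builds each string with one `chr()` call per character (about 4.6% of that function's time). The sketch below is an illustration only, not the maintainer's change: it assumes char data arrives as UTF-16 code units (uint16), and `decode_matlab_chars` and `decode_with_chr` are hypothetical names. Whether it is actually faster has not been measured here.

```python
import numpy as np


def decode_matlab_chars(codes):
    """Decode an array of UTF-16 code units into a str, dropping NUL padding."""
    # Force little-endian uint16 so the byte layout matches the codec below.
    units = np.asarray(codes).ravel().astype("<u2")
    return units.tobytes().decode("utf-16-le").replace("\x00", "")


def decode_with_chr(codes):
    """Reference version mirroring the chr()-based loop in the profile."""
    flat = np.asarray(codes).ravel()
    return "".join(chr(x) for x in flat).replace("\x00", "")


# Column vector holding "mat" plus one NUL code unit.
sample = np.array([[109], [97], [116], [0]], dtype=np.uint16)
assert decode_matlab_chars(sample) == decode_with_chr(sample) == "mat"
```

The two versions agree for characters in the Basic Multilingual Plane. They differ for surrogate pairs, which the codec combines into one character while `chr()` keeps them as separate code units, so a swap like this would need testing against real files.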

skjerns / mat7.3

It is extremely slow during loading files #23