Closed habtie-phys closed 2 years ago
Can you supply a sample file that I can use to benchmark and the code you are using to load it in h5py and mat73?
I can see what can be done but I'm not optimistic I can optimize the code a lot.
On 23.11.2021 at 21:17, Habtamu Wubie Tesfaw @.***> wrote:
I like your package very much. In particular, I find that it can read string data types stored in mat files without any problem, unlike the h5py package. But it is extremely slow during the loading stage. As an example, while h5py takes only 0.6 s to load a mat file, yours takes about 5 s to load a similar mat file. Could you fix this issue?
Sorry for the late reply. I have attached a test notebook. I measured the times by running it in the VS Code notebook extension. loadmat.zip
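For anyone who wants to reproduce the comparison outside a notebook, here is a minimal timing sketch. `time_loader` is a hypothetical helper, not part of mat73 or h5py. It accepts any callable that loads a file, so it makes no assumption about a particular library API.

```python
import statistics
import time


def time_loader(load_fn, path, repeats=5):
    """Call load_fn(path) `repeats` times; return (best, median) seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn(path)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)
```

With mat73 installed, a call such as `time_loader(mat73.loadmat, 'data.mat')` would time the full load (`'data.mat'` is a placeholder name). Note that merely opening a file with `h5py.File` reads no dataset contents, so for a fair comparison the h5py side should also read the data it needs.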
Thanks! That is indeed slow. I ran it through the profiler, and the way I check whether something is an h5py reference (e.g. cell objects) was suboptimal. It should be ~50% faster now. h5py is much faster because it does not do many conversions and type checks in the background. Most of the overhead is just converting things to numpy arrays, lists and dicts.
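To illustrate the kind of reference check discussed here: the sketch below is not the actual commit, only one possible approach. It asks h5py whether the dataset's dtype is a reference type instead of reading an element from the file. `has_refs_by_dtype` is a hypothetical name. `h5py.check_ref_dtype` and `h5py.ref_dtype` are assumed to be available, which requires h5py 2.10 or newer.

```python
import h5py


def has_refs_by_dtype(dataset):
    """Return True if the dataset's dtype is an HDF5 reference type.

    Only dtype metadata is inspected, so no element is read from the file.
    """
    return h5py.check_ref_dtype(dataset.dtype) is not None


# Small in-memory demo file (the core driver never writes it to disk).
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    numbers = f.create_dataset("numbers", data=[[1.0, 2.0]])
    cell = f.create_dataset("cell", (1, 1), dtype=h5py.ref_dtype)
    cell[0, 0] = numbers.ref  # store an object reference, like a MATLAB cell
    numbers_has_refs = has_refs_by_dtype(numbers)
    cell_has_refs = has_refs_by_dtype(cell)
```

The design point is that the dtype is known as soon as the dataset is opened, whereas indexing with `dataset[0][0]` triggers a read for every dataset that is checked.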
Wrote profile results to c:/users/simon/.spyder-py3/lineprofiler.results
Timer unit: 1e-06 s
Total time: 2.72264 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: mat2dict at line 47
Line # Hits Time Per Hit % Time Line Contents
==============================================================
47 @profile
48 def mat2dict(self, hdf5, only_load=None):
49 1 77.3 77.3 0.0 if '#refs#' in hdf5:
50 1 44.6 44.6 0.0 self.refs = hdf5['#refs#']
51 1 0.9 0.9 0.0 d = self._dict_class()
52 3 30.0 10.0 0.0 for var in hdf5:
53 2 1.2 0.6 0.0 if var in ['#refs#','#subsystem#']:
54 1 0.4 0.4 0.0 continue
55 1 21.8 21.8 0.0 ext = os.path.splitext(hdf5.filename)[1].lower()
56 1 0.6 0.6 0.0 if ext.lower()=='.mat':
57 # if hdf5
58 1 2722466.9 2722466.9 100.0 d[var] = self.unpack_mat(hdf5[var])
59 elif ext=='.h5' or ext=='.hdf5':
60 err = 'Can only load .mat. Please use package hdfdict instead'\
61 '\npip install hdfdict\n' \
62 'https://github.com/SiggiGue/hdfdict'
63 raise NotImplementedError(err)
64 else:
65 raise ValueError('can only unpack .mat')
66 1 0.4 0.4 0.0 return d
Total time: 2.54645 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: unpack_mat at line 68
Line # Hits Time Per Hit % Time Line Contents
==============================================================
68 @profile
69 def unpack_mat(self, hdf5, depth=0):
70 """
71 unpack a h5py entry: if it's a group expand,
72 if it's a dataset convert
73
74 for safety reasons, the depth cannot be larger than 99
75 """
76 6019 6274.0 1.0 0.2 if depth==99:
77 raise RecursionError("Maximum number of 99 recursions reached.")
78 6019 30971.6 5.1 1.2 if isinstance(hdf5, (h5py._hl.group.Group)):
79 1191 1480.6 1.2 0.1 d = self._dict_class()
80
81 6014 41779.8 6.9 1.6 for key in hdf5:
82 4823 504432.7 104.6 19.8 matlab_class = hdf5[key].attrs.get('MATLAB_class')
83 4823 220669.1 45.8 8.7 elem = hdf5[key]
84 4823 11982.9 2.5 0.5 unpacked = self.unpack_mat(elem, depth=depth+1)
85 4823 5594.8 1.2 0.2 if matlab_class==b'struct' and len(elem)>1 and \
86 isinstance(unpacked, dict):
87 values = unpacked.values()
88 # we can only pack them together in MATLAB style if
89 # all subitems are the same lengths.
90 # MATLAB is a bit confusing here, and I hope this is
91 # correct. see https://github.com/skjerns/mat7.3/issues/6
92 allist = all([isinstance(item, list) for item in values])
93 if allist:
94 same_len = len(set([len(item) for item in values]))==1
95 else:
96 same_len = False
97
98 # convert struct to its proper form as in MATLAB
99 # i.e. struct[0]['key'] will access the elements
100 # we only recreate the MATLAB style struct
101 # if all the subelements have the same length
102 # and are of type list
103 if allist and same_len:
104 items = list(zip(*[v for v in values]))
105
106 keys = unpacked.keys()
107 struct = [{k:v for v,k in zip(row, keys)} for row in items]
108 struct = [self._dict_class(d) for d in struct]
109 unpacked = struct
110 4823 5442.7 1.1 0.2 d[key] = unpacked
111
112 1191 1156.7 1.0 0.0 return d
113 4828 6112.2 1.3 0.2 elif isinstance(hdf5, h5py._hl.dataset.Dataset):
114 4828 1710549.7 354.3 67.2 return self.convert_mat(hdf5, depth)
115 else:
116 raise Exception(f'Unknown hdf5 type: {key}:{type(hdf5)}')
Total time: 0.62705 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: _has_refs at line 118
Line # Hits Time Per Hit % Time Line Contents
==============================================================
118 @profile
119 def _has_refs(self, dataset):
120 4828 30680.6 6.4 4.9 if len(dataset.shape)<2: return False
121 4816 593462.3 123.2 94.6 if isinstance(dataset[0][0], h5py.h5r.Reference):
122 1 0.6 0.6 0.0 return True
123 4815 2906.1 0.6 0.5 return False
Total time: 1.90248 s
File: C:/Users/Simon/Desktop/mat7.3/mat73/__init__.py
Function: convert_mat at line 125
Line # Hits Time Per Hit % Time Line Contents
==============================================================
125 @profile
126 def convert_mat(self, dataset, depth):
127 """
128 Converts h5py.dataset into python native datatypes
129 according to the matlab class annotation
130 """
131 # all MATLAB variables have the attribute MATLAB_class
132 # if this is not present, it is not convertible
133 4828 60618.9 12.6 3.2 if not 'MATLAB_class' in dataset.attrs and not self._has_refs(dataset):
134 if self.verbose:
135 message = 'ERROR: not a MATLAB datatype: ' + \
136 '{}, ({})'.format(dataset, dataset.dtype)
137 logging.error(message)
138 return None
139
140 4828 651002.7 134.8 34.2 if self._has_refs(dataset):
141 1 1.2 1.2 0.0 mtype='cell'
142 4827 80533.7 16.7 4.2 elif 'MATLAB_empty' in dataset.attrs.keys() and \
143 12 435.8 36.3 0.0 dataset.attrs['MATLAB_class'].decode()in ['cell', 'struct']:
144 mtype = 'empty'
145 else:
146 4827 178374.9 37.0 9.4 mtype = dataset.attrs['MATLAB_class'].decode()
147
148
149 4828 6322.1 1.3 0.3 if mtype=='cell':
150 1 1.1 1.1 0.0 cell = []
151 240 28128.0 117.2 1.5 for ref in dataset:
152 239 345.6 1.4 0.0 row = []
153 # some weird style MATLAB have no refs, but direct floats or int
154 239 1743.3 7.3 0.1 if isinstance(ref, Iterable):
155 1434 2027.3 1.4 0.1 for r in ref:
156 1195 273473.3 228.8 14.4 entry = self.unpack_mat(self.refs.get(r), depth+1)
157 1195 1798.3 1.5 0.1 row.append(entry)
158 else:
159 row = [ref]
160 239 298.2 1.2 0.0 cell.append(row)
161 1 67.0 67.0 0.0 cell = list(map(list, zip(*cell))) # transpose cell
162 1 1.4 1.4 0.0 if len(cell)==1: # singular cells are interpreted as int/float
163 cell = cell[0]
164 1 1.0 1.0 0.0 return cell
165
166 4827 5042.0 1.0 0.3 elif mtype=='empty':
167 dims = [x for x in dataset]
168 return empty(*dims)
169
170 4827 5072.8 1.1 0.3 elif mtype=='char':
171 2387 223625.0 93.7 11.8 string_array = np.array(dataset).ravel()
172 2387 87026.0 36.5 4.6 string_array = ''.join([chr(x) for x in string_array])
173 2387 3942.6 1.7 0.2 string_array = string_array.replace('\x00', '')
174 2387 2491.1 1.0 0.1 return string_array
175
176 2440 2524.9 1.0 0.1 elif mtype=='bool':
177 return bool(dataset)
178
179 2440 2556.8 1.0 0.1 elif mtype=='logical':
180 2 221.8 110.9 0.0 arr = np.array(dataset, dtype=bool).T.squeeze()
181 2 7.5 3.8 0.0 if arr.size==1: arr=bool(arr)
182 2 2.0 1.0 0.0 return arr
183
184 2438 2561.8 1.1 0.1 elif mtype=='canonical empty':
185 5 5.0 1.0 0.0 return None
186
187 # complex numbers need to be filtered out separately
188 2433 40604.8 16.7 2.1 elif 'imag' in str(dataset.dtype):
189 if dataset.attrs['MATLAB_class']==b'single':
190 dtype = np.complex64
191 else:
192 dtype = np.complex128
193 arr = np.array(dataset)
194 arr = (arr['real'] + arr['imag']*1j).astype(dtype)
195 return arr.T.squeeze()
196
197 # if it is none of the above, we can convert to numpy array
198 2433 2723.7 1.1 0.1 elif mtype in ('double', 'single', 'int8', 'int16', 'int32', 'int64',
199 2433 2837.3 1.2 0.1 'uint8', 'uint16', 'uint32', 'uint64'):
200 2433 230346.9 94.7 12.1 arr = np.array(dataset, dtype=dataset.dtype)
201 2433 5716.7 2.3 0.3 return arr.T.squeeze()
202 elif mtype=='missing':
203 arr = None
204 else:
205 if self.verbose:
206 message = 'ERROR: MATLAB type not supported: ' + \
207 '{}, ({})'.format(mtype, dataset.dtype)
208 logging.error(message)
209 return None
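One more thing that can be read off the `unpack_mat` profile: lines 82 and 83 each evaluate `hdf5[key]`, and line 83 alone costs about 46 µs per hit (8.7% of the function). Below is a sketch of fetching the element once and reusing it. `CountingGroup` and both functions are hypothetical stand-ins so the example runs without an HDF5 file; they are not mat73 code.

```python
from types import SimpleNamespace


class CountingGroup(dict):
    """Hypothetical stand-in for an h5py group that counts item lookups."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)


def classes_two_lookups(group):
    """Pattern seen in the profile: group[key] is evaluated twice per key."""
    out = {}
    for key in group:
        matlab_class = group[key].attrs.get('MATLAB_class')
        elem = group[key]
        out[key] = (matlab_class, elem)
    return out


def classes_one_lookup(group):
    """Same result, but each element is fetched only once."""
    out = {}
    for key in group:
        elem = group[key]
        out[key] = (elem.attrs.get('MATLAB_class'), elem)
    return out


def make_group():
    return CountingGroup(
        a=SimpleNamespace(attrs={'MATLAB_class': b'double'}),
        b=SimpleNamespace(attrs={'MATLAB_class': b'char'}),
    )


g_twice, g_once = make_group(), make_group()
result_twice = classes_two_lookups(g_twice)
result_once = classes_one_lookup(g_once)
```

Here `g_twice.lookups` ends up at 4 and `g_once.lookups` at 2 for the same two keys. How much this saves on a real file depends on how expensive the h5py lookup is.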
Thank you for the detailed explanation. I will check out the updated version.
Perfect. I made some further improvements to the MATLAB type decision logic, so it should now be considerably faster still. I think I have found a few more points for improvement, but it is already 60-70% faster now.
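The actual change is not shown in this thread. As a sketch of what a leaner type decision could look like, assuming the same branching as lines 133 to 146 of the profiled `convert_mat`, the function below reads the `MATLAB_class` attribute once and decodes it once. `decide_mtype` is a hypothetical name, and `attrs` can be any mapping, such as a dataset's `attrs`.

```python
def decide_mtype(attrs, has_refs):
    """Return the MATLAB type name, or None if the dataset is not convertible.

    attrs:    mapping of HDF5 attributes (a plain dict works for testing)
    has_refs: whether the dataset holds object references (i.e. is a cell)
    """
    if has_refs:
        return 'cell'
    raw = attrs.get('MATLAB_class')  # single attribute read
    if raw is None:
        return None  # no MATLAB_class and no references: not a MATLAB type
    mtype = raw.decode()  # single decode, reused below
    if 'MATLAB_empty' in attrs and mtype in ('cell', 'struct'):
        return 'empty'
    return mtype
```

For example, `decide_mtype({'MATLAB_class': b'double'}, False)` gives `'double'`, and adding a `MATLAB_empty` attribute to a `b'struct'` dataset gives `'empty'`.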
I like your package very much. Especially, I found that it is capable of reading string data types stored in mat files without any problem, unlike the h5py package. But it is extremely slow during the loading stage. As an example, while the h5py package takes only 0.6 s to load a mat file, yours takes about 5 s to load a similar mat file. Could you fix this issue?