And holy crap, did I mention Python is using 1.4 GB? God knows what for.
Given that it crashes on other measurements too, I believe the problem lies in the changes I've made to how Zebris files are loaded.
I get the same error when trying to load the data in an IPython Notebook. The most likely cause seems to be that I create an empty array with the shape of the input data. But that shape is (128, 56, 2757), which uses a ton of memory once you have 17+ steps...
By limiting the empty array to the length of each contact, we drastically reduce the memory usage.
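A quick back-of-the-envelope sketch (assuming float64, NumPy's default dtype for np.zeros) shows why that shape hurts:

import numpy as np

# One empty array with the full measurement shape
full = np.zeros((128, 56, 2757))                # float64 by default
per_array_mb = full.nbytes / 1024 ** 2
print("per array: {:.0f} MB".format(per_array_mb))              # ~151 MB

# If 17+ contacts each keep a full-shape copy around, it explodes
print("17 copies: {:.1f} GB".format(17 * per_array_mb / 1024))  # ~2.5 GB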
If I do some more bookkeeping, I could probably reduce the memory usage even further by allocating an array of the size the paw will eventually be. However, this means I have to subtract some kind of offset, namely the origin of the contact slice within the plate, from each sensor I'm trying to copy to the new array; see the sketch below.
It might still be worth it memory-wise, though it'll be a bit more computationally expensive.
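A minimal sketch of that bookkeeping (hypothetical helper; the real offsets would be the origin of the contact's bounding box within the plate):

import numpy as np

def copy_contact(measurement_data, min_x, max_x, min_y, max_y, frames):
    # Allocate only the bounding box of the contact, not the full plate
    width = max_x - min_x + 1
    height = max_y - min_y + 1
    data = np.zeros((width, height, len(frames)), np.float32)
    for index, frame in enumerate(frames):
        # Subtract the offset (min_x, min_y) to map plate coordinates
        # onto the smaller array
        data[:, :, index] = measurement_data[min_x:max_x + 1,
                                             min_y:max_y + 1,
                                             frame]
    return data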
Well, this is interesting: given the changes I made to reduce memory usage, it's at least no longer as bad as it was. When I add the measurements to PyTables, you get a spike in memory usage, because the large files are being loaded.
But once I start calculating average contacts over treadmill measurements, memory usage jumps to 650 MB. For example, RawDataGaitAnalysis contains 27(!) footprints...
Perhaps I should consider calculating a true average instead of keeping all contacts in memory, because when I switch measurements, usage jumps to 1 GB+ and causes a memory error. This is possibly due to the change I just made, where load_contacts is called whenever a measurement is selected: we redo the work every time, while the memory doesn't seem to be released.
I rewrote calculate_average so that it doesn't create a 4D array (basically [n_contacts, max_x, max_y, max_z]), but computes a 'true' average by summing over all contacts and then multiplying each pixel by 1/n_contacts. That way I don't have to keep additional copies of the contacts in memory; they can be discarded after they have been processed.
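The rewrite boils down to something like this sketch (hypothetical helper; smaller contacts are summed into the corner of a common-shape accumulator):

import numpy as np

def true_average(contacts, shape):
    # Sum every contact into a single accumulator instead of stacking
    # them into a 4D [n_contacts, mx, my, mz] array
    total = np.zeros(shape)
    for data in contacts:
        x, y, z = data.shape
        total[:x, :y, :z] += data   # the contact can be discarded after this
    return total * (1.0 / len(contacts))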
Sadly, switching between measurements still blows my memory usage out of the water (1.4 GB!), so there must be something fishy going on. Interestingly, it doesn't seem to be related to calculating averages.
Here's a memory profile (using https://github.com/fabianp/memory_profiler). It shows that calculate_average is indeed not the problem; it doesn't increase the memory one bit, woohoo! The problem is the call to track_contacts, which gives a bump in memory of 525 MB.
Looking at it, the specific line causing problems is: for index, raw_contact in enumerate(raw_contacts):
Which is odd, given that raw_contacts is just a list of dictionaries containing contours. Perhaps the increment covers everything that happens inside the loop rather than the for line itself, since that line is re-executed (and re-measured) on every iteration. Guess I'll have to investigate!
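A toy example (not from the project) consistent with that guess: memory_profiler samples at each line event, so growth caused by the loop body can get attributed to the for line, which is the next line measured on every iteration. The actual profile follows below.

from memory_profiler import profile

@profile
def grow():
    chunks = []
    for i in range(10):                          # body allocations tend to land here
        chunks.append(bytearray(10 * 1024 ** 2)) # ~10 MB per iteration
    return chunks

if __name__ == "__main__":
    grow()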
C:\Dropbox\Development\Pawlabeling>python -m memory_profiler pawlabeling.py
Filename: pawlabeling\models\model.py
Line # Mem usage Increment Line Contents
================================================
308 @profile
309 def calculate_average(self):
310 660.465 MB 0.000 MB # Empty average measurement_data
311 660.805 MB 0.340 MB self.average_data.clear()
312 660.805 MB 0.000 MB data_list = defaultdict(list)
313
314 660.805 MB 0.000 MB mx = 0
315 660.805 MB 0.000 MB my = 0
316 660.477 MB -0.328 MB mz = 0
317 # Group all the measurement_data per contact
318 660.465 MB -0.012 MB for measurement_name, contacts in self.contacts.items():
319 660.805 MB 0.340 MB for contact in contacts:
320 660.805 MB 0.000 MB contact_label = contact.contact_label
321 660.805 MB 0.000 MB if contact_label >= 0:
322 data_list[contact_label].append(contact.data)
323 x, y, z = contact.data.shape
324 if x > mx:
325 mx = x
326 if y > my:
327 my = y
328 if z > mz:
329 mz = z
330
331 660.805 MB 0.000 MB shape = (mx, my, mz)
332 # Then get the normalized measurement_data
333 660.805 MB 0.000 MB for contact_label, data in data_list.items():
334 normalized_data = utility.calculate_average_data(data, shape)
335 self.average_data[contact_label] = normalized_data
Filename: pawlabeling\models\model.py
Line # Mem usage Increment Line Contents
================================================
228 @profile
229 def load_contacts(self):
230 """
231 Check if any measurements for this subject have already been processed
232 If so, retrieve the measurement_data and convert them to a usable format
233 135.098 MB 0.000 MB """
234 135.102 MB 0.004 MB self.logger.info("Model.load_contacts: Loading all measurements for subject: {}, session: {}".format(
235 135.102 MB 0.000 MB self.subject_name, self.session["session_name"]))
236
237 # Make sure self.contacts is empty
238 135.102 MB 0.000 MB self.contacts.clear()
239 135.102 MB 0.000 MB self.n_max = 0
240
241 135.102 MB 0.000 MB measurement_names = {}
242 135.125 MB 0.023 MB for m in self.measurements_table.measurements_table:
243 135.125 MB 0.000 MB measurement_names[m["measurement_id"]] = m["measurement_name"]
244 135.125 MB 0.000 MB n_max = m["maximum_value"]
245 135.109 MB -0.016 MB if n_max > self.n_max:
246 135.109 MB 0.000 MB self.n_max = n_max
247
248 135.156 MB 0.047 MB contacts = self.get_contact_data(m)
249 135.156 MB 0.000 MB if contacts:
250 self.contacts[m["measurement_name"]] = contacts
251
252 135.156 MB 0.000 MB if self.measurement_name not in self.contacts:
253 660.465 MB 525.309 MB self.contacts[self.measurement_name] = self.track_contacts()
254
255 # Calculate the highest n_max and publish that
256 660.465 MB 0.000 MB pub.sendMessage("update_n_max", n_max=self.n_max)
257 660.465 MB 0.000 MB pub.sendMessage("update_contacts", contacts=self.contacts)
258 # Calculate the average, after everything has been loaded
259 660.465 MB 0.000 MB self.calculate_average()
260 # These two messages could pretty much be consolidated, possibly even the one above
261 660.477 MB 0.012 MB pub.sendMessage("processing_results", contacts=self.contacts, average_data=self.average_data)
262 660.809 MB 0.332 MB pub.sendMessage("update_contacts_tree", contacts=self.contacts)
Filename: pawlabeling\models\model.py
Line # Mem usage Increment Line Contents
================================================
264 @profile
265 135.156 MB 0.000 MB def track_contacts(self):
266 135.160 MB 0.004 MB pub.sendMessage("update_statusbar", status="Starting tracking")
267 # Add padding to the measurement
268 135.160 MB 0.000 MB x = self.measurement["number_of_rows"]
269 135.160 MB 0.000 MB y = self.measurement["number_of_cols"]
270 135.160 MB 0.000 MB z = self.measurement["number_of_frames"]
271 135.160 MB 0.000 MB padding = configuration.padding_factor
272 177.586 MB 42.426 MB data = np.zeros((x + 2 * padding, y + 2 * padding, z), np.float32)
273 177.586 MB 0.000 MB data[padding:-padding, padding:-padding, :] = self.measurement_data
274 177.840 MB 0.254 MB raw_contacts = tracking.track_contours_graph(data)
275
276 177.840 MB 0.000 MB contacts = []
277 # Convert them to class objects
278 662.496 MB 484.656 MB for index, raw_contact in enumerate(raw_contacts):
279 662.496 MB 0.000 MB contact = contactmodel.Contact()
280 702.859 MB 40.363 MB contact.create_contact(contact=raw_contact, measurement_data=self.measurement_data, padding=padding)
281 702.887 MB 0.027 MB contact.calculate_results()
282 # Give each contact the same orientation as the measurement it originates from
283 702.887 MB 0.000 MB contact.set_orientation(self.measurement["orientation"])
284 # Skip contacts that have only been around for one frame
285 702.887 MB 0.000 MB if len(contact.frames) > 1:
286 702.887 MB 0.000 MB contacts.append(contact)
287
288 # Sort the contacts based on their position along the first dimension
289 702.887 MB 0.000 MB contacts = sorted(contacts, key=lambda contact: contact.min_z)
290 # Update their index
291 702.887 MB 0.000 MB for contact_id, contact in enumerate(contacts):
292 702.887 MB 0.000 MB contact.set_contact_id(contact_id)
293
294 702.887 MB 0.000 MB status = "Number of contacts found: {}".format(len(contacts))
295 702.887 MB 0.000 MB pub.sendMessage("update_statusbar", status=status)
296 702.887 MB 0.000 MB return contacts
Well, would you look at that. I had already identified that pre-allocating memory using the full measurement_data shape was inefficient, and the above shows just how right that was. Here's the profile of the slightly modified version:
C:\Dropbox\Development\Pawlabeling>python -m memory_profiler pawlabeling.py
Filename: pawlabeling\models\contactmodel.py
Line # Mem usage Increment Line Contents
================================================
79 @profile
80 def convert_contour_to_slice(self, measurement_data):
81 """
82 Creates self.measurement_data which contains the pixels that are enclosed by the contour
83 """
84 # Create an empty array that should fit the entire contact
85 177.609 MB 0.000 MB # TODO reduce this size to the size of the bounding box of the contact (width, length, duration)
86 181.230 MB 3.621 MB self.data = np.zeros((self.width, self.height, self.length))
87
88 225.969 MB 44.738 MB for index, (frame, contours) in enumerate(self.contour_list.items()):
89 181.230 MB -44.738 MB # Pass a single contour as if it were a contact
90 225.969 MB 44.738 MB center, min_x, max_x, min_y, max_y = utility.update_bounding_box({frame: contours})
91 # Get the non_zero pixels coordinates for that frame
92 181.230 MB -44.738 MB pixels = np.transpose(np.nonzero(measurement_data[min_x:max_x + 1, min_y:max_y + 1, frame]))
93 # Check if they are in any of the contours
94 199.762 MB 18.531 MB for pixel in pixels:
95 225.969 MB 26.207 MB for contour in contours:
96 # Remember the coordinates are only for the slice, so we need to add padding
97 225.969 MB 0.000 MB coordinate = (min_x + pixel[0], min_y + pixel[1])
98 225.969 MB 0.000 MB if cv2.pointPolygonTest(contour, coordinate, 0) > -1.0:
99 225.969 MB 0.000 MB self.data[coordinate[0]-self.min_x, coordinate[1]-self.min_y, index] = measurement_data[
100 coordinate[0], coordinate[1], frame]
Filename: pawlabeling\models\model.py
Line # Mem usage Increment Line Contents
================================================
263 @profile
264 134.691 MB 0.000 MB def track_contacts(self):
265 134.703 MB 0.012 MB pub.sendMessage("update_statusbar", status="Starting tracking")
266 # Add padding to the measurement
267 134.703 MB 0.000 MB x = self.measurement["number_of_rows"]
268 134.703 MB 0.000 MB y = self.measurement["number_of_cols"]
269 134.703 MB 0.000 MB z = self.measurement["number_of_frames"]
270 134.703 MB 0.000 MB padding = configuration.padding_factor
271 177.129 MB 42.426 MB data = np.zeros((x + 2 * padding, y + 2 * padding, z), np.float32)
272 177.129 MB 0.000 MB data[padding:-padding, padding:-padding, :] = self.measurement_data
273 177.609 MB 0.480 MB raw_contacts = tracking.track_contours_graph(data)
274
275 177.609 MB 0.000 MB contacts = []
276 # Convert them to class objects
277 220.707 MB 43.098 MB for index, raw_contact in enumerate(raw_contacts):
278 220.707 MB 0.000 MB contact = contactmodel.Contact()
279 225.969 MB 5.262 MB contact.create_contact(contact=raw_contact, measurement_data=self.measurement_data, padding=padding)
280 225.973 MB 0.004 MB contact.calculate_results()
281 # Give each contact the same orientation as the measurement it originates from
282 225.973 MB 0.000 MB contact.set_orientation(self.measurement["orientation"])
283 # Skip contacts that have only been around for one frame
284 225.973 MB 0.000 MB if len(contact.frames) > 1:
285 225.973 MB 0.000 MB contacts.append(contact)
286
287 # Sort the contacts based on their position along the first dimension
288 225.973 MB 0.000 MB contacts = sorted(contacts, key=lambda contact: contact.min_z)
289 # Update their index
290 225.973 MB 0.000 MB for contact_id, contact in enumerate(contacts):
291 225.973 MB 0.000 MB contact.set_contact_id(contact_id)
292
293 225.973 MB 0.000 MB status = "Number of contacts found: {}".format(len(contacts))
294 225.973 MB 0.000 MB pub.sendMessage("update_statusbar", status=status)
295 225.973 MB 0.000 MB return contacts
Now creating 12 contacts (though there ought to be 27) only costs about 40 MB in total, or roughly 3.6 MB per contact. Compare that to 40 MB per contact before... So this memory problem seems to be gone, though clearing caches more diligently should win me a bit more.
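That clearing would amount to something like this sketch (assuming the caches are the dicts seen in the profiles; NumPy frees buffers as soon as the last reference dies, so gc.collect() is just a nudge for leftover reference cycles):

import gc

def clear_caches(model):
    # Drop the references so the underlying arrays can be freed
    model.contacts.clear()
    model.average_data.clear()
    gc.collect()  # break any remaining reference cycles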
Here's the traceback:
I don't really know what's causing the issue, but it seems to be reproducible. I hope it's not caused by me trying out 32-bit Python instead of 64-bit: a 32-bit process only gets roughly 2 GB of address space, so 1.4 GB of usage is already flirting with a MemoryError.
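For what it's worth, checking which interpreter is running is a one-liner:

import struct
print(struct.calcsize("P") * 8)  # pointer size in bits: 32 or 64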