ivoflipse / Pawlabeling

Tool for processing and analyzing pressure measurements

MemoryError when loading Zebris treadmill file #53

Closed ivoflipse closed 10 years ago

ivoflipse commented 10 years ago

Here's the traceback:

Traceback (most recent call last):
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\widgets\processing\processingwidget.py", line 146, in load_file
    pub.sendMessage("load_results", widget="processing")
  File "C:\Anaconda\lib\site-packages\pubsub\core\kwargs\publisher.py", line 30, in sendMessage
    topicObj.publish(**kwargs)
  File "C:\Anaconda\lib\site-packages\pubsub\core\kwargs\publishermixin.py", line 24, in publish
    self._publish(msgKwargs)
  File "C:\Anaconda\lib\site-packages\pubsub\core\topicobj.py", line 340, in _publish
    self.__sendMessage(data, self, iterState)
  File "C:\Anaconda\lib\site-packages\pubsub\core\topicobj.py", line 359, in __sendMessage
    self._mix_callListener(listener, data, iterState)
  File "C:\Anaconda\lib\site-packages\pubsub\core\kwargs\publishermixin.py", line 64, in _mix_callListener
    listener(iterState.filteredArgs, self, msgKwargs)
  File "C:\Anaconda\lib\site-packages\pubsub\core\kwargs\listenerimpl.py", line 27, in __call__
    cb(**kwargs)
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\models\model.py", line 85, in load_results
    self.load_all_results()
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\models\model.py", line 118, in load_all_results
    self.track_contacts()
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\models\model.py", line 141, in track_contacts
    paw.create_contact(contact=raw_paw, measurement_data=self.measurement, padding=1)
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\models\contactmodel.py", line 76, in create_contact
    self.convert_contour_to_slice(measurement_data)
  File "C:\Dropbox\Development\Pawlabeling\pawlabeling\models\contactmodel.py", line 87, in convert_contour_to_slice
    new_data = np.zeros_like(measurement_data)
  File "C:\Anaconda\lib\site-packages\numpy\core\numeric.py", line 116, in zeros_like
    res = empty_like(a, dtype=dtype, order=order, subok=subok)
MemoryError

I don't really know what's causing the issue, but it seems to be reproducible. I hope it's not caused by my trying out 32-bit Python instead of 64-bit.

ivoflipse commented 10 years ago

And holy crap, did I mention Python is using 1.4 GB? God knows what for.

Given that it crashes on other measurements too, I believe the problem lies in the changes I've made to how Zebris files are loaded.

ivoflipse commented 10 years ago

I get the same error when trying to load the data in an IPython Notebook. The most likely cause seems to be that I create an empty array with the shape of the input data. But that shape is (128, 56, 2757), so it uses a ton of memory if you have 17+ steps...
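For a sense of scale, here's a back-of-the-envelope calculation. The shape comes from the traceback above; the 8 bytes per value assumes np.zeros' default float64 dtype:

```python
# Rough memory cost of one full-size empty array per contact.
rows, cols, frames = 128, 56, 2757
bytes_per_value = 8  # np.zeros defaults to float64
per_contact_mb = rows * cols * frames * bytes_per_value / 1024 ** 2
print(per_contact_mb)  # ≈ 150 MB per full-size array
```

With 17+ contacts that adds up to well over 2.5 GB, more than a 32-bit Python process can even address.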

By limiting the empty array to the length of each contact, we can drastically reduce the memory usage.
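A minimal sketch of that idea (the measurement shape is from the comment above, but the data and the contact's frame range are made up for illustration; this isn't the project's actual code):

```python
import numpy as np

# Hypothetical full Zebris measurement, filled with dummy zeros.
measurement_data = np.zeros((128, 56, 2757), dtype=np.float32)

# Old approach: one full-size array for every contact
# new_data = np.zeros_like(measurement_data)

# New approach: only allocate the frames this contact actually spans.
min_z, max_z = 100, 140  # hypothetical first and last frame of a contact
new_data = np.zeros(
    measurement_data.shape[:2] + (max_z - min_z + 1,),
    dtype=measurement_data.dtype,
)
```

The per-contact array now scales with the contact's duration instead of the full measurement length.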

ivoflipse commented 10 years ago

If I do some more bookkeeping, I could probably reduce the memory usage even further by allocating an array of the size the paw will eventually be. However, this means I have to subtract an offset, namely the origin of the contact slice within the plate, from each sensor coordinate I copy to the new array.

It might still be worth it memory-wise, though it'll be a bit more computationally expensive.
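That bookkeeping could look roughly like this; a sketch with a smaller plate and made-up bounding-box coordinates, not the project's code:

```python
import numpy as np

# Hypothetical small plate: (rows, cols, frames), with two fake readings.
measurement_data = np.zeros((64, 32, 300), dtype=np.float32)
measurement_data[22, 7, 105] = 1.5
measurement_data[30, 15, 130] = 2.0

# Bounding box of the contact; (min_x, min_y, min_z) is the origin
# of the contact slice within the plate.
min_x, max_x = 20, 35
min_y, max_y = 5, 17
min_z, max_z = 100, 140

# Allocate only the size the paw will eventually be...
contact = np.zeros((max_x - min_x + 1, max_y - min_y + 1, max_z - min_z + 1),
                   dtype=measurement_data.dtype)

# ...and subtract the offset from each sensor coordinate when copying.
for x, y, z in [(22, 7, 105), (30, 15, 130)]:  # hypothetical sensor hits
    contact[x - min_x, y - min_y, z - min_z] = measurement_data[x, y, z]
```

The extra index arithmetic per copied sensor is the computational cost mentioned above.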

ivoflipse commented 10 years ago

Well, this is interesting: after the changes I made, memory usage is at least no longer as bad as it was. When I add the measurements to PyTables, there's a spike in memory usage because the large files are being loaded.

But once I start calculating average contacts over treadmill measurements, memory usage jumps to 650 MB. For example, RawDataGaitAnalysis contains 27(!) footprints...

Perhaps I should consider calculating a true average instead of keeping all contacts in memory, since when I switch measurements, usage jumps to 1 GB+ and causes a MemoryError. This is possibly due to the change I just made, where load_contacts is called whenever a measurement is selected: we're redoing the work every time, while the memory doesn't seem to be released.

ivoflipse commented 10 years ago

I rewrote calculate_average so that it doesn't create a 4D array (basically [n_contacts, max_x, max_y, max_z]), but instead computes a 'true' average by summing over all contacts and then multiplying each pixel by 1/n_contacts. That way I don't have to keep additional copies of the contacts in memory; they can be discarded after they have been processed.

Sadly, switching between measurements still blows my memory usage out of the water (1.4 GB!), so there must be something fishy going on. Interestingly, it doesn't seem to be related to calculating averages.
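The summing approach could be sketched like this (an illustration of the idea, not the actual rewritten calculate_average; the function name and padding scheme are assumptions):

```python
import numpy as np

def true_average(contact_arrays, shape):
    """Sum each (zero-padded) contact into a single accumulator and
    divide by the count, so no 4D [n_contacts, max_x, max_y, max_z]
    array is ever created."""
    total = np.zeros(shape)
    for data in contact_arrays:  # each contact can be discarded after this
        padded = np.zeros(shape)
        x, y, z = data.shape
        padded[:x, :y, :z] = data
        total += padded
    return total * (1.0 / len(contact_arrays))

contacts = [np.ones((2, 2, 2)), 3 * np.ones((3, 3, 3))]
average = true_average(contacts, (3, 3, 3))  # average[0, 0, 0] == 2.0
```

Peak memory is now one accumulator plus one contact at a time, instead of all contacts stacked side by side.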

ivoflipse commented 10 years ago

Here's a memory profile (obtained with memory_profiler: https://github.com/fabianp/memory_profiler). It shows that calculate_average is indeed not the problem: it doesn't increase memory usage one bit, woohoo! The problem is the call to track_contacts, which gives a bump in memory of 525 MB.

Looking at it, the specific line causing problems is: for index, raw_contact in enumerate(raw_contacts): which is odd, given that raw_contacts is just a list of dictionaries containing contours. Perhaps the count refers to everything that happens inside the loop rather than that specific line. Guess I'll have to investigate!

C:\Dropbox\Development\Pawlabeling>python -m memory_profiler pawlabeling.py
Filename: pawlabeling\models\model.py

Line #    Mem usage    Increment   Line Contents
================================================
   308                                 @profile
   309                                 def calculate_average(self):
   310   660.465 MB     0.000 MB           # Empty average measurement_data
   311   660.805 MB     0.340 MB           self.average_data.clear()
   312   660.805 MB     0.000 MB           data_list = defaultdict(list)
   313
   314   660.805 MB     0.000 MB           mx = 0
   315   660.805 MB     0.000 MB           my = 0
   316   660.477 MB    -0.328 MB           mz = 0
   317                                     # Group all the measurement_data percontact
   318   660.465 MB    -0.012 MB           for measurement_name, contacts in self.contacts.items():
   319   660.805 MB     0.340 MB               for contact in contacts:
   320   660.805 MB     0.000 MB                   contact_label = contact.contact_label
   321   660.805 MB     0.000 MB                   if contact_label >= 0:
   322                                                 data_list[contact_label].append(contact.data)
   323                                                 x, y, z = contact.data.shape
   324                                                 if x > mx:
   325                                                     mx = x
   326                                                 if y > my:
   327                                                     my = y
   328                                                 if z > mz:
   329                                                     mz = z
   330
   331   660.805 MB     0.000 MB           shape = (mx, my, mz)
   332                                     # Then get the normalized measurement_data
   333   660.805 MB     0.000 MB           for contact_label, data in data_list.items():
   334                                         normalized_data = utility.calculate_average_data(data, shape)
   335                                         self.average_data[contact_label]= normalized_data

Filename: pawlabeling\models\model.py

Line #    Mem usage    Increment   Line Contents
================================================
   228                                 @profile
   229                                 def load_contacts(self):
   230                                     """
   231                                     Check if there if any measurements for this subject have already been processed
   232                                     If so, retrieve the measurement_data and convert them to a usable format
   233   135.098 MB     0.000 MB           """
   234   135.102 MB     0.004 MB           self.logger.info("Model.load_contacts: Loading all measurements for subject: {}, session: {}".format(
   235   135.102 MB     0.000 MB               self.subject_name, self.session["session_name"]))
   236
   237                                     # Make sure self.contacts is empty
   238   135.102 MB     0.000 MB           self.contacts.clear()
   239   135.102 MB     0.000 MB           self.n_max = 0
   240
   241   135.102 MB     0.000 MB           measurement_names = {}
   242   135.125 MB     0.023 MB           for m in self.measurements_table.measurements_table:
   243   135.125 MB     0.000 MB               measurement_names[m["measurement_id"]] = m["measurement_name"]
   244   135.125 MB     0.000 MB               n_max = m["maximum_value"]
   245   135.109 MB    -0.016 MB               if n_max > self.n_max:
   246   135.109 MB     0.000 MB                   self.n_max = n_max
   247
   248   135.156 MB     0.047 MB               contacts = self.get_contact_data(m)
   249   135.156 MB     0.000 MB               if contacts:
   250                                             self.contacts[m["measurement_name"]] = contacts
   251
   252   135.156 MB     0.000 MB           if self.measurement_name not in self.contacts:
   253   660.465 MB   525.309 MB               self.contacts[self.measurement_name] = self.track_contacts()
   254
   255                                     # Calculate the highest n_max and publish that
   256   660.465 MB     0.000 MB           pub.sendMessage("update_n_max", n_max=self.n_max)
   257   660.465 MB     0.000 MB           pub.sendMessage("update_contacts", contacts=self.contacts)
   258                                     # Calculate the average, after everything has been loaded
   259   660.465 MB     0.000 MB           self.calculate_average()
   260                                     # These two messages could pretty much be consolidated, possibly even the one above
   261   660.477 MB     0.012 MB           pub.sendMessage("processing_results", contacts=self.contacts, average_data=self.average_data)
   262   660.809 MB     0.332 MB           pub.sendMessage("update_contacts_tree", contacts=self.contacts)

Filename: pawlabeling\models\model.py

Line #    Mem usage    Increment   Line Contents
================================================
   264                                 @profile
   265   135.156 MB     0.000 MB       def track_contacts(self):
   266   135.160 MB     0.004 MB           pub.sendMessage("update_statusbar", status="Starting tracking")
   267                                     # Add padding to the measurement
   268   135.160 MB     0.000 MB           x = self.measurement["number_of_rows"]
   269   135.160 MB     0.000 MB           y = self.measurement["number_of_cols"]
   270   135.160 MB     0.000 MB           z = self.measurement["number_of_frames"]
   271   135.160 MB     0.000 MB           padding = configuration.padding_factor
   272   177.586 MB    42.426 MB           data = np.zeros((x + 2 * padding, y + 2 * padding, z), np.float32)
   273   177.586 MB     0.000 MB           data[padding:-padding, padding:-padding, :] = self.measurement_data
   274   177.840 MB     0.254 MB           raw_contacts = tracking.track_contours_graph(data)
   275
   276   177.840 MB     0.000 MB           contacts = []
   277                                     # Convert them to class objects
   278   662.496 MB   484.656 MB           for index, raw_contact in enumerate(raw_contacts):
   279   662.496 MB     0.000 MB               contact = contactmodel.Contact()
   280   702.859 MB    40.363 MB               contact.create_contact(contact=raw_contact, measurement_data=self.measurement_data, padding=padding)
   281   702.887 MB     0.027 MB               contact.calculate_results()
   282                                         # Give each contact the same orientation as the measurement it originates from
   283   702.887 MB     0.000 MB               contact.set_orientation(self.measurement["orientation"])
   284                                         # Skip contacts that have only been around for one frame
   285   702.887 MB     0.000 MB               if len(contact.frames) > 1:
   286   702.887 MB     0.000 MB                   contacts.append(contact)
   287
   288                                     # Sort the contacts based on their position along the first dimension
   289   702.887 MB     0.000 MB           contacts = sorted(contacts, key=lambda contact: contact.min_z)
   290                                     # Update their index
   291   702.887 MB     0.000 MB           for contact_id, contact in enumerate(contacts):
   292   702.887 MB     0.000 MB               contact.set_contact_id(contact_id)
   293
   294   702.887 MB     0.000 MB           status = "Number of contacts found: {}".format(len(contacts))
   295   702.887 MB     0.000 MB           pub.sendMessage("update_statusbar", status=status)
   296   702.887 MB     0.000 MB           return contacts
ivoflipse commented 10 years ago

Well, would you look at that. I had already identified that pre-allocating memory using the full measurement_data was inefficient, and the above shows just how right that was. Here's the profile of the slightly modified version:

C:\Dropbox\Development\Pawlabeling>python -m memory_profiler pawlabeling.py
Filename: pawlabeling\models\contactmodel.py

Line #    Mem usage    Increment   Line Contents
================================================
    79                                 @profile
    80                                 def convert_contour_to_slice(self, measurement_data):
    81                                     """
    82                                     Creates self.measurement_data which contains the pixels that are enclosed by the contour
    83                                     """
    84                                     # Create an empty array that should fit the entire contact
    85   177.609 MB     0.000 MB           # TODO reduce this size to the size of the bounding box of the contact (width, length, duration)
    86   181.230 MB     3.621 MB           self.data = np.zeros((self.width, self.height, self.length))
    87
    88   225.969 MB    44.738 MB           for index, (frame, contours) in enumerate(self.contour_list.items()):
    89   181.230 MB   -44.738 MB           # Pass a single contour as if it were a contact
    90   225.969 MB    44.738 MB               center, min_x, max_x, min_y, max_y = utility.update_bounding_box({frame: contours})
    91                                         # Get the non_zero pixels coordinates for that frame
    92   181.230 MB   -44.738 MB               pixels = np.transpose(np.nonzero(measurement_data[min_x:max_x + 1, min_y:max_y + 1, frame]))
    93                                         # Check if they are in any of the contours
    94   199.762 MB    18.531 MB               for pixel in pixels:
    95   225.969 MB    26.207 MB                   for contour in contours:
    96                                                 # Remember the coordinates are only for the slice, so we need to add padding
    97   225.969 MB     0.000 MB                       coordinate = (min_x + pixel[0], min_y + pixel[1])
    98   225.969 MB     0.000 MB                       if cv2.pointPolygonTest(contour, coordinate,0) > -1.0:
    99   225.969 MB     0.000 MB                           self.data[coordinate[0]-self.min_x, coordinate[1]-self.min_y, index] = measurement_data[
   100                                                         coordinate[0], coordinate[1], frame]

Filename: pawlabeling\models\model.py

Line #    Mem usage    Increment   Line Contents
================================================
   263                                 @profile
   264   134.691 MB     0.000 MB       def track_contacts(self):
   265   134.703 MB     0.012 MB           pub.sendMessage("update_statusbar", status="Starting tracking")
   266                                     # Add padding to the measurement
   267   134.703 MB     0.000 MB           x = self.measurement["number_of_rows"]
   268   134.703 MB     0.000 MB           y = self.measurement["number_of_cols"]
   269   134.703 MB     0.000 MB           z = self.measurement["number_of_frames"]
   270   134.703 MB     0.000 MB           padding = configuration.padding_factor
   271   177.129 MB    42.426 MB           data = np.zeros((x + 2 * padding, y + 2 * padding, z), np.float32)
   272   177.129 MB     0.000 MB           data[padding:-padding, padding:-padding, :] = self.measurement_data
   273   177.609 MB     0.480 MB           raw_contacts = tracking.track_contours_graph(data)
   274
   275   177.609 MB     0.000 MB           contacts = []
   276                                     # Convert them to class objects
   277   220.707 MB    43.098 MB           for index, raw_contact in enumerate(raw_contacts):
   278   220.707 MB     0.000 MB               contact = contactmodel.Contact()
   279   225.969 MB     5.262 MB               contact.create_contact(contact=raw_contact, measurement_data=self.measurement_data, padding=padding)
   280   225.973 MB     0.004 MB               contact.calculate_results()
   281                                         # Give each contact the same orientation as the measurement it originates from
   282   225.973 MB     0.000 MB               contact.set_orientation(self.measurement["orientation"])
   283                                         # Skip contacts that have only been around for one frame
   284   225.973 MB     0.000 MB               if len(contact.frames) > 1:
   285   225.973 MB     0.000 MB                   contacts.append(contact)
   286
   287                                     # Sort the contacts based on their position along the first dimension
   288   225.973 MB     0.000 MB           contacts = sorted(contacts, key=lambda contact: contact.min_z)
   289                                     # Update their index
   290   225.973 MB     0.000 MB           for contact_id, contact in enumerate(contacts):
   291   225.973 MB     0.000 MB               contact.set_contact_id(contact_id)
   292
   293   225.973 MB     0.000 MB           status = "Number of contacts found: {}".format(len(contacts))
   294   225.973 MB     0.000 MB           pub.sendMessage("update_statusbar", status=status)
   295   225.973 MB     0.000 MB           return contacts

Now creating 12 contacts (though there ought to be 27) only costs about 40 MB, or roughly 3.6 MB per contact. Compare that to 40 MB per contact before... So this memory problem seems solved, though clearing caches more diligently should win me a bit more.