Store the data in PyTables

ivoflipse commented 10 years ago

After discussing the matter with Todd Pataky, I've decided to store all my data in PyTables, which is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data.

PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than other solutions such as relational or object-oriented databases.

Furthermore, HDF5 is accessible from other programming languages, most notably Matlab.

Using HDF5 means I have to make a trade-off: either I have 'tables' with all subjects, sessions and measurements, like I did in iApp:

Database structure of iApp

Or I go fully hierarchical, like so:

The first option makes it really fast to calculate things over multiple subjects, because you have all the information in one table, whereas with the hierarchical option you'd have to traverse the hierarchy to get to that information. On the other hand, the hierarchical approach is much closer to how you think about the data conceptually: all these measurements belong to one subject and not to everybody.
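To make the trade-off concrete, here is a small plain-Python sketch (no PyTables involved). The field name peak_pressure and all values are made up for illustration; the point is only that the flat layout aggregates in one pass, while the nested layout has to walk every level first.

```python
# Flat layout: one list of rows that covers every subject and session.
rows = [
    {"subject": "subject_0", "session": "session_0", "peak_pressure": 10.0},
    {"subject": "subject_0", "session": "session_1", "peak_pressure": 20.0},
    {"subject": "subject_1", "session": "session_0", "peak_pressure": 30.0},
]
flat_mean = sum(row["peak_pressure"] for row in rows) / len(rows)

# Hierarchical layout: the same values nested as subject -> session -> measurements.
tree = {
    "subject_0": {"session_0": [10.0], "session_1": [20.0]},
    "subject_1": {"session_0": [30.0]},
}
# Aggregating over all subjects means traversing every level of the hierarchy.
values = [
    value
    for sessions in tree.values()
    for measurements in sessions.values()
    for value in measurements
]
tree_mean = sum(values) / len(values)
```

Both layouts give the same mean; the difference is how much traversal code sits between the data and the calculation.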

Because of the choice for PyTables, I'm leaning towards the latter option. One issue though is that I need to pick an identifier for each subject, so you can easily access the data, like so: root.subject_name.session_name.measurement_name. However, subject names are hardly unique, especially when using human names, so I'll probably have to use GUIDs or simply use some form of auto-incrementing name ("subject" + subject_count) and maintain a lookup table to make it easier to find the subject I'm looking for.
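A minimal sketch of the auto-incrementing idea, in plain Python. The class SubjectRegistry and its methods are hypothetical names used only for illustration, not code from the project, and the underscore in the generated id is an assumption to keep the name readable.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectRegistry:
    """Hypothetical lookup table that maps generated ids to entered names."""

    names_by_id: dict = field(default_factory=dict)

    def add(self, name: str) -> str:
        # "subject" + subject_count: the count is the number of subjects so far.
        subject_id = "subject_%d" % len(self.names_by_id)
        self.names_by_id[subject_id] = name
        return subject_id

    def find(self, name: str) -> list:
        # Names are not unique, so a search can return several ids.
        return [sid for sid, n in self.names_by_id.items() if n == name]


registry = SubjectRegistry()
first = registry.add("Rex")
second = registry.add("Rex")
matches = registry.find("Rex")
```

The generated ids stay valid attribute names, so a path such as root.subject_0 keeps working even when two subjects share a name.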

lynforster commented 10 years ago

One benefit of the latter is that in addition to walk/run subcategories, you could have day 0, after surgery, after treatment, 6 months check up etc. i.e. track a single subject's data long term. Useful for clinical cases (but yes, less useful for analysis of groups where the table option would be easier...)

ivoflipse commented 10 years ago

So I seem to have figured out a reasonable structure.

Basically every level of nesting has a table that gives an overview of all the groups that are on that level of nesting and contains meta-data which can be used to search for items. The groups get an id, which basically auto-increments, to keep the namespace clean, so in order to find the item you're looking for, you look it up in the table and then get the group you need.

This boils down to a hierarchy like this:

To prevent duplication I've added the requirement to create a subject with a unique combination of first name, last name and birthday. All the others use the name supplied by the user as a unique identifier.
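The structure described above could look roughly like this in PyTables. This is a sketch under assumptions: the column layout in SubjectInfo, the helper create_subject, the id format and the example subject are all hypothetical. Only the PyTables calls themselves (open_file, create_table, create_group, list_nodes) are the library's actual API.

```python
import os
import tempfile

import tables


class SubjectInfo(tables.IsDescription):
    # Hypothetical columns for the overview table at the subject level.
    subject_id = tables.StringCol(32)
    first_name = tables.StringCol(32)
    last_name = tables.StringCol(32)
    birthday = tables.StringCol(10)  # e.g. "2010-05-01"


def create_subject(h5, first_name, last_name, birthday):
    """Add a row to the overview table plus a matching group; return its id."""
    table = h5.root.subjects
    key = (first_name.encode(), last_name.encode(), birthday.encode())
    # Enforce the unique combination of first name, last name and birthday.
    for row in table.iterrows():
        if (row["first_name"], row["last_name"], row["birthday"]) == key:
            raise ValueError("subject already exists")
    # Auto-incrementing id keeps the group namespace clean.
    subject_id = "subject_%d" % table.nrows
    new_row = table.row
    new_row["subject_id"] = subject_id
    new_row["first_name"] = first_name
    new_row["last_name"] = last_name
    new_row["birthday"] = birthday
    new_row.append()
    table.flush()
    h5.create_group(h5.root, subject_id)
    return subject_id


path = os.path.join(tempfile.mkdtemp(), "sketch.h5")
with tables.open_file(path, mode="w") as h5:
    h5.create_table(h5.root, "subjects", SubjectInfo)
    new_id = create_subject(h5, "Rex", "Doe", "2010-05-01")
    try:
        create_subject(h5, "Rex", "Doe", "2010-05-01")
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True
    group_names = [g._v_name for g in h5.list_nodes(h5.root, classname="Group")]
```

The same pattern would repeat one level down: each subject group gets its own sessions table and session groups, and each session group its own measurements table.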

Now I just need to tie it into the models I had, so it no longer stores on disk, but instead creates this table. I'll also need some sort of database widget, because the user needs to be able to supply some subject info.

ivoflipse commented 10 years ago

@lynforster The names were entered manually, so researchers are free to pick whatever they want.

I've also added a table, where you can store the id of a session and some label. That could be used to mark all sessions that are part of a study or label it with different protocol variables.

ivoflipse commented 10 years ago

I've been adding a database tab to make it a bit more structured to create subjects, sessions and add measurements to them:

It's still very much a work in progress, but the process of making a subject and a session seems to work (apart from some rough edges). I've also just added a tree to browse to the folder that contains the measurements for a given session. The next step is to load these measurements and store them in PyTables, and then to extract the contacts and store those, but that's a problem for later.

One problem is that in the current state the rest of the application is broken, because of some changes I've made, so I really need to pull through and get this working.

ivoflipse commented 10 years ago

And yet another step closer:

I've added some widgets which would be needed to store the measurements, such as the brand of the plate/model/frequency etc.

Storing the data in PyTables also seems to work:

While there are still some details to work out, it's getting somewhere and it shouldn't take too long to re-enable the processing widget (which currently is no longer able to find measurements). Some things I have to work on:

Give feedback on whether a subject and session have been selected. This is necessary, because otherwise I don't have their IDs in memory to create a measurement.
Make the dropdown choices look nicer and use the codified version only in the background.
Add some way for people to add more brands (if they're supported), models and frequencies to their config file.
Start searching as soon as you type in any of the fields. This might not work so well yet, because the search functionality I have expects you to look for first/last names, but I'll drop that requirement.
Use color feedback to indicate whether a subject/session etc. already contains any data, without having to select them and check what is in there. Perhaps use different colors to indicate whether they have already been processed too.
Autocomplete fields based on previous entries. I'm lazy and I really dislike having to type if I don't have to.

Also, I should note that you can paste the path of the folder that contains the measurements for a given session into the field on the right, and it will automatically load the files in that folder (if any). Clicking the button next to it brings up a file dialog. The default location is stored in the config file, but I'll probably add a settings tab, so you can easily edit it using a GUI. I'll also figure out how to remember some of the choices made, such as picking a brand, probably by writing any changes back to the config file.

ivoflipse commented 10 years ago

Getting one step closer:

I've managed to store the contact data and restoring is semi-working again, so making progress :+1:

ivoflipse commented 10 years ago

While I'm still not 100% satisfied with the workflow, at least I can load results from PyTables now:

Well, at least on the processing tab that is. There are still some issues, mostly due to the order in which things are done. I tried decoupling a lot of things, but in the end that just results in a ton of additional bookkeeping, because functions don't know what measurement or session we're in.

Another issue is calculating the average, because this spans multiple measurements and the hierarchy as described above doesn't allow for an intuitive way to iterate over all contacts in a single session. Furthermore, my previous implementation just kept this structure in a dictionary, making it easy to look things up. Now, to extract a simple Contact object, I need to iterate over the rows in the contact table AND over all arrays of that contact, combine the two results for each contact and rebuild a Contact instance.
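The rebuild step described above could be sketched like this. The Contact dataclass, the ContactInfo columns and the layout (one contacts table per measurement plus one group of arrays per contact) are assumptions for illustration, not the project's actual classes; the PyTables calls used (get_node, iter_nodes, create_array, read) do exist.

```python
import os
import tempfile
from dataclasses import dataclass

import numpy as np
import tables


class ContactInfo(tables.IsDescription):
    # Hypothetical metadata columns, one row per contact.
    contact_id = tables.StringCol(32)
    contact_label = tables.Int32Col()


@dataclass
class Contact:
    contact_id: str
    contact_label: int
    arrays: dict


def load_contacts(h5, measurement_group):
    """Combine each table row with the arrays stored in that contact's group."""
    contacts = []
    for row in measurement_group.contacts.iterrows():
        contact_id = row["contact_id"].decode()
        label = int(row["contact_label"])
        group = h5.get_node(measurement_group, contact_id)
        arrays = {
            node._v_name: node.read()
            for node in h5.iter_nodes(group, classname="Array")
        }
        contacts.append(Contact(contact_id, label, arrays))
    return contacts


path = os.path.join(tempfile.mkdtemp(), "contacts.h5")
with tables.open_file(path, mode="w") as h5:
    measurement = h5.create_group(h5.root, "measurement_0")
    table = h5.create_table(measurement, "contacts", ContactInfo)
    row = table.row
    row["contact_id"] = "contact_0"
    row["contact_label"] = 2
    row.append()
    table.flush()
    contact_group = h5.create_group(measurement, "contact_0")
    h5.create_array(contact_group, "data", np.ones((3, 4, 5)))
    contacts = load_contacts(h5, measurement)
```

Calling load_contacts once per measurement group in a session would give the full set of contacts needed for the average, at the cost of the extra traversal mentioned above.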

All of this has to happen without depending on the measurement the GUI thinks we're in, because selecting it through the same route would trigger UI events just to retrieve the data. Worse, to track contacts I also need the measurement data, so I either end up mixing different levels of abstraction or splitting things up, at which point it becomes hard to control the order in which things happen.

I think I'll refactor the code tomorrow, such that it will load all contacts it can find in the tables for the given session and if the currently selected measurement's contacts aren't present, track those. All in one spot, without breaking things up into 'neat', 'reusable' parts, until I can actually separate the concerns properly.
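The "all in one spot" plan could be as small as the following sketch. The function contacts_for_session and the track callable are hypothetical stand-ins: stored represents whatever was found in the tables for the session, and track represents the existing contact tracking code.

```python
def contacts_for_session(stored, selected_measurement, track):
    """Return {measurement_name: contacts} for one session.

    stored: contacts already found in the tables, keyed by measurement name.
    track: hypothetical callable that detects contacts for one measurement.
    """
    contacts = dict(stored)  # copy, so the caller's dictionary is not modified
    if selected_measurement not in contacts:
        # Only track when the selected measurement has nothing stored yet.
        contacts[selected_measurement] = track(selected_measurement)
    return contacts


stored = {"measurement_0": ["contact_0", "contact_1"]}
result = contacts_for_session(
    stored, "measurement_1", lambda name: [name + "_contact_0"]
)
```

Because the selected measurement is passed in as an argument, nothing here depends on GUI state, which is the coupling the previous comment complains about.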

ivoflipse commented 10 years ago

Given PyTables seems 100% integrated, I'm inclined to close this issue. I still need to get used to updating elements, whenever things change after I've initially stored them, but otherwise it seems to work alright.

ivoflipse / Pawlabeling

Store the data in PyTables #54