openworm / open-worm-analysis-toolbox

A testing pipeline that allows us to run a behavioural phenotyping of our virtual worm running the same test statistics the Schafer Lab used on their worm data.
Other
48 stars 27 forks source link

Refactor WormFeatures by feature type instead of category #133

Open MichaelCurrie opened 9 years ago

MichaelCurrie commented 9 years ago

Right now our code breaks down features by "category":

But these are really just labels we attach to group the features for reporting purposes. As far as actual calculation, what's more important is the "type":

One can imagine that code common to all event-based features would be best grouped into one class, rather than right now, when coils are in the LocomotionFeature class and omegas, upsilons, and motion_events are in PostureFeature class. Perhaps all four of the event-based features belong in an EventBasedFeature class.

See the individual features and which bucket they fit into in these two schemas in the supplemental documentation.

This is not urgent, or even necessary, but in the future it may be better to group these functionally rather than by category.

JimHokanson commented 9 years ago

Yesterday I was thinking of nixing this issue. Today I think we should make it a higher priority.

I would specify the issue differently.

I'd like to be able to process certain features so that I can test features which are not working see #114 . Currently we can't ignore features. This will also become important with handling skeleton only inputs.

Here's the new layout, I welcome feedback.

Somewhere we have a list of features. This currently exists in feature_metadata. We might create another list and then a unit test that makes sure entries in 1 exist in the other.

In the new list, we have a list of features and their dependencies. My preference would be to have the list order determine computation order rather than trying to determine this in code. The dependencies list would allow us to check that previous entries were enabled and enable them if necessary.

i.e. something like: 1) morphology.length 2) morphology.width 3) morphology.area 4) morphology.area_per_length | morphology.area & morphology.length

We then compute 1, then 2, etc. If we only want 1 & 4 it is relatively trivial to determine that we need 3 as well. This is less difficult that creating the order in code.

WormFeatures get's rewritten as just an enumeration over the list of features. I'm not sure where to store the resulting features. Perhaps WormFeatures gets renamed and then features becomes a dictionary ... (Feedback needed)

Each feature would be rewritten as a class with relevant methods (comparison, saving, loading, explaining/plotting, etc)

In the beginning of each constructor, all relevant dependencies would be extracted from the WormFeatures (or new container) class. This would allow the constructor to grab the relevant data, but would ensure that we don't nest dependencies deep in the calculation code.

I have a class called FeatureProcessingOptions which I would then instrument with methods like ignoreCategory('Morphology') or onlyRun(['morphology.length','morphology.area_per_length']) in addition to other options that currently exist.

I'm thinking of starting a branch for this work. Most of the computation code would go unchanged but how we organize the top level would be changed significantly.

Thoughts?

ver228 commented 9 years ago

Hello,

Talking with Andre about the failed tests, the mismatches in "locomotion.turns.upsilons.start_frames" and in "locomotion.turns", might be because a different version of the code was used in the database creation. If you guys thing this could be the reason, I could take care of it.

About the reorganisation. I like the idea to have the features as a dictionary. This is more similar to the MATLAB structures since they are dynamic data types, and we could even introduce new features without having to change the class definition. Obviously, there is the chance that spurious keys could appear if one is not careful, but I think this is a lesser concern (it could have happen too with the original MATLAB structures). Are we using nested dictionaries, or the keys on the dictionary would be something like 'locomotion.turns.upsilons.start_frames'? I would prefer the later.

I really like the idea of having each feature as a separated class. I think it makes more sense. For example, let's say I have a tracker that only gives me the worm positions. It would be nice to call a single class/function to extract the path features from the x-y coordinates.

JimHokanson commented 9 years ago

@ver228

The values were correct at some point. Either a bug was introduced at some point or the translation from my Matlab code was wrong and code was introduced in the Python version that made the error that was always there known. I'll hopefully have it fixed soon.

The dictionary would go directly to the feature, although it wouldn't go as deep as you are suggesting. start_frames is an attribute of the upsilons feature, not a feature in itself. Thus the dictionary entry would be: 'locomotion.turns.upsilons' or maybe even changed to 'locomotion.upsilons'

I had originally discouraged this refactoring during our code translation from Matlab to Python. Now that the feature code is relatively well established I think it is time to make it a bit more user friendly.

The goal will be to make what you are suggesting regarding path features relatively easy.

MichaelCurrie commented 9 years ago

@JimHokanson p.s. I think you know this but I found an example of getting a class instance to behave like a dict: the Pandas dataframe. The "key" (no pun intended) is to define a __getitem__ method.

You referenced this StackOverflow question to me in our recent conversation so I'll just throw it in here for posterity.

MichaelCurrie commented 9 years ago

I just realized that another benefit of having each of the features "know" its type is that it simplifies the HistogramManager class. Right now, say we have a list of WormFeatures instances wf1, wf2, wf3. We would initialize HistogramManager like this:

histograms = HistogramManager([wf1, wf2, wf3])

Then HistogramManager will create the simple histograms, the movement histograms, and the event histograms, for each of wf1, wf2, wf3, and then merge them for each feature.

Instead if wf1 is a list of features, each of which knows how to create its own histograms (via a create_histograms() method, say), it would simplify HistogramManager and keep it from having to know anything about the histograms at all except for the need to merge them together.

JimHokanson commented 9 years ago

I worked on this a bit tonight. I'm adding the code directly to the master branch so that things stay up to date. There is a new test file in the examples that creates the morphology features via iteration. It still needs a lot of work and will probably need some refactoring once completed but this should make a lot of things cleaner.

JimHokanson commented 9 years ago

An update:

The new code can be run at: test_new_feature_organization in the examples folder. It will currently stop in the debugger after all of the features have been computed.

The code is incomplete but it should give some idea of where things are headed.

Outline: WormFeaturesDos (will soon be new WormFeatures) loads specs. Each spec specifies a feature name and how to instantiate it (module and class name). Each spec is iterated over and created and held in a dictionary and list (the list is probably temporary, I just wanted something ordered for testing). A feature itself should accept as its input (currently to the constructor) the WormFeatures instance as well as potentially an optional flag that can be used however the function sees fit (primarily for calling the same code for different worm sections). All "real" features should have an attribute "value" which is its final value, typically a numpy array. Intermediate features that are only used for calculating later features can hold attributes that are used for later features and don't need a "value" attribute. Some intermediate features are sort of throw away information (e.g. orientation), and others are main information where the final "real" feature only grabs one of its attributes. This is true of the event features in which the event class is first calculated and then the real features just grab individual attributes of the event instance.

Benefits: 1) The resulting features can be iterated over very easily 2) We will soon have the ability to compute only the desired features with lazy loading of dependencies 3) This should make our histogram code cleaner. It may also help with documentation of features. 4) We could move options to the specification, which would make modification cleaner 5) Code for all features could be incorporated into the caller (such as timing), cleaning up the code a bit. 6) Comparisons should become simpler due to inheritance

Big issues left to handle: 1) Build in support for recreating the features from file 2) finish _get_feature_dependency function. We may want to add normalized worms attributes as features 3) Split the final features from intermediate calculations 4) Determine how we want to handle histogram support @MichaelCurrie This should be pretty easy to do. I'd like to build this in ASAP so that we can ditch the old code. I'm not %100 sure we want to change any inheritance for this as much as we might want to introduce an attribute. 5) get feature name from spec, not from hardcoding it in constructors

@clinzy This new code will hopefully be more amenable to documentation ... I've reused classes where possible but some of the old classes are going to be thrown away. I can help with moving code if necessary to bring things into the new approach.

MichaelCurrie commented 7 years ago

@JimHokanson is it safe to close this issue or are the "Big Issues left to handle" still holding this up? I think the fact that the features are calculated in this new way should not change downstream processing much, am I right?

Thanks