openworm / movement_cloud

Movement Analysis on the cloud
http://movement.openworm.org/
Other
5 stars 4 forks source link

Refactor features_means #38

Open MichaelCurrie opened 7 years ago

MichaelCurrie commented 7 years ago
  1. Create a features table containing descriptions of each feature (possibly based on https://github.com/openworm/open-worm-analysis-toolbox/blob/master/open_worm_analysis_toolbox/features/feature_metadata/features_list.csv) (DONE)
  2. Add information about a "top 10" features specifically referencing the "default core features" (DONE)
  3. Replace https://github.com/openworm/movement_cloud/blob/dev/webworm/defaultCoreFeatures.json with a view on the features table.
  4. Have features_means reference this features table as a foreign key
  5. Link features_means to experiments (https://github.com/openworm/movement_cloud/issues/14)
  6. Make a schema change PR! (before I do, changes will be live in the MySQL server without being reflected in the schema tracked by version control!)
MichaelCurrie commented 7 years ago

I've added the following table and populated it with all the feature names:

CREATE TABLE features 
    (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `name` VARCHAR(50),
        `description` VARCHAR(200),
        `is_core_feature` BIT,
        PRIMARY KEY (`id`),
        UNIQUE KEY `name` (`name`)
    ) DEFAULT CHARSET=utf8;

-- Get column names from features_means
INSERT INTO features (`name`)
    SELECT COLUMN_NAME AS 'name'
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'features_means';

I also inserted a "nosuchname" feature matching @cheelee's work:

INSERT INTO features (`name`)
    SELECT 'nosuchname';

Then I added all of @cheelee's feature descriptions from https://github.com/openworm/movement_cloud/blob/dev/webworm/defaultCoreFeatures.json

via some simple data munging. First I stripped the defaultCoreFeatures.json of the outer wrapping to start at the [ array character. Then I did:

import pandas as pd
df = pd.read_json('defaultCoreFeatures.json.txt')
df.to_csv('results.csv')

Then I used excel formulas to create the following lines of code in bulk:

UPDATE features SET is_core_feature = 1, description = 'the length of the worm''s midline' WHERE `name` = 'length';
UPDATE features SET is_core_feature = 1, description = 'the area of the worm' WHERE `name` = 'area';
UPDATE features SET is_core_feature = 1, description = 'the absolute value of the worm''s midbody bend angle' WHERE `name` = 'midbody_bend_mean_abs';
UPDATE features SET is_core_feature = 1, description = 'the maximum distance between the points on the worm body that are farthest from the line connecting the head and the tail' WHERE `name` = 'max_amplitude';
UPDATE features SET is_core_feature = 1, description = 'the absolute value of the worm''s head speed (will capture some contribution from the high frequency ''foraging'' motion of the worm''s head)' WHERE `name` = 'head_tip_speed_abs';
UPDATE features SET is_core_feature = 1, description = 'the absolute value of the worm''s midbody speed' WHERE `name` = 'midbody_speed_abs';
UPDATE features SET is_core_feature = 1, description = 'the farthest distance recorded between the worm''s staring point and any point on its trajectory during the recording' WHERE `name` = 'path_range';
UPDATE features SET is_core_feature = 1, description = 'the frequency of forward motion bouts' WHERE `name` = 'forward_motion_frequency';
UPDATE features SET is_core_feature = 1, description = 'the frequency of pauses (where the worm doesn''t move much)' WHERE `name` = 'paused_motion_frequency';
UPDATE features SET is_core_feature = 0, description = 'Non-Existent entry for Testing Purposes.' WHERE `name` = 'nosuchname';
UPDATE features SET is_core_feature = 1, description = 'the frequency of reversals (where the worm moves backward for some time)' WHERE `name` = 'backward_motion_frequency';
MichaelCurrie commented 7 years ago

@cheelee as an illustration, here's Python code to get the core features from the new database table:

import mysql.connector as sql
import pandas as pd

password_str = passwords.passwords_dict['mcurrie_mysql']

db_connection = sql.connect(host='localhost', database='mrc_db4', user='mcurrie_mysql', password=password_str)
db_cursor = db_connection.cursor()
db_cursor.execute("SELECT name FROM features WHERE is_core_feature = 1 OR `name` = 'nosuchname';")
table_rows = db_cursor.fetchall()
df = pd.DataFrame(table_rows)
print(df)

You could use this (or better, a Django query into the database) to replace your coreFeatures = json.loads(open('webworm/defaultCoreFeatures.json').read()); command. Or I will do this, later, when I figure out how to do Django database queries.

cheelee commented 7 years ago

Ok at this stage, this will require the attention of the three of us -

  1. @ver228 - would you please merge @MichaelCurrie 's new "features" table (renamed to "core_features"?) to the cumulative test database we have (merge_worm_db_111017_1622.sql)?
  2. I can independently work on a version of mrc_db4 on a local branch with the necessary Django code to process the core features table.
ver228 commented 7 years ago

The merge_worm_db_111017_1622 is a dump of my local database. We can delete it. What do you think is the best approach? Should we create a development DB and make any update directly to AWS or should I do modifications locally and merge them later? At this point any change should be data updates. Hopefully no more changes to the schema.

cheelee commented 7 years ago

Ok I think I got confused previously. So the current database mrc_db4 is the fully merged database with all the features and data we need?

If that is the case, I will need to redo what I did using a mysqldump of mrc_db4 on my local machine. An .sql file was what I was expecting earlier, instead of the actual database. Sorry about that!

cheelee commented 7 years ago

@ver228 On the issue of schema changes, we may still eventually need to consider changes if we need to deal with performance issues. I'm currently not too concerned about performance just yet.

cheelee commented 7 years ago

Ok I think I'm starting to get a better understanding about all of this now - the "features" table is primarily a way to store information about a feature's description, along with whether or not that feature is a core feature.

@MichaelCurrie Does this mean that the features table must be pre-generated using the fields of the features_means table, and then populated (description + whether or not it is core) using an external data source (e.g. the JSON file?) This also means we do not have to determine if a field is missing from the database, do we? I think this workflow should be fine as long as information in the features table is optional (i.e. will not break anything if missing) - the only thing is that the interface may either have to trust that the information in the database is consistent (e.g. rows in the features table must be found in the fields of the features_means table;) or conduct its own check at the interface code.

cheelee commented 7 years ago

@MichaelCurrie @ver228 I've modified the Django tool to pull core features information directly from the database now. This is now on the live development server.

MichaelCurrie commented 7 years ago

Does this mean that the features table must be pre-generated using the fields of the features_means table, and then populated (description + whether or not it is core) using an external data source (e.g. the JSON file?)

Yes, that's exactly what I did. But we only need to do this once.

This also means we do not have to determine if a field is missing from the database, do we?

Yes to be 100% we should have an audit step that throws an error if someone adds a feature mean that does not have an entry in the features table. But as you say this kind of checking is probably too fancy for us for now.

the only thing is that the interface may either have to trust that the information in the database is consistent (e.g. rows in the features table must be found in the fields of the features_means table;) or conduct its own check at the interface code.

Yep. At some point we'll want to start unit testing the database in various ways. But for now, I think we are okay.

MichaelCurrie commented 7 years ago

We discussed at the meeting just now that refactoring features_means from having 700+ columns to being a key-value pair table will break @ver228 's pandas scripts so I will hold off doing this until next week. @ver228 please let me know when I can proceed, thanks!

cheelee commented 6 years ago

Should we discuss this again at some point?