Creating new components from categorical data

glue-viz / glue

Linked Data Visualizations Across Multiple Files

http://glueviz.org

Creating new components from categorical data #613

Open astrofrog opened 9 years ago

astrofrog commented 9 years ago

I have a use case where I am reading in a table in which one of the columns is a date/time string (note that for this specific use case, things might change after https://github.com/glue-viz/glue/pull/565, but the fundamental problem still stands).

I wrote a function to convert these to floating-point values:

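import numpy as np
from datetime import datetime
from dateutil.parser import parse

# Reference instant: results are seconds elapsed since this date.
EPOCH = datetime(2015, 1, 1, 0, 0, 0)

def date_to_float_scalar(date):
    # Parse a single date/time string and return seconds since EPOCH.
    d = parse(date)
    dt = d - EPOCH
    return dt.total_seconds()

# otypes=[float] fixes the output dtype and allows empty input arrays.
date_to_float = np.vectorize(date_to_float_scalar, otypes=[float])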

I then tried using this function inside 'define new component', but what gets passed to the function is not the string values but the underlying floating-point values that glue has assigned to them.

Should we change this to always pass the string values in these cases? The numerical value is internal to glue, and outside functions/clients shouldn't really have to know about it.
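
To make the mismatch concrete, here is a minimal sketch of how a categorical encoding can hide the original strings from a user function. It assumes an encoding where each unique label is replaced by its index among the sorted unique labels, stored as a float; this is an illustration of the general idea, not necessarily how glue assigns its internal values, and the sample dates are made up.

```python
import numpy as np

# Hypothetical column of date strings, as read from a table.
dates = np.array(["2015-01-03", "2015-01-01", "2015-01-03", "2015-01-02"])

# Assumed encoding (not confirmed to be glue's): each unique label is
# mapped to its position among the sorted unique labels.
labels, inverse = np.unique(dates, return_inverse=True)
codes = inverse.astype(float)

# A user function handed `codes` sees numbers, not parseable date strings.
# Indexing the label table with the codes recovers the original strings.
recovered = labels[codes.astype(int)]
```

With an encoding like this, a function such as date_to_float would need the `recovered` strings (or the raw column) rather than `codes` to do anything meaningful.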

astrofrog commented 9 years ago

Actually, this raises a philosophical question as to whether all string data is categorical. For example, in some tables floating-point data may be represented as strings. I think what would be nice, and is related to https://github.com/glue-viz/glue/issues/603, is if we always preserved the original raw data and then had any interpretation of the data (e.g. categorical, float, int, etc.) be a layer on top of that, as a kind of 'view'. You could then load in string data and choose whether to treat it as categorical, float, int, etc. When calling custom functions in 'define new component', one could then always pass the raw data so that the user is not surprised. Or we could pass both the raw data and the current view of the data, and the user can choose which to use.
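
A rough sketch of the 'raw data plus views' idea, to show what such a layer might look like. The `RawColumn` class and `apply_custom` helper are hypothetical names invented for this illustration and do not correspond to any existing glue API.

```python
import numpy as np

class RawColumn:
    """Hypothetical container: stores the raw strings and derives views on demand."""

    def __init__(self, values):
        # The raw data is kept untouched as an array of strings.
        self.raw = np.asarray(values, dtype=str)

    def as_float(self):
        # Float view: only valid if every string parses as a number.
        return self.raw.astype(float)

    def as_categorical(self):
        # Categorical view: sorted unique labels plus an integer code per row.
        labels, codes = np.unique(self.raw, return_inverse=True)
        return labels, codes

def apply_custom(func, column):
    # Custom functions always receive the raw values, never the codes.
    return func(column.raw)

col = RawColumn(["1.5", "2.0", "1.5"])
floats = col.as_float()
labels, codes = col.as_categorical()
lengths = apply_custom(np.char.str_len, col)
```

The same column can be read as floats or as categories without either interpretation overwriting the original strings, which is the property the proposal relies on.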

makmanalp commented 9 years ago

"Actually this raises a philosophical question as to whether all string data is categorical"

This is an awesome point that I always struggle to explain to people. I think what sucks is when people pick an arbitrary number like 8 and then decide that string fields with fewer than 8 unique values are categorical, while those with 8 or more are just strings. This leads to unexpected behavior that makes users angry when they don't know why the same visualization looks different for some datasets.
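
As a small illustration of why such a cutoff surprises people, here is a sketch of a threshold heuristic. The function name `guess_kind` and the default of 8 are made up for this example and are not taken from any particular tool.

```python
def guess_kind(values, max_categories=8):
    """Hypothetical heuristic: 'categorical' if fewer than max_categories unique values."""
    n_unique = len(set(values))
    return "categorical" if n_unique < max_categories else "string"

# Two nearly identical columns that differ by a single extra unique value.
seven = list("abcdefg")
eight = list("abcdefgh")

kind_seven = guess_kind(seven)
kind_eight = guess_kind(eight)
```

Adding one more distinct value flips the classification, so two datasets that look the same to the user can end up being handled differently.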

Similarly with auto-casting types like you have in the other bug: It's great when it works, and immensely frustrating when it doesn't.

astrofrog commented 9 years ago

@makmanalp - thanks for the feedback :) I'll continue some of this discussion in #603