vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

Calculated Field / Formula Transform #451

Closed domoritz closed 9 years ago

domoritz commented 9 years ago

The syntax for expressions is odd at the moment. There are multiple options for refactoring it.

Currently:

{
  name: 'Release_Date',
  type: 'T',
  fn: 'year'
}

Rename fn to function or as:

{
  name: 'Release_Date',
  type: 'T',
  as: 'year'
}

Rename it to map or accessor and allow more functions:

{
  name: 'Release_Date',
  type: 'T',
  map: 'year'
}

Or change how fields are mapped to encodings entirely. This would also allow other expressions, such as ratios.

{
  value: 'year(Release_Date)',
  type: 'T'
}

See discussion in #447

kanitw commented 9 years ago

Oh now I get what you mean. So when we use value:, we eliminate name: for that field, right?
If yes, I think that's an interesting idea. There are a couple of interesting decisions to make.

domoritz commented 9 years ago

Yes, we would get rid of name. The issue with d. would not be a problem in Vega 2 but would be in Vega 1, which is a bit annoying. I was thinking only about the ideal API and not so much about the implementation at this point.

As an intermediate step, we could say that we only support simple function calls and maybe ratios and write a simple regex parser for it. However, I'm not really sure about all the implications of my suggestion yet.
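A minimal sketch of what such a regex-based parser might look like, assuming the value string is restricted to a bare field, a single function call on one field, or a ratio of two fields. The names parseValue, CALL_RE, RATIO_RE, and FIELD_RE are hypothetical and are not part of Vega-Lite.

```javascript
// Hypothetical patterns for the restricted `value` grammar.
const IDENT = '[A-Za-z_]\\w*';
const CALL_RE = new RegExp('^\\s*(' + IDENT + ')\\s*\\(\\s*(' + IDENT + ')\\s*\\)\\s*$');
const RATIO_RE = new RegExp('^\\s*(' + IDENT + ')\\s*/\\s*(' + IDENT + ')\\s*$');
const FIELD_RE = new RegExp('^\\s*(' + IDENT + ')\\s*$');

// Parse a `value` string into a small description object.
// Anything outside the three supported forms is rejected.
function parseValue(value) {
  let m;
  if ((m = CALL_RE.exec(value))) {
    return {kind: 'call', fn: m[1], field: m[2]};
  }
  if ((m = RATIO_RE.exec(value))) {
    return {kind: 'ratio', numerator: m[1], denominator: m[2]};
  }
  if ((m = FIELD_RE.exec(value))) {
    return {kind: 'field', field: m[1]};
  }
  throw new Error('Unsupported expression: ' + value);
}

// Example: the expression from the proposal above.
const parsed = parseValue('year(Release_Date)');
```

Rejecting everything else keeps the supported surface small, which is the point of treating this as an intermediate step rather than a full expression language.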

kanitw commented 9 years ago

It's worth noting that T (temporal/Date) and Q (quant/number) should support different sets of operations.

For T, most of the time, you don't want to support complex calculations. (Even + doesn't make much sense.) You just want to abstract the time value. And we need to know the function name to predict cardinality, etc. (This might change a lot with Vega 2.)

For Q, it's more free form and most of the time, we don't care much about cardinality (unless you cast it to be O).

But yes, from the user perspective, the distinction might make things more complicated.
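To illustrate the point about needing the function name to predict cardinality for T fields, here is a rough sketch. The table and the maxCardinality helper are hypothetical, and the function names are assumptions (day is taken to mean day of week and date to mean day of month); they are not taken from the Vega-Lite source.

```javascript
// Hypothetical upper bounds on the number of distinct values
// produced by each time abstraction function.
const MAX_CARDINALITY = {
  seconds: 60,
  minutes: 60,
  hours: 24,
  day: 7,    // assumed: day of week
  date: 31,  // assumed: day of month
  month: 12
};

// Return the known upper bound, or null when the bound depends on the data
// (for example `year`, or any function not in the table).
function maxCardinality(fn) {
  return Object.prototype.hasOwnProperty.call(MAX_CARDINALITY, fn)
    ? MAX_CARDINALITY[fn]
    : null;
}
```

With a free-form expression string there is no function name to look up, which is why the distinction between T and Q matters here.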

kanitw commented 9 years ago

Another thought.

Is it kinda weird that aggregations are not included in the expression expr?

For example,

{'aggregation':'min', value:'abs(foo)'}

domoritz commented 9 years ago

I also see how this would make it harder for polestar because you cannot have a dropdown for all available functions.

domoritz commented 9 years ago

Is it kinda weird that aggregations are not included in the expression expr?

I thought about this but tbh, what SQL does is super confusing. I think separating aggregations and scalar functions makes sense. Also, if you want to reason about behavior, it's good to have them separate.

kanitw commented 9 years ago

I also see how this would make it harder for polestar because you cannot have a dropdown for all available functions.

Starting from the UI, I was thinking about adding derive to .data, which is more like Tableau's model, where you can add additional custom fields to the data and name them. This keeps the encoding clean, as the encoding only refers to names. (And maybe add common derive functions such as time abstraction.)

By doing this, we somewhat decouple data manipulation from encoding (except the final aggregation/group-by).

domoritz commented 9 years ago

I think of scalar functions as simple mappings that don't have huge effects. I see how that is not necessarily true because of cardinality estimation but I still don't think this justifies the model of derived fields.

kanitw commented 9 years ago

The derived-field model is not justified by cardinality. It's more for supporting the UI and its common data manager, though it may be inferior for developers.

That said, if we choose to use value, we just need to do the same thing in vlui to manage the naming, but it makes writing Vega-Lite by hand easier.

domoritz commented 9 years ago

That said, if we choose to use value, we just need to do the same thing in vlui to manage the naming, but it makes writing Vega-Lite by hand easier.

What do you mean by "manage the naming"?

kanitw commented 9 years ago

I mean for applications like Polestar and Voyager. When a user creates a new derived variable, you need to name it in order to refer to it.

However, if we implement the value model in Vega-Lite, this step is skipped because we don't need names in the spec. We would just create those derived names on the fly and use anonymous names.

kanitw commented 9 years ago

@domoritz I discussed this with @jheer and we plan to go with the derive model, as Vega-Lite’s goal is more to support programmatic generation. That said, the derive model isn’t particularly painful for handwritten code anyway.

For fn, we will name it as timeUnit to be consistent with datalib.

kanitw commented 9 years ago

Let’s add a property to data

formula: [
   {field: <field_name>, expr: <Vega expression>}
]

domoritz commented 9 years ago

I don't know why you prefer the derive model. Can you elaborate?

kanitw commented 9 years ago

@domoritz For the record, in #631 you state that

One of the goals of vl is that it can be generated automatically and that it can be supported by other languages (e.g. python). If we have expressions in fields in the encDef, this would be much harder. If we keep it in the data, we can say that we only support formula transforms in js. In other languages, you would do the transformations beforehand.

So I guess you now agree with the decision to go with "derive" model first and we can consider if we want to add the expression model later.

kanitw commented 9 years ago

The “derive” model is implemented as data.formula in #631. The only issue is that stats needs to be provided, but we plan to eliminate that in #648. Therefore this issue can be closed.
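For readers following along, here is a rough sketch of how the derive model separates data manipulation from encoding. It assumes inline rows under data.values and uses new Function as a stand-in for the Vega expression compiler, so it only handles plain JavaScript expressions over datum. The applyFormulas helper and the example spec are hypothetical and are not the implementation from #631.

```javascript
// Hypothetical helper: add each derived field to a copy of every row.
// `new Function` stands in for the real Vega expression parser.
function applyFormulas(rows, formulas) {
  const compiled = formulas.map(function (f) {
    return {
      field: f.field,
      fn: new Function('datum', 'return (' + f.expr + ');')
    };
  });
  return rows.map(function (row) {
    const out = Object.assign({}, row);
    // Later formulas can refer to fields derived by earlier ones.
    compiled.forEach(function (c) {
      out[c.field] = c.fn(out);
    });
    return out;
  });
}

// Illustrative spec: the derived field lives in `data`,
// and the encoding refers to it only by name.
const spec = {
  data: {
    values: [{a: 6, b: 3}, {a: 1, b: 4}],
    formula: [{field: 'ratio', expr: 'datum.a / datum.b'}]
  },
  encoding: {
    y: {name: 'ratio', type: 'Q'}
  }
};

const derived = applyFormulas(spec.data.values, spec.data.formula);
```

Because the encoding only sees a field name, a generator in another language could compute ratio beforehand and drop the formula entry entirely, which matches the motivation quoted from #631.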