mbauman / Signals.jl

An abandoned work-in-progress for a high level Signal type with a common timebase (in seconds) and groups of channels. Deprecated in favor of AxisArrays.jl.
github.com/JuliaArrays/AxisArrays.jl
Other
6 stars 0 forks source link

Axis arrays and generalization #12

Open mbauman opened 9 years ago

mbauman commented 9 years ago

Right now a Signal labels an array's innermost dimension with an axis. Currently that axis is called time, but it really could be anything. We can label multiple axes, but that requires nested Signals which can be difficult to reason about. Here's a motivating example: let's say you want to event-trigger average the spectral response of 32 EEG channels. With a very-slightly-more-generic Signal type that allows any axis, it'd look something like this:

32-channel Signal
    axis: 400-element Vector[1.2s, 5.3s, 8.0s, ...]
    data:
        400x32 Signal across 400 events and 32 channels
            axis: 21-element range (-1.0s:0.1s:1.0s)
            data:
                21x400x32 Signal across 21 time-bins, 400 events, 32 channels
                    axis: 20-element range (10Hz:10Hz:200Hz)
                    data: 20x21x400x32 Array{Float64, 4}

The "real" data are all the way down in the innermost signal, with each nested layer hiding one dimension. It's a Vector of Vectors of Vectors of Vectors.

This same structure could be described much more richly as just one non-nested 4-D Signal AxisArray object:

 20x21x400x32 AxisArray{Float64, 4, ...} # Behaves as a plain old 4-dimensional array
    axes: (10Hz:10Hz:200Hz, -1.0s:0.1s:1.0s, Vector[1.2s, 5.3s, 8.0s, ...], 1:32)
    axesnames: (:freqbins, :timebins, :events, :channels)
    data: 20x21x400x32 Array{Float64, 4}

This is much easier to reason about, and requires no recursion. Any multi-element indexing will return a sub-AxisArray. What we lose here is the ability to treat inner arrays as single objects. But that might be best expressed through an external interface, e.g., for chan in dimension(A, :channel). Of course, for it to be type-stable, it'd have to be dimension(A, Val{:channel}) and I couldn't rename the dimensions... both of which are a little unweildy.

What this gains us is the ability to have real 1-dimensional vectors of scalars that still know about their timebase or axis. It is generally just the wrapped array, with dimensions. I like the simplicity. It's also way more general -- you can still nest AxisArrays for a Ragged Array or somesuch, but you don't need to if your data are simple and continuous. It's also a lot easier to get to the 'escape hatch' and just get the real data array (A.data vs sig.data.data.data).

It also puts the labeled axes in the position to be directly indexed by an Interval type, so I could get windowed repetitions by doing something like A[:,:,Interval(-1s,1s) .+ event_times] or window(A, Interval(-1s,1s), event_times, :time)

It also makes storage and cross-language compatibility much simpler.


Now the behavior here could potentially be made more complicated to re-enable iteration over inner arrays, too.

32-channel AxisArray{Float64, (2,1), 1, ...} # A 1-dimensional vector of vectors of matrices
    axes: (10Hz:10Hz:200Hz, -1.0s:0.1s:1.0s, ... # exactly the same fields as above

What is interesting here is how it behaves upon indexing. Let's index that 32-channel AxisArray with [:] -- here's what we could get back:

400x32 AxisArray{Float64, (2,), 2} # Behaves as a matrix of matrices, 4 dimensions total

More generally, this could be described as a pivoting operation - pivot(A, inner_dimensions, outer_dimension) would simply rearrange the type parameters so it behaves as you want it. pivot(A, (1,1,1), 1) would return the above AxisArray to a nested group of 4 vectors.

Basically, each indexing operation would peel back the last element of the inner array dimension tuple.


Things to be considered:

Edit: made a clearer distinction between the currently-implemented Signals and the new AxisArray idea.

mbauman commented 9 years ago

@milktrader — are all your columns in TimeSeries always the same datatype? Or do you need heterogeneous aggregates like DataFrames? A TimeSeries could simply be an AxisArray with one ordered dimensional axis and one unordered non-dimensional axis:

AxisArray{Float64, 2, …}
    axes: (Date[…], [:open, :hi, :lo, :close])
    axisnames: (:date, :ohlc)
    data: Nx4 Float64 Matrix …
tshort commented 9 years ago

That feels a bit like pandas' hierarchical indexing:

http://stackoverflow.com/questions/23600582/concatenate-pandas-columns-under-new-multi-index-level

milktrader commented 9 years ago

The type is homogenous within the values element. They are your plain-old Julia array.

It's designed this way for speed.

Sometimes, a natural Int (stock volume, e.g.) is converted to a Float to conform to the array's type. I use a trick to display this as an Int, but underneath it's a Float. R does this too (display floats as ints where it's obvious) and I'm pretty sure this is also done in Pandas

Most of the time, the values of a TimeArray will be a Float64. There are other times though that it might be a Bool, particularly when trade signals are involved. Other times, it might be a String, such as in a blotter application where "long", "short", "flat" have an important meaning.

When heterogeneous types are needed, it's better to go to DataFrames.

Now, about this AxisArray. What is it? Do you have this in the Signals package? I'm intrigued by it.

A quick note about the choice of Symbol vs UTF8String for names. If I recall recently @johnmyleswhite expressed some reservation about DataFrames' choice to go this route. I've avoided it in TimeSeries because I'm not sure what the benefit is. It doesn't allow column names such as "Adjusted Close" because of the space between words.

milktrader commented 9 years ago

I should also note that the TimeArray inner constructor enforces invariants that are important in time series.

mbauman commented 9 years ago

Now, about this AxisArray. What is it? Do you have this in the Signals package? I'm intrigued by it.

Not yet. This is just brainstorming and spit-balling, mostly as notes to myself, so my apologies if anything is confusing. Please ask questions! The README shows actual output from what is currently implemented.

The basic idea in this issue would be to move to something like NamedArrays.jl, except that instead of having an unordered dictionary of row and column (etc) names, there'd be ordered, dimensional axes. Perhaps it could support both unordered (like column names) and ordered axes, which would allow it to be a superset of TimeArray functionality. In which case the Signal name may not be as appropriate, so I may rename or move to the name AxisArray… but this idea is still evolving so I haven't been consistent in how I name things here.

I would check a similar set of invariants, but a major difference would be that it would simply wrap and be the same shape as the underlying data array. Right now you can index your TimeSeries types just by column or just by date. This would explicitly separate the two, and you'd need to do something like ts[:, "close"]. What this enables, though, is the ability to do both: ts[today(), "open"], or even ts[Interval(Date(2013), Date(2014)), "close"] to grab the close values for all of 2013. It also allows extra dimensions: you could have a third dimension of stock symbols (like the SO question Tom pointed to), or the extra dimension could be repetitions of data (e.g., as a first step to finding average weekly or yearly performance). The goal here would be to use indexing with an Interval type to make this sort of operation as simple and as robust as possible: ts[Interval(Day(1), Day(5)) .+ recur(issunday, Date(2013), Date(2014)), "close"] would return an array of weekly performance data that you could then average or look for outliers. (Intervals aren't implemented yet — this sort of operation is currently called windowing or windowed repetitions).

I could see this kind of nested or multi-dimensional behavior as being very powerful for HFT. Making the comparison to the README where I have high-frequency "snippets" about low frequency events, you could represent intra-day prices as time "snippets" on a Date axis (imagine the sparklines in the README as intra-day prices, with the time labels as dates). The goal, then, would be to allow you to do some sort of event-triggered analysis looking at windowed repetitions of stock behavior with respect to some recurring event (like a press release from a competitor) on the scale of seconds or minutes.

My motivation for this is that while you market analysts have had at least some sort of support for date series types across many languages, I've never found a good toolbox to simplify working with high sampling rate electrophysiology recordings. I've always had to carry around a time vector and do all sorts of translation between time and indices. One key element for me is the ability to use FloatRanges to represent the time axes in seconds - many of my analyses depend upon constant sampling rates to enable a whole range of optimizations, but can also be performed on variable sampling rate (e.g., qinterp1 vs interp1). But there's not really any reason for me to limit this to just label axes with seconds (except to compensate for crummy SIUnits typing, but that's another issue).

That's a lot of text. But I hope it gives you some insight into the rationale and direction. My primary use is data of microvolts over time in seconds, but I'd love for this to be more generally useful.

tshort commented 9 years ago

+1 on the direction you're heading. Although most of my stuff is simple enough to fit into just a table, your approach sounds quite flexible. I can see uses for an extra dimension or two.

mbauman commented 9 years ago

@timholy - I'd be interested to hear your thoughts on this design direction, too. I know you've thought a lot about named dimensions and such for Images.jl.

TL/DR: I'm proposing an AxisArray type that knows about its dimension names as well as its spatial layout. One of your fancy 4-d time images could be represented here like this:

50x50x100x10000 AxisArray{RGB, 4, ...}
    axes:(0.1mm:0.1:5.0mm, 0.1mm:0.1:5.0mm, 0.1mm:0.1:10.0mm, 0.1ms:0.1ms:1000ms)
    dims:(:x, :y, :z, :time)
    data: 50x50x100x10000 Array{RGB, 4}

This would allow you to move more of the Images metadata into the array itself.

timholy commented 9 years ago

I like it. A long-stalled start on my own efforts is in NamedAxesArrays.jl, putting this metadata into the parameters (so you can dispatch on it). But I didn't include dimensional information.

milktrader commented 9 years ago

About the issue of putting metadata in the parameters, I've wondered how to implement this in TimeSeries for some time now

https://github.com/JuliaStats/TimeSeries.jl/issues/126

My only concern, and one I haven't yet tested, is that it would slow things down.

Also, any thoughts on not parameterizing the meta field?

@multidis

mbauman commented 9 years ago

Ah, wonderful. Thanks for the pointer to NamedAxesArrays, Tim. I think I'll take that as a starting point and start feeling out the interface. It's great to be able to start out with some of your excellent metaprogramming.

With regards to a meta field/slot, I don't think I'm going to worry about it (beyond the dimensional information), at least not in the first iteration. That seems to be something that's much more domain-specific — each domain's package would define a type that wraps an AxisArray and adds fields for whatever else they are interested in.

mbauman commented 9 years ago

Alright, I've gotten something started over at https://github.com/mbauman/AxisArrays.jl. Y'all are welcome to play along.

milktrader commented 9 years ago

Thanks @mbauman !

timholy commented 9 years ago

Awesome. I just haven't had time to push this forward, so it will be great to collaborate on this.

tshort commented 9 years ago

Great, Matt! Should we expect anything to work, yet?