queryverse / Query.jl

Query almost anything in julia

Conversion to and from tensors #138

Open davidanthoff opened 7 years ago

davidanthoff commented 7 years ago

@expandingman had this example on the forum:

Suppose I have a table that has a DateTime column, a String column and two Float64 columns. I may, for example, need to get this into the form of a Matrix{Float32}. The String may represent categorical data, so it may have to be mapped to integer designations, which may then be converted to floats (in some cases this will be further transformed into a "one hot" representation, but that can usually be achieved fairly easily within the machine learning framework itself). The DateTime might have to be converted to Float32s representing, for instance, the number of seconds past a reference time. After feeding this Matrix into some machine learning, I'll get back a Matrix that I'll need to append to the original dataset in some way. This gives a rough idea of the most basic problem. Things get way more complicated when you start doing stuff with time series and require rank-3 tensors, but even this most basic case often requires a surprising amount of manual work.
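For concreteness, the two column conversions described in the quote could look roughly like this in plain Julia. This is only a sketch: `convert_to_secs_since_ref` borrows the name used in the query further down this thread, while `category_codes` and the 2017-01-01 reference time are made up for illustration and are not part of Query.jl.

```julia
using Dates

# Hypothetical helper: seconds elapsed since a reference time, as Float32.
# The default reference time is an arbitrary choice for this sketch.
convert_to_secs_since_ref(t::DateTime; ref::DateTime = DateTime(2017, 1, 1)) =
    Float32(Dates.value(t - ref) / 1000)  # t - ref is a Millisecond period

# Hypothetical helper: map each distinct label to an integer code 1, 2, ...
# assigned in sorted order.
function category_codes(labels)
    levels = sort(unique(labels))
    return Dict(level => code for (code, level) in enumerate(levels))
end

labels = ["b", "a", "b"]
codes = category_codes(labels)
encoded = Float32[codes[l] for l in labels]  # categorical column as Float32
```

Note that Float32 only carries about seven significant digits, so a reference time close to the data keeps more of the sub-second resolution.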

Let's discuss potential solutions for this problem in this issue.

davidanthoff commented 7 years ago

Let me start with the question of how to construct a matrix. Here is one approach that could work:

@from i in source begin
    @select (convert_to_float32(i.string), convert_to_secs_since_ref(i.datetime), i.data1, i.data2)
    @collect Matrix
end

Essentially the idea would be that one selects a Tuple in the @select statement where all tuple elements have the same type. If one collects this into a Matrix, it would create a Matrix in which each tuple produced by the @select statement becomes one row.
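To make the intended result concrete, here is a rough sketch in plain Julia of what collecting such tuples into a matrix would amount to. `tuples_to_matrix` is a hypothetical helper written for this illustration; it is not an existing Query.jl function, and an actual sink for @collect might be implemented quite differently.

```julia
# Hypothetical helper: stack an iterable of equal-length tuples into a
# Matrix, one tuple per row. The element type is taken from the first tuple.
function tuples_to_matrix(tuples)
    v = collect(tuples)
    isempty(v) && error("need at least one row to determine the matrix size")
    T = eltype(first(v))
    m = Matrix{T}(undef, length(v), length(first(v)))
    for (i, row) in enumerate(v)
        for (j, x) in enumerate(row)
            m[i, j] = x
        end
    end
    return m
end

m = tuples_to_matrix([(1f0, 2f0), (3f0, 4f0)])
```

Since Julia matrices are stored column-major, filling row by row is not the fastest layout, but it matches the "one tuple per row" reading of the query.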

The other direction, querying a matrix, is a little less clear. One approach might be to have a function rows that takes a matrix and then iterates tuples where each tuple is one row of the matrix...
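A minimal sketch of such a `rows` function, assuming plain Base Julia and nothing from Query.jl, could be:

```julia
# Hypothetical `rows`: lazily iterate a matrix as tuples, one tuple per row.
rows(m::AbstractMatrix) = (Tuple(m[i, :]) for i in axes(m, 1))

m = Float32[1 2; 3 4]
row_tuples = collect(rows(m))
```

The tuple length is only known at run time here, so this version is not type-stable; it is meant to show the shape of the idea rather than an efficient implementation.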

I haven't thought about how any of this would generalize to higher-order arrays...

ExpandingMan commented 7 years ago

That seems reasonable.

As for the other direction, it is usually easier, but sometimes there is some difficulty in lining up rows. Occasionally one needs to do some sort of inverse transformation of categorical values, but it's certainly hard to solve this in the general case.
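For the simple case where the categorical codes are just indices into a known list of levels, the inverse transformation might be sketched like this. The names `levels`, `encode` and `decode` are made up for this example, and it assumes the model output is close enough to an integer code for rounding to be meaningful.

```julia
# Hypothetical level list: the position of a label is its integer code.
levels = ["a", "b", "c"]

# Label to Float32 code; errors if the label is not in `levels`.
encode(label) = Float32(findfirst(==(label), levels))

# Float32 code back to the label, rounding to the nearest integer code.
decode(x) = levels[round(Int, x)]

decoded = decode.(Float32[3, 1])
```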

I think focusing on matrices for now seems perfectly reasonable. Indeed, in cases where one needs a higher rank tensor, one must also make many other decisions about its layout. You can see an older example of my efforts to solve this problem here. That worked OK for that particular case on smaller datasets. The matrix case is also very common: pretty much everyone who is doing machine learning will need this. How often this will be the case for higher rank tensors is less clear.

Thanks for the effort!