queryverse / IterableTables.jl

Implementations of the TableTraits.jl interface for various packages

Add support for NamedArray #65

Open scls19fr opened 6 years ago

scls19fr commented 6 years ago

Hello,

I'm using FreqTables.jl freqtable. This function outputs NamedArray objects. Maybe it could be a good idea to add support for NamedArrays into IterableTables.

Kind regards

davidanthoff commented 6 years ago

Do these have a table structure, though? I'm just not sure...

scls19fr commented 6 years ago

If the dimension is no more than 2, I think so. If the dimension is greater, then an error should be raised. Your comment is very interesting, as it shows that there is room for a project (comparable to yours) that could deal with converting n-dimensional data structures such as NamedArray or IndexedTables.

davidanthoff commented 6 years ago

Interesting, interesting... I think the general pattern here might be some way to get rows out of a matrix, both a named and a normal matrix. For Query.jl integration it would be nice if these rows were named tuples and tuples respectively. But I'm not sure that is in general a good idea, I'm not sure whether very large tuples (if there are very many rows) work well...

It would have to be a special function, in any case, something like rows. I'll have to think a bit more about that scenario.
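A rows function of the kind described here could be sketched roughly as follows. This is only a minimal sketch in plain Julia 1.x; the name named_rows and its colnames argument are hypothetical and are not part of IterableTables, Query.jl or NamedArrays.

```julia
# Hypothetical helper, not part of any package discussed in this thread.
# Lazily yields one NamedTuple per matrix row; fields are named by `colnames`.
function named_rows(m::AbstractMatrix, colnames::Vector{Symbol})
    size(m, 2) == length(colnames) ||
        throw(ArgumentError("one name per column is required"))
    fields = Tuple(colnames)
    # Each row is copied into a tuple and wrapped with the column names.
    return (NamedTuple{fields}(Tuple(m[i, :])) for i in axes(m, 1))
end

m = [1 2 3; 4 5 6]
rows = collect(named_rows(m, [:a, :b, :c]))
```

Because the field names are part of the NamedTuple type, a matrix with very many columns yields a very large tuple type, which is the kind of case the concern above is about.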

scls19fr commented 6 years ago

See https://github.com/davidavdav/NamedArrays.jl/issues/55 about implementing an Iterator of NamedTuples for NamedArrays

scls19fr commented 6 years ago

Pinging @davidavdav

davidavdav commented 6 years ago

You rang?

I am a bit out of context, I've tried to read up to the references, but it is not clear to me what an iterator of named tuples would do with a NamedArray. Do you want to be able to iterate over a dimension on a NamedArray, thereby getting a tuple of the name and the array slice of one dimension lower, and do you want this to be of type Iterator for NamedTuples?

scls19fr commented 6 years ago

I think we should first tackle NamedArrays of dimension 2 and see how such a NamedArray can be converted to a DataFrame (or any table-like data structure that IterableTables supports as a sink). IterableTables can easily use as a source any iterator that produces elements of type NamedTuple.

The problem of dimensions greater than 2 can't be handled in this project (I think... or at least not for now).

nalimilan commented 6 years ago

In R, tables of arbitrary dimensionality can be converted to data frames. Each dimension gets transformed into a column, and an additional column holds the values of the array entries.

davidanthoff commented 6 years ago

@nalimilan so all the columns except the last one would hold the indices of that respective dimension? For example, this matrix:

1  2  3
4  5  6
7  8  9

Would be transformed into this table:

Dim1 Dim2 Value
1 1 1
1 2 2
1 3 3
2 1 4
2 2 5
2 3 6
3 1 7
3 2 8
3 3 9

? I think that might be a really good general solution. It would still be nice if there was some easy way to handle a matrix differently, i.e. keep the table structure of the matrix, but one that could be done by say a row function.

I guess this also somehow interacts with the question of how associatives are handled. This would essentially treat an array as an associative, with the dimension indices as the key. I didn't follow that debate in detail, though...

nalimilan commented 6 years ago

Yes, that's it, though it's even clearer when there are dimension names rather than indices.
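The conversion described above (one column per dimension plus a value column) can be sketched in plain Julia 1.x. The helper long_rows and its dims argument are hypothetical names used for illustration only; this is not how R or any of the packages here implement it.

```julia
# Hypothetical helper: flatten an N-dimensional array into "long" rows,
# one NamedTuple per entry, with one field per dimension plus :Value.
function long_rows(a::AbstractArray{T,N}, dims::NTuple{N,Symbol}) where {T,N}
    fields = (dims..., :Value)
    # CartesianIndices walks the array in column-major order.
    return [NamedTuple{fields}((Tuple(I)..., a[I])) for I in vec(CartesianIndices(a))]
end

a = [1 2 3; 4 5 6; 7 8 9]
rows = long_rows(a, (:Dim1, :Dim2))

# Sorting by the dimension columns reproduces the row order of the table above.
sorted = sort(rows; by = r -> (r.Dim1, r.Dim2))
```

With dimension names instead of integer indices, the tuple of indices would be replaced by the corresponding names of each dimension.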

davidavdav commented 6 years ago

It looks like an R-style NamedArray -> DataFrame export would generally be beneficial.

scls19fr commented 6 years ago

If we can have NamedArray export to IndexedTables (and vice versa), it would be great. Adding support for NamedArray to IterableTables would help to achieve this goal.

davidavdav commented 6 years ago

So you would want to treat 0s in the NamedArray specially for IndexedTables, i.e., leave these entries out? That makes sense for FreqTable output, but for a NamedArray 0 is type-specific and otherwise not very special.

nalimilan commented 6 years ago

I agree zeros should not be dropped.

davidavdav commented 6 years ago

How would this do for you?

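using IndexedTables
using NamedArrays

import IndexedTables.IndexedTable

# Convert a NamedArray of any dimension into an IndexedTable with one index
# column per dimension and the array entries as data.
function IndexedTable(n::NamedArray)
    L = length(n) # elements in array
    # Note: a Dict does not preserve insertion order, so the column order
    # may differ from the dimension order.
    cols = Dict{Symbol, Array}()
    factor = 1 # product of the sizes of the dimensions already processed
    for d in 1:ndims(n)
        nlevels = size(n, d)
        nrep = L ÷ (nlevels * factor)
        # repeat each name `factor` times, then tile that block `nrep` times
        data = repmat(vcat([fill(x, factor) for x in names(n, d)]...), nrep)
        cols[Symbol(dimnames(n, d))] = data
        factor *= nlevels
    end
    return IndexedTable(Columns(;cols...), array(n)[:])
end

scls19fr commented 6 years ago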

Both behaviours could be considered when converting a NamedArray to IndexedTables: 1) treat "0" (or any other value) as a special value that doesn't need to be reported to the IndexedTable (because this kind of data structure was specially designed to deal with sparse data); 2) treat all values in the same way, which should also be a possibility (maybe the default one).

davidavdav commented 6 years ago

The simple implementation above would not be very memory-efficient for very large and very sparse tables if we filter out the 0s afterwards. Anyway, the repmat(vcat([fill(...)]...)) construction is probably not the most efficient.
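One possible alternative for building the repeated label columns is Base's repeat with its inner and outer keywords. The sketch below is plain Julia 1.x; label_columns is a hypothetical name, and no claim is made here about how much faster or leaner it is in practice.

```julia
# Hypothetical helper: for label vectors l1, l2, ... build the columns that
# pair with a column-major flattening of an array of size
# (length(l1), length(l2), ...).
function label_columns(labels::AbstractVector...)
    total = prod(length, labels)
    cols = Vector{Vector}()
    inner = 1 # product of the lengths of the dimensions already processed
    for l in labels
        outer = total ÷ (length(l) * inner)
        # repeat each label `inner` times, then tile the result `outer` times
        push!(cols, repeat(l; inner = inner, outer = outer))
        inner *= length(l)
    end
    return cols
end

cols = label_columns(["a", "b"], ["x", "y", "z"])
```

The first dimension varies fastest, so its labels are tiled, while the labels of later dimensions are repeated in blocks.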

scls19fr commented 6 years ago

To filter out some values, we should probably accept an anonymous function instead of a fixed value such as "0":

x -> x == 0

davidanthoff commented 6 years ago

There are two issues here, right? How to convert something to an IndexedTable, and how to convert something to just any table. Only the latter interacts with iterable tables at this point.

scls19fr commented 6 years ago

Not sure if there are really two issues here, in fact... Your comment https://github.com/davidanthoff/IterableTables.jl/issues/65#issuecomment-345538256 shows that, if we extend your example to a named array of 3 dimensions (or more), you can still transform it into a two-dimensional array and so into a table (and also into IndexedTables, as it's a sink that is currently supported by IterableTables).

scls19fr commented 6 years ago

On the other hand... there is an issue with IndexedTables output from IterableTables: values cannot be filtered out to keep the sparse feature of IndexedTables.

julia> using IterableTables

julia> using IndexedTables

julia> a=[0 0 1 0;2 0 3 0;0 0 5 0;2 0 0 1]
4×4 Array{Int64,2}:
 0  0  1  0
 2  0  3  0
 0  0  5  0
 2  0  0  1

julia> IndexedTable(a)
─────┬──
1  1 │ 0
1  2 │ 0
1  3 │ 1
1  4 │ 0
2  1 │ 2
2  2 │ 0
2  3 │ 3
2  4 │ 0
3  1 │ 0
3  2 │ 0
3  3 │ 5
3  4 │ 0
4  1 │ 2
4  2 │ 0
4  3 │ 0
4  4 │ 1

We could expect an API like

julia> IndexedTable(a, x -> x == 0)
─────┬──
1  3 │ 1
2  1 │ 2
2  3 │ 3
3  3 │ 5
4  1 │ 2
4  4 │ 1

if we want the anonymous function to select the values to filter out, or

julia> IndexedTable(a, x -> x != 0)

if we want the anonymous function to define which values we want to keep.

So in this case... this is clearly another issue.

Issue opened at https://github.com/JuliaComputing/IndexedTables.jl/issues/91
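The "keep" variant of the predicate idea can be illustrated without IndexedTables at all. The following is only a sketch in plain Julia 1.x: sparse_entries is a hypothetical helper, not an existing or proposed IndexedTables function, and it returns (index, value) pairs rather than a table.

```julia
# Hypothetical helper: collect (index tuple, value) pairs for the entries
# of `a` for which `keep` returns true.
function sparse_entries(keep, a::AbstractArray)
    return [(Tuple(I), a[I]) for I in CartesianIndices(a) if keep(a[I])]
end

a = [0 0 1 0; 2 0 3 0; 0 0 5 0; 2 0 0 1]
entries = sparse_entries(x -> x != 0, a)

# Entries come out in column-major order; sorting by index gives the
# row-by-row order used in the output above.
sorted_entries = sort(entries)
```

Filtering while iterating avoids materialising the zero entries first, which is the memory concern raised earlier in the thread.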

scls19fr commented 6 years ago

Thanks to @davidavdav's commit https://github.com/davidavdav/NamedArrays.jl/commit/5b8205f35198974c3597d46e69acc3538c549477 a NamedArray of any dimension can now be flattened (returning a flattened NamedArray), as described in https://github.com/davidanthoff/IterableTables.jl/issues/65#issuecomment-345538256

julia> using NamedArrays
julia> srand(1234);

julia> n=NamedArray(rand(2,4,3))
2×4×3 Named Array{Float64,3}

[:, :, C=1] =
A ╲ B │        1         2         3         4
──────┼───────────────────────────────────────
1     │ 0.590845  0.566237  0.794026  0.200586
2     │ 0.766797  0.460085  0.854147  0.298614

[:, :, C=2] =
A ╲ B │         1          2          3          4
──────┼───────────────────────────────────────────
1     │  0.246837   0.648882   0.066423   0.646691
2     │  0.579672  0.0109059   0.956753   0.112486

[:, :, C=3] =
A ╲ B │         1          2          3          4
──────┼───────────────────────────────────────────
1     │  0.276021  0.0566425   0.950498   0.945775
2     │  0.651664   0.842714    0.96467   0.789904

julia> n[:]
24-element Named Array{Float64,1}
(:A, :B, :C)    │
────────────────┼──────────
("1", "1", "1") │  0.590845
("2", "1", "1") │  0.766797
("1", "2", "1") │  0.566237
("2", "2", "1") │  0.460085
("1", "3", "1") │  0.794026
("2", "3", "1") │  0.854147
("1", "4", "1") │  0.200586
("2", "4", "1") │  0.298614
("1", "1", "2") │  0.246837
("2", "1", "2") │  0.579672
("1", "2", "2") │  0.648882
("2", "2", "2") │ 0.0109059
("1", "3", "2") │  0.066423
("2", "3", "2") │  0.956753
("1", "4", "2") │  0.646691
("2", "4", "2") │  0.112486
("1", "1", "3") │  0.276021
("2", "1", "3") │  0.651664
("1", "2", "3") │ 0.0566425
("2", "2", "3") │  0.842714
("1", "3", "3") │  0.950498
("2", "3", "3") │   0.96467
("1", "4", "3") │  0.945775
("2", "4", "3") │  0.789904