nalimilan / FreqTables.jl

Frequency tables in Julia
Other
88 stars 19 forks source link

convenience function to handle simple dataframes of uniform type #12

Closed cjprybol closed 7 years ago

cjprybol commented 7 years ago

Hi Milan,

This pull request would enable FreqTables to work on a DataFrame where multiple columns are of the same type, and the user would like to tabulate the frequency of each unique value among the columns.

Current behavior

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.1-pre+3 (2016-10-27 05:50 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 76c93bf (3 days old release-0.5)
|__/                   |  x86_64-apple-darwin16.1.0

julia> using DataFrames

julia> using FreqTables

julia> using StatsBase

julia> srand(1);

julia> s1 = sample(1:3, 30);

julia> s2 = sample(1:3, 30);

julia> s3 = sample(1:3, WeightVec([.9, .05, .05]), 30);

julia> data = DataFrame(s1 = s1, s2 = s2, s3 = s3)
30×3 DataFrames.DataFrame
│ Row │ s1 │ s2 │ s3 │
├─────┼────┼────┼────┤
│ 1   │ 1  │ 1  │ 2  │
│ 2   │ 2  │ 1  │ 1  │
│ 3   │ 3  │ 1  │ 1  │
│ 4   │ 3  │ 2  │ 1  │
│ 5   │ 3  │ 2  │ 1  │
│ 6   │ 3  │ 2  │ 1  │
│ 7   │ 1  │ 1  │ 2  │
│ 8   │ 3  │ 1  │ 1  │
│ 9   │ 3  │ 1  │ 1  │
│ 10  │ 1  │ 1  │ 1  │
│ 11  │ 1  │ 2  │ 3  │
│ 12  │ 1  │ 1  │ 1  │
│ 13  │ 2  │ 3  │ 1  │
│ 14  │ 1  │ 2  │ 1  │
│ 15  │ 3  │ 1  │ 1  │
│ 16  │ 2  │ 3  │ 1  │
│ 17  │ 3  │ 1  │ 1  │
│ 18  │ 2  │ 1  │ 1  │
│ 19  │ 3  │ 1  │ 1  │
│ 20  │ 3  │ 3  │ 1  │
│ 21  │ 2  │ 3  │ 1  │
│ 22  │ 2  │ 3  │ 1  │
│ 23  │ 2  │ 1  │ 3  │
│ 24  │ 3  │ 2  │ 1  │
│ 25  │ 3  │ 1  │ 1  │
│ 26  │ 2  │ 1  │ 1  │
│ 27  │ 2  │ 1  │ 1  │
│ 28  │ 3  │ 3  │ 2  │
│ 29  │ 1  │ 1  │ 1  │
│ 30  │ 2  │ 2  │ 3  │

julia> freqtable(data)
ERROR: MethodError: no method matching colon(::Int64, ::TypeVar)
Closest candidates are:
colon{T<:Real}(::T<:Real, ::Any, ::T<:Real) at range.jl:114
colon{A<:Real,C<:Real}(::A<:Real, ::Any, ::C<:Real) at range.jl:112
colon{T}(::T, ::Any, ::T) at range.jl:118
...
in _freqtable(::Tuple{}, ::Bool) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:90
in freqtable() at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:151
in #freqtable#15(::Array{Any,1}, ::Function, ::DataFrames.DataFrame) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:154
in freqtable(::DataFrames.DataFrame) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:154

Proposed behavior

julia> data = DataFrame(s1 = s1, s2 = s2, s3 = s3)
30×3 DataFrames.DataFrame
│ Row │ s1 │ s2 │ s3 │
├─────┼────┼────┼────┤
│ 1   │ 1  │ 1  │ 2  │
│ 2   │ 2  │ 1  │ 1  │
│ 3   │ 3  │ 1  │ 1  │
│ 4   │ 3  │ 2  │ 1  │
│ 5   │ 3  │ 2  │ 1  │
│ 6   │ 3  │ 2  │ 1  │
│ 7   │ 1  │ 1  │ 2  │
│ 8   │ 3  │ 1  │ 1  │
│ 9   │ 3  │ 1  │ 1  │
│ 10  │ 1  │ 1  │ 1  │
│ 11  │ 1  │ 2  │ 3  │
│ 12  │ 1  │ 1  │ 1  │
│ 13  │ 2  │ 3  │ 1  │
│ 14  │ 1  │ 2  │ 1  │
│ 15  │ 3  │ 1  │ 1  │
│ 16  │ 2  │ 3  │ 1  │
│ 17  │ 3  │ 1  │ 1  │
│ 18  │ 2  │ 1  │ 1  │
│ 19  │ 3  │ 1  │ 1  │
│ 20  │ 3  │ 3  │ 1  │
│ 21  │ 2  │ 3  │ 1  │
│ 22  │ 2  │ 3  │ 1  │
│ 23  │ 2  │ 1  │ 3  │
│ 24  │ 3  │ 2  │ 1  │
│ 25  │ 3  │ 1  │ 1  │
│ 26  │ 2  │ 1  │ 1  │
│ 27  │ 2  │ 1  │ 1  │
│ 28  │ 3  │ 3  │ 2  │
│ 29  │ 1  │ 1  │ 1  │
│ 30  │ 2  │ 2  │ 3  │

julia> freqtable(data)
3×3 Named Array{Int64,2}
value ╲ column │ s1  s2  s3
───────────────┼───────────
1              │  7  17  24
2              │ 10   7   3
3              │ 13   6   3

I'm not sure how common this use case would come up for others, but it came up for me, so I thought I'd see if you were interested in it. I've written a test for this and added an example to the readme.

Worth mentioning, this fails when using a DataFrame of mixed types. In this example, mixed strings and Floats. If you have any suggestions on how to handle this case, I would be interested to hear them!

julia> using RDatasets

julia> iris = dataset("datasets", "iris");

julia> data = iris[[:Species, :PetalWidth]];

julia> head(data)
6×2 DataFrames.DataFrame
│ Row │ Species  │ PetalWidth │
├─────┼──────────┼────────────┤
│ 1   │ "setosa" │ 0.2        │
│ 2   │ "setosa" │ 0.2        │
│ 3   │ "setosa" │ 0.2        │
│ 4   │ "setosa" │ 0.2        │
│ 5   │ "setosa" │ 0.2        │
│ 6   │ "setosa" │ 0.4        │

julia> tail(data)
6×2 DataFrames.DataFrame
│ Row │ Species     │ PetalWidth │
├─────┼─────────────┼────────────┤
│ 1   │ "virginica" │ 2.5        │
│ 2   │ "virginica" │ 2.3        │
│ 3   │ "virginica" │ 1.9        │
│ 4   │ "virginica" │ 2.0        │
│ 5   │ "virginica" │ 2.3        │
│ 6   │ "virginica" │ 1.8        │

julia> freqtable(data)
ERROR: ArgumentError: invalid index: 0.1
 in setindex!(::NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{DataStructures.OrderedDict{Any,Int64},DataStructures.OrderedDict{Symbol,Int64}}}, ::Int64, ::Float64, ::Symbol) at /Users/Cameron/.julia/v0.5/NamedArrays/src/index.jl:196
 in _freqtable(::Tuple{DataArrays.PooledDataArray{Any,UInt8,1},Array{Symbol,1}}, ::FreqTables.UnitWeights, ::Void) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:74
 in #freqtable#5(::FreqTables.UnitWeights, ::Void, ::Function, ::DataArrays.PooledDataArray{Any,UInt8,1}, ::Vararg{AbstractArray{T,1},N}) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:80
 in freqtable(::DataArrays.PooledDataArray{Any,UInt8,1}, ::Array{Symbol,1}, ::Vararg{Array{Symbol,1},N}) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:80
 in #freqtable#15(::Array{Any,1}, ::Function, ::DataFrames.DataFrame, ::Symbol, ::Vararg{Symbol,N}) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:154
 in freqtable(::DataFrames.DataFrame) at /Users/Cameron/.julia/v0.5/FreqTables/src/freqtable.jl:161
coveralls commented 7 years ago

Coverage Status

Coverage decreased (-2.5%) to 96.154% when pulling 1ada4bd8194ab7c6f3fe649210ae794c924565d7 on cjprybol:master into cec61ee2d49e5c7ef558cf8ad055c336d0b29ad0 on nalimilan:master.

nalimilan commented 7 years ago

Thanks for the PR, but I don't think this belongs as a method for freqtable. It's very different from other frequency tables, where variables are crossed. So I'd rather add a separate function. Could you have a look at other statistical packages to see whether they provide such a function and if so how it's called?

Regarding the failure with mixed types, the best behaviour would be to use promote/promote_type to choose the best type for the names. Maybe that should be done automatically by NamedArrays, not sure. In the backtrace you show, names were of type Int, which explains the failure.

cjprybol commented 7 years ago

I couldn't find anything that provided this functionality, although counts and countmap seem to be the closest conceptually. DataFrames has a function colwise, so maybe colwisecounts would be a better name for this? Not a very creative name, but it would be easy enough for users to guess what the function should do.

julia> df
12×3 DataFrames.DataFrame
│ Row │ s1 │ s2 │ s3 │
├─────┼────┼────┼────┤
│ 1   │ 1  │ 1  │ 9  │
│ 2   │ 1  │ 2  │ 9  │
│ 3   │ 1  │ 3  │ 9  │
│ 4   │ 1  │ 1  │ 9  │
│ 5   │ 2  │ 2  │ 9  │
│ 6   │ 2  │ 3  │ 9  │
│ 7   │ 2  │ 1  │ 9  │
│ 8   │ 2  │ 2  │ 9  │
│ 9   │ 3  │ 3  │ 9  │
│ 10  │ 3  │ 1  │ 9  │
│ 11  │ 3  │ 2  │ 9  │
│ 12  │ 3  │ 3  │ 9  │

julia> colwisecounts(df)
4×3 Named Array{Int64,2}
value ╲ column │ s1  s2  s3
───────────────┼───────────
1              │  4   4   0
2              │  4   4   0
3              │  4   4   0
9              │  0   0  12

I couldn't find a way to do this with arrays either, so we could do rowwise and colwise for arrays as well

julia> colwisecounts(a)
4×3 Named Array{Int64,2}
value ╲ column │  1   2   3
───────────────┼───────────
1              │  4   4   0
2              │  4   4   0
3              │  4   4   0
9              │  0   0  12

julia> rowwisecounts(a)
12×4 Named Array{Int64,2}
row ╲ value │ 1  2  3  9
────────────┼───────────
1           │ 2  0  0  1
2           │ 1  1  0  1
3           │ 1  0  1  1
4           │ 2  0  0  1
5           │ 0  2  0  1
6           │ 0  1  1  1
7           │ 1  1  0  1
8           │ 0  2  0  1
9           │ 0  0  2  1
10          │ 1  0  1  1
11          │ 0  1  1  1
12          │ 0  0  2  1
nalimilan commented 7 years ago

Sorry, by "statistical packages" I meant other major languages/environments.

cjprybol commented 7 years ago

I can't find any single functions that do this in R, Python, or SAS. As far as I can tell, R and Python require the data frame to be stacked and then passed to table and pandas.crosstab in the same way that I've done here. In SAS it looks like this can be done by writing a custom call to proc tabulate.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 98.889% when pulling 43fdaf6e80241b408880317bd03091247fe02990 on cjprybol:master into cec61ee2d49e5c7ef558cf8ad055c336d0b29ad0 on nalimilan:master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 98.889% when pulling 43fdaf6e80241b408880317bd03091247fe02990 on cjprybol:master into cec61ee2d49e5c7ef558cf8ad055c336d0b29ad0 on nalimilan:master.

cjprybol commented 7 years ago

arrays

julia> data
12×4 Array{Symbol,2}:
 :a  :a  :a  :d
 :a  :b  :a  :d
 :a  :c  :a  :d
 :a  :a  :a  :d
 :b  :b  :a  :d
 :b  :c  :a  :d
 :b  :a  :a  :d
 :b  :b  :a  :d
 :c  :c  :a  :d
 :c  :a  :a  :d
 :c  :b  :a  :d
 :c  :c  :a  :d

julia> @test colwisecounts(data) == NamedArray(a, (rows, columns), ("value", "column"))
Test Passed
  Expression: colwisecounts(data) == NamedArray(a,(rows,columns),("value","column"))
   Evaluated: 4×4 Named Array{Int64,2}
value ╲ column │  1   2   3   4
───────────────┼───────────────
a              │  4   4  12   0
b              │  4   4   0   0
c              │  4   4   0   0
d              │  0   0   0  12 == 4×4 Named Array{Int64,2}
value ╲ column │  1   2   3   4
───────────────┼───────────────
a              │  4   4  12   0
b              │  4   4   0   0
c              │  4   4   0   0
d              │  0   0   0  12

julia> @test rowwisecounts(data) == NamedArray(a, (rows, columns), ("row", "value"))
Test Passed
  Expression: rowwisecounts(data) == NamedArray(a,(rows,columns),("row","value"))
   Evaluated: 12×4 Named Array{Int64,2}
row ╲ value │ a  b  c  d
────────────┼───────────
1           │ 3  0  0  1
2           │ 2  1  0  1
3           │ 2  0  1  1
4           │ 3  0  0  1
5           │ 1  2  0  1
6           │ 1  1  1  1
7           │ 2  1  0  1
8           │ 1  2  0  1
9           │ 1  0  2  1
10          │ 2  0  1  1
11          │ 1  1  1  1
12          │ 1  0  2  1 == 12×4 Named Array{Int64,2}
row ╲ value │ a  b  c  d
────────────┼───────────
1           │ 3  0  0  1
2           │ 2  1  0  1
3           │ 2  0  1  1
4           │ 3  0  0  1
5           │ 1  2  0  1
6           │ 1  1  1  1
7           │ 2  1  0  1
8           │ 1  2  0  1
9           │ 1  0  2  1
10          │ 2  0  1  1
11          │ 1  1  1  1
12          │ 1  0  2  1

dataframes

julia> data
12×4 DataFrames.DataFrame
│ Row │ sample1 │ sample2 │ sample3 │ sample4 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ a       │ a       │ a       │ d       │
│ 2   │ a       │ b       │ a       │ d       │
│ 3   │ a       │ c       │ a       │ d       │
│ 4   │ a       │ a       │ a       │ d       │
│ 5   │ b       │ b       │ a       │ d       │
│ 6   │ b       │ c       │ a       │ d       │
│ 7   │ b       │ a       │ a       │ d       │
│ 8   │ b       │ b       │ a       │ d       │
│ 9   │ c       │ c       │ a       │ d       │
│ 10  │ c       │ a       │ a       │ d       │
│ 11  │ c       │ b       │ a       │ d       │
│ 12  │ c       │ c       │ a       │ d       │

julia> @test colwisecounts(data) == NamedArray(a, (rows, columns), ("value", "column"))
Test Passed
  Expression: colwisecounts(data) == NamedArray(a,(rows,columns),("value","column"))
   Evaluated: 4×4 Named Array{Int64,2}
value ╲ column │ sample1  sample2  sample3  sample4
───────────────┼───────────────────────────────────
a              │       4        4       12        0
b              │       4        4        0        0
c              │       4        4        0        0
d              │       0        0        0       12 == 4×4 Named Array{Int64,2}
value ╲ column │ sample1  sample2  sample3  sample4
───────────────┼───────────────────────────────────
a              │       4        4       12        0
b              │       4        4        0        0
c              │       4        4        0        0
d              │       0        0        0       12

julia> @test rowwisecounts(data) == NamedArray(a, (rows, columns), ("row", "value"))
Test Passed
  Expression: rowwisecounts(data) == NamedArray(a,(rows,columns),("row","value"))
   Evaluated: 12×4 Named Array{Int64,2}
row ╲ value │ a  b  c  d
────────────┼───────────
1           │ 3  0  0  1
2           │ 2  1  0  1
3           │ 2  0  1  1
4           │ 3  0  0  1
5           │ 1  2  0  1
6           │ 1  1  1  1
7           │ 2  1  0  1
8           │ 1  2  0  1
9           │ 1  0  2  1
10          │ 2  0  1  1
11          │ 1  1  1  1
12          │ 1  0  2  1 == 12×4 Named Array{Int64,2}
row ╲ value │ a  b  c  d
────────────┼───────────
1           │ 3  0  0  1
2           │ 2  1  0  1
3           │ 2  0  1  1
4           │ 3  0  0  1
5           │ 1  2  0  1
6           │ 1  1  1  1
7           │ 2  1  0  1
8           │ 1  2  0  1
9           │ 1  0  2  1
10          │ 2  0  1  1
11          │ 1  1  1  1
12          │ 1  0  2  1

This satisfies the behavior I was looking for. Any thoughts on names?

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 98.889% when pulling 19d93b2eb6ecb7ff7bd3cd363f697051b2e5e4ff on cjprybol:master into cec61ee2d49e5c7ef558cf8ad055c336d0b29ad0 on nalimilan:master.

nalimilan commented 7 years ago

Ah, too bad. Actually, the distinctive feature of this function is not that it's columnwise, since freqtable acts the same: it's that it puts one-way tables side by side instead of computing a cross table. So I think it would make sense to add a keyword argument cross=true to freqtable, which when set to false would give the behavior you want. For consistency with the current freqtable behaviour, it would accept the same inputs: vectors, or a dataframe + column names. When only a data frame is passed, all columns would be used; this behaviour could also be added when cross=true later (with the result of crossing all variables).

Please implement this internally by calling freqtable on each variable to build the one-way tables, and merging them only in the end. That will avoid code duplication. I've put a lot of work to make these generic and efficient, and I wouldn't want to maintain two parallel code bases. For clarity, these can of course be separated into two functions internally.

nalimilan commented 7 years ago

Do you think this isn't needed after all, or is it just that you have other priorities?

cjprybol commented 7 years ago

Both. I don't think it's very important relative to getting everything working with Nullables and the reason I wanted this functionality is no longer relevant (I wanted to translate course material from R to Julia, but I'm not in that class anymore). If I need it again I'll re-open the pull request and finish it off