Closed cjprybol closed 7 years ago
Thanks for the PR, but I don't think this belongs as a method for `freqtable`. It's very different from other frequency tables, where variables are crossed, so I'd rather add a separate function. Could you have a look at other statistical packages to see whether they provide such a function and, if so, what it's called?
Regarding the failure with mixed types, the best behaviour would be to use `promote`/`promote_type` to choose the best type for the names. Maybe that should be done automatically by NamedArrays, not sure. In the backtrace you show, names were of type `Int`, which explains the failure.
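To make the `promote_type` suggestion concrete, here is a minimal Base-only sketch. The helper `common_eltype` is a hypothetical name used for illustration; it is not part of NamedArrays or FreqTables.

```julia
# Base-only illustration; no packages required.
numeric = promote_type(Int, Float64)     # Float64: a common numeric type exists
mixed   = promote_type(String, Float64)  # Any: no common concrete type

# Hypothetical helper: pick one element type for several vectors of names.
common_eltype(vs...) = promote_type(map(eltype, vs)...)

println(common_eltype([1, 2], [1.5]))    # Float64
println(common_eltype(["a"], [1.5]))     # Any
```

With mixed strings and floats the promoted type falls back to `Any`, so names would have to be stored in a container with that element type rather than one typed as `Int`.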
I couldn't find anything that provides this functionality, although `counts` and `countmap` seem to be the closest conceptually. DataFrames has a function `colwise`, so maybe `colwisecounts` would be a better name for this? Not a very creative name, but it would be easy enough for users to guess what the function does.
julia> df
12×3 DataFrames.DataFrame
│ Row │ s1 │ s2 │ s3 │
├─────┼────┼────┼────┤
│ 1 │ 1 │ 1 │ 9 │
│ 2 │ 1 │ 2 │ 9 │
│ 3 │ 1 │ 3 │ 9 │
│ 4 │ 1 │ 1 │ 9 │
│ 5 │ 2 │ 2 │ 9 │
│ 6 │ 2 │ 3 │ 9 │
│ 7 │ 2 │ 1 │ 9 │
│ 8 │ 2 │ 2 │ 9 │
│ 9 │ 3 │ 3 │ 9 │
│ 10 │ 3 │ 1 │ 9 │
│ 11 │ 3 │ 2 │ 9 │
│ 12 │ 3 │ 3 │ 9 │
julia> colwisecounts(df)
4×3 Named Array{Int64,2}
value ╲ column │ s1 s2 s3
───────────────┼───────────
1 │ 4 4 0
2 │ 4 4 0
3 │ 4 4 0
9 │ 0 0 12
I couldn't find a way to do this with arrays either, so we could provide rowwise and colwise versions for arrays as well:
julia> colwisecounts(a)
4×3 Named Array{Int64,2}
value ╲ column │ 1 2 3
───────────────┼───────────
1 │ 4 4 0
2 │ 4 4 0
3 │ 4 4 0
9 │ 0 0 12
julia> rowwisecounts(a)
12×4 Named Array{Int64,2}
row ╲ value │ 1 2 3 9
────────────┼───────────
1 │ 2 0 0 1
2 │ 1 1 0 1
3 │ 1 0 1 1
4 │ 2 0 0 1
5 │ 0 2 0 1
6 │ 0 1 1 1
7 │ 1 1 0 1
8 │ 0 2 0 1
9 │ 0 0 2 1
10 │ 1 0 1 1
11 │ 0 1 1 1
12 │ 0 0 2 1
Sorry, by "statistical packages" I meant other major languages/environments.
I can't find any single function that does this in R, Python, or SAS. As far as I can tell, R and Python require the data frame to be stacked and then passed to `table` and `pandas.crosstab`, in the same way that I've done here. In SAS it looks like this can be done by writing a custom call to `proc tabulate`.
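The stack-then-tabulate approach described above can be sketched in plain Julia. This is only an illustration under assumptions: `stacked_counts` is a hypothetical name, it uses Base alone, and it is not the code in this PR.

```julia
# Hypothetical helper, Base only: stack the columns into (value, column)
# pairs, then cross-tabulate those pairs.
function stacked_counts(cols::Vector{<:AbstractVector})
    # Long ("stacked") form: one (value, column index) pair per cell.
    stacked = [(v, j) for (j, col) in enumerate(cols) for v in col]
    vals = sort(unique(first.(stacked)))
    row = Dict(v => i for (i, v) in enumerate(vals))
    counts = zeros(Int, length(vals), length(cols))
    for (v, j) in stacked
        counts[row[v], j] += 1
    end
    return vals, counts
end

# Same data as the s1/s2/s3 example above.
s1 = repeat([1, 2, 3], inner=4)
s2 = repeat([1, 2, 3], outer=4)
s3 = fill(9, 12)
vals, counts = stacked_counts([s1, s2, s3])
```

Here `vals` holds the row labels and `counts` the value-by-column matrix; wrapping them in a `NamedArray` is left out to keep the sketch dependency-free.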
arrays
julia> data
12×4 Array{Symbol,2}:
:a :a :a :d
:a :b :a :d
:a :c :a :d
:a :a :a :d
:b :b :a :d
:b :c :a :d
:b :a :a :d
:b :b :a :d
:c :c :a :d
:c :a :a :d
:c :b :a :d
:c :c :a :d
julia> @test colwisecounts(data) == NamedArray(a, (rows, columns), ("value", "column"))
Test Passed
Expression: colwisecounts(data) == NamedArray(a,(rows,columns),("value","column"))
Evaluated: 4×4 Named Array{Int64,2}
value ╲ column │ 1 2 3 4
───────────────┼───────────────
a │ 4 4 12 0
b │ 4 4 0 0
c │ 4 4 0 0
d │ 0 0 0 12 == 4×4 Named Array{Int64,2}
value ╲ column │ 1 2 3 4
───────────────┼───────────────
a │ 4 4 12 0
b │ 4 4 0 0
c │ 4 4 0 0
d │ 0 0 0 12
julia> @test rowwisecounts(data) == NamedArray(a, (rows, columns), ("row", "value"))
Test Passed
Expression: rowwisecounts(data) == NamedArray(a,(rows,columns),("row","value"))
Evaluated: 12×4 Named Array{Int64,2}
row ╲ value │ a b c d
────────────┼───────────
1 │ 3 0 0 1
2 │ 2 1 0 1
3 │ 2 0 1 1
4 │ 3 0 0 1
5 │ 1 2 0 1
6 │ 1 1 1 1
7 │ 2 1 0 1
8 │ 1 2 0 1
9 │ 1 0 2 1
10 │ 2 0 1 1
11 │ 1 1 1 1
12 │ 1 0 2 1 == 12×4 Named Array{Int64,2}
row ╲ value │ a b c d
────────────┼───────────
1 │ 3 0 0 1
2 │ 2 1 0 1
3 │ 2 0 1 1
4 │ 3 0 0 1
5 │ 1 2 0 1
6 │ 1 1 1 1
7 │ 2 1 0 1
8 │ 1 2 0 1
9 │ 1 0 2 1
10 │ 2 0 1 1
11 │ 1 1 1 1
12 │ 1 0 2 1
dataframes
julia> data
12×4 DataFrames.DataFrame
│ Row │ sample1 │ sample2 │ sample3 │ sample4 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ a │ a │ a │ d │
│ 2 │ a │ b │ a │ d │
│ 3 │ a │ c │ a │ d │
│ 4 │ a │ a │ a │ d │
│ 5 │ b │ b │ a │ d │
│ 6 │ b │ c │ a │ d │
│ 7 │ b │ a │ a │ d │
│ 8 │ b │ b │ a │ d │
│ 9 │ c │ c │ a │ d │
│ 10 │ c │ a │ a │ d │
│ 11 │ c │ b │ a │ d │
│ 12 │ c │ c │ a │ d │
julia> @test colwisecounts(data) == NamedArray(a, (rows, columns), ("value", "column"))
Test Passed
Expression: colwisecounts(data) == NamedArray(a,(rows,columns),("value","column"))
Evaluated: 4×4 Named Array{Int64,2}
value ╲ column │ sample1 sample2 sample3 sample4
───────────────┼───────────────────────────────────
a │ 4 4 12 0
b │ 4 4 0 0
c │ 4 4 0 0
d │ 0 0 0 12 == 4×4 Named Array{Int64,2}
value ╲ column │ sample1 sample2 sample3 sample4
───────────────┼───────────────────────────────────
a │ 4 4 12 0
b │ 4 4 0 0
c │ 4 4 0 0
d │ 0 0 0 12
julia> @test rowwisecounts(data) == NamedArray(a, (rows, columns), ("row", "value"))
Test Passed
Expression: rowwisecounts(data) == NamedArray(a,(rows,columns),("row","value"))
Evaluated: 12×4 Named Array{Int64,2}
row ╲ value │ a b c d
────────────┼───────────
1 │ 3 0 0 1
2 │ 2 1 0 1
3 │ 2 0 1 1
4 │ 3 0 0 1
5 │ 1 2 0 1
6 │ 1 1 1 1
7 │ 2 1 0 1
8 │ 1 2 0 1
9 │ 1 0 2 1
10 │ 2 0 1 1
11 │ 1 1 1 1
12 │ 1 0 2 1 == 12×4 Named Array{Int64,2}
row ╲ value │ a b c d
────────────┼───────────
1 │ 3 0 0 1
2 │ 2 1 0 1
3 │ 2 0 1 1
4 │ 3 0 0 1
5 │ 1 2 0 1
6 │ 1 1 1 1
7 │ 2 1 0 1
8 │ 1 2 0 1
9 │ 1 0 2 1
10 │ 2 0 1 1
11 │ 1 1 1 1
12 │ 1 0 2 1
This satisfies the behavior I was looking for. Any thoughts on names?
Ah, too bad. Actually, the distinctive feature of this function is not that it's columnwise, since `freqtable` acts the same: it's that it puts one-way tables side by side instead of computing a cross table. So I think it would make sense to add a keyword argument `cross=true` to `freqtable`, which when set to `false` would give the behavior you want. For consistency with the current `freqtable` behaviour, it would accept the same inputs: vectors, or a data frame plus column names. When only a data frame is passed, all columns would be used; this behaviour could also be added for `cross=true` later (with the result of crossing all variables).
Please implement this internally by calling `freqtable` on each variable to build the one-way tables, and merging them only at the end. That will avoid code duplication. I've put a lot of work into making these generic and efficient, and I wouldn't want to maintain two parallel code bases. For clarity, these can of course be separated into two functions internally.
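A rough sketch of that structure, assuming nothing about the real `freqtable` internals: `oneway` below is a hypothetical Base-only stand-in for a one-way frequency table, and `sidebyside` is a hypothetical name for the merging step.

```julia
# Hypothetical stand-in for a one-way frequency table (Base only).
function oneway(x::AbstractVector)
    d = Dict{eltype(x),Int}()
    for v in x
        d[v] = get(d, v, 0) + 1
    end
    return d
end

# Build one table per variable, then merge them side by side at the end,
# filling in 0 where a value does not occur in a column.
function sidebyside(cols::AbstractVector...)
    tables = [oneway(c) for c in cols]
    vals = sort(unique(vcat([collect(keys(t)) for t in tables]...)))
    counts = [get(t, v, 0) for v in vals, t in tables]
    return vals, counts
end

# Same data as the sample1..sample4 example above.
sample1 = repeat([:a, :b, :c], inner=4)
sample2 = repeat([:a, :b, :c], outer=4)
sample3 = fill(:a, 12)
sample4 = fill(:d, 12)
vals, counts = sidebyside(sample1, sample2, sample3, sample4)
```

Keeping the per-variable counting in one function means the merging step only handles label alignment, which is the point of the request to avoid two parallel code paths.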
Do you think this isn't needed after all, or is it just that you have other priorities?
Both. I don't think it's very important relative to getting everything working with `Nullable`s, and the reason I wanted this functionality is no longer relevant (I wanted to translate course material from R to Julia, but I'm not in that class anymore). If I need it again I'll re-open the pull request and finish it off.
Hi Milan,
This pull request would enable FreqTables to work on a DataFrame where multiple columns are of the same type, and the user would like to tabulate the frequency of each unique value among the columns.
Current behavior
Proposed behavior
I'm not sure how often this use case would come up for others, but it came up for me, so I thought I'd see if you were interested in it. I've written a test for this and added an example to the readme.
It's worth mentioning that this fails when using a DataFrame of mixed types (in this example, mixed strings and floats). If you have any suggestions on how to handle this case, I would be interested to hear them!