nalimilan / FreqTables.jl

Frequency tables in Julia
Other
90 stars 19 forks source link

proptable with margins drops dimension information #64

Closed bkamins closed 2 years ago

bkamins commented 3 years ago
julia> using DataFrames

julia> using FreqTables

julia> df = DataFrame(a=[1,1,1,2,2,2], b=[1,1,1,2,2,2])
6×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     1      1
   3 │     1      1
   4 │     2      2
   5 │     2      2
   6 │     2      2

julia> freqtable(df, :a, :b)
2×2 Named Matrix{Int64}
a ╲ b │ 1  2
──────┼─────
1     │ 3  0
2     │ 0  3

julia> proptable(df, :a, :b)
2×2 Named Matrix{Float64}
a ╲ b │   1    2
──────┼─────────
1     │ 0.5  0.0
2     │ 0.0  0.5

julia> proptable(df, :a, :b, margins=1)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0

also I have just checked that the example from the docstring is also incorrect:

julia> proptable([1, 2, 2, 2], [1, 1, 1, 2], margins=1)
2×2 Matrix{Float64}:
 1.0       0.0
 0.666667  0.333333
nalimilan commented 3 years ago

Maybe that's a bug in NamedArrays? We just use ./ here: https://github.com/nalimilan/FreqTables.jl/blob/d35f4b77f53d1650047473bd31fe41c0cbe1f7a3/src/freqtable.jl#L276

Cc: @davidavdav

bkamins commented 3 years ago

It is possible indeed that some broadcasting logic has changed in Julia Base (this happens from time to time - I have the same similar problems in DataFrames.jl).

davidavdav commented 3 years ago

If I recall correctly, NamedArrays has some logic to give the reduced dimension a name along the lines of "Sum(colname)".

If the sum(tbl...) gets replaced by sum(tbl.array..., does the problem persist?

—david

bkamins commented 3 years ago

This fixes the issue, but in the past it was working correctly.

@nalimilan the fix is:

return tbl ./ sum(tbl isa NamedArray ? tbl.array : tbl, dims=tuple(setdiff(1:N, margins)...)::NTuple{N-length(margins),Int})
nalimilan commented 3 years ago

Cool. Though that's a bug in NamedArrays, right? Do you think it can be fixed?

bkamins commented 3 years ago

I do not know NamedArrays.jl details enough, but for sure it worked in the past (I caught this issue when I was updating old tutorials), so there is an issue in NamedArrays.jl to be fixed.

davidavdav commented 3 years ago

Hi,

If someone can file an issue against NamedArrays with a minimum failing example (I know there is a nice word for that but I can't think of it right now) I'll have a look at it.

I get the feeling that somehow the dimension along which are to be summed get lost in translation.

—david

On Mon, May 17, 2021 at 10:25 AM Bogumił Kamiński @.***> wrote:

I do not know NamedArrays.jl details enough, but for sure it worked in the past (I caught this issue when I was updating old tutorials), so there is an issue in NamedArrays.jl to be fixed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nalimilan/FreqTables.jl/issues/64#issuecomment-842127321, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJ6DV5F62DZNCOZHJSRHT3TODHJFANCNFSM44XBWYOQ .

-- David van Leeuwen (다빛, 大卫, دافد)

bkamins commented 3 years ago

The issue is when dimension names are integers. See https://github.com/davidavdav/NamedArrays.jl/issues/110

davidavdav commented 3 years ago

I think the problem was not so much the integers (although that is asking for trouble, but alas, this case happens often in FreqTables), but rather that the types of left and right operands weren't exactly the same (the right has a string as index).

The fix uses the type of the left operand, I am not sure this is always going to be the correct choice.

bkamins commented 2 years ago

Closing as this is fixed