ronisbr / PrettyTables.jl

Print data in formatted tables.
MIT License
391 stars 38 forks source link

matrix with named tuples is not recognized #220

Closed bagohart closed 8 months ago

bagohart commented 10 months ago

Hey, cool project! I mostly got the table working that I need, but since I found this bug, I thought I could as well report it.

This works:

pretty_table([(1,2) (3,4); (5,6) (7,8)])

This does not:

pretty_table([(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])

Instead, I get:

ERROR: The object does not have a valid Tables.jl implementation.

Expected behaviour: the argument to pretty_table should be recognized as a Matrix and printed.

ronisbr commented 10 months ago

Hi @bagohart !

Thanks! It should be an issue in Tables.jl. Tables.jl says this type complies with the interface, but it does not implement the required methods:

julia> using Tables

julia> M = [(a=1, b=2) (a=3, b=4); (a=5, b=6) (a=7, b=8)]
2×2 Matrix{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)  (a = 3, b = 4)
 (a = 5, b = 6)  (a = 7, b = 8)

julia> Tables.istable(M)
true

julia> Tables.rowaccess(M)
false

julia> Tables.columnaccess(M)
false

We can fix it temporarily by doing:

julia> import Tables

julia> Tables.columnaccess(x::Matrix{T}) where T<:NamedTuple = true

julia> pretty_table([(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])
┌───┬───┐
│ a │ b │
├───┼───┤
│ 1 │ 2 │
│ 5 │ 6 │
│ 3 │ 4 │
│ 7 │ 8 │
└───┴───┘

@bkamins, @quinnj can you help here?

bkamins commented 10 months ago

Two things I see:

  1. Tables.rowaccess(M) should return true (this can be a PR to Tables.jl)
  2. At the same time Tables.jl does not guarantee that either Tables.rowaccess or Tables.columnaccess return true. They can both return false for a valid table. (in which case I think internally in PrettyTables.jl you just need to pick if you go for column or row access.) Something like
    if Tables.isTable(M)
    if Tables.rowaccess(M)
      # do something when you have fast row access
    else
      # otherwise just assume column access
    end
    end
ronisbr commented 10 months ago

Hi @bkamins !

Maybe there is just a misunderstanding about Tables.jl from my part. Reading here https://tables.juliadata.org/stable/#Tables.jl-Documentation I thought that something compliant with Tables.jl API must implement:

  1. Tables.istable
  2. (Tables.rowacces and Tables.rows) or (Tables.columnaccess and Tables.column)

Am I wrong?

bkamins commented 10 months ago

You are wrong. Tables.rowacces guarantees fast row access. Tables.columnacces guarantees fast column access.

There can be tables that neither have fast row nor fast column assess (e.g. they have a completely different memory layout and both row and column access requires copying of the data).

At the same there are tables that return true in both cases (because they have layout that allows both approaches without copying):

julia> m = Tables.table([1 2; 3 4])
Tables.MatrixTable{Matrix{Int64}} with 2 rows, 2 columns, and schema:
 :Column1  Int64
 :Column2  Int64

julia> Tables.rowaccess(m)
true

julia> Tables.columnaccess(m)
true
bkamins commented 10 months ago

Also Tables.istable is only a weak contract:

Here is an example of a table that returns false for Tables.istable (but correctly gets materialized as a DataFrame):

julia> struct T
       a::Int
       b::Int
       end

julia> x = [T(1,2), T(3,4), T(5,6)]
3-element Vector{T}:
 T(1, 2)
 T(3, 4)
 T(5, 6)

julia> DataFrame(x)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4
   3 │     5      6

julia> Tables.istable(x)
false
ronisbr commented 10 months ago

Hi @bkamins,

I confess I am really confused right now. Reading here:

Captura de Tela 2023-09-12 às 17 25 10

It is crystal clear to me that if an object m follows Tables.jl API, which is indicated by returning Tables.istable(m) == true, it must implement Tables.columns(m) or Tables.rows(m). In the former, it also must return Tables.rowaccess(m) == true and in the later Tables.columnaccess(m) == true. By reading that documentation section, this assumption seems to be true even if I need to copy data.

If this assumption is not correct, I think Tables.jl documentation must be changed and I need help. For example, if m complies with Tables.jl, indicated by Tables.istable(m) == true, but both Tables.rowaccess and Tables.columnaccess is false, how am I supposed to access the data?

ronisbr commented 10 months ago

Notice that I am not talking about tables in general (as x in your example), but about objects that follows Tables.jl API.

bkamins commented 10 months ago

These are "brief descriptions". And indeed this is confusing, as the docstrings say something else:

help?> Tables.rowaccess
  Tables.rowaccess(x) => Bool

  Check whether an object has specifically defined that it implements the Tables.rows function that does not copy table data. That is to say, Tables.rows(x) must be done with O(1) time and space complexity
  when Tables.rowaccess(x) == true. Note that Tables.rows will work on any object that iterates AbstractRow-compatible objects, even if they don't define rowaccess, e.g. a Generator of NamedTuples. However,
  this generic fallback may copy the data from input table x. Also note that just because an object defines rowaccess doesn't mean a user should call Tables.rows on it; Tables.columns will also work,
  providing a valid AbstractColumns object from the rows. Hence, users should call Tables.rows or Tables.columns depending on what is most natural for them to consume instead of worrying about what and how
  the input is oriented.

  It is recommended that for users implementing MyType, they define only rowaccess(::Type{MyType}). rowaccess(::MyType) will then automatically delegate to this method.

help?> Tables.columnaccess
  Tables.columnaccess(x) => Bool

  Check whether an object has specifically defined that it implements the Tables.columns function that does not copy table data. That is to say, Tables.columns(x) must be done with O(1) time and space
  complexity when Tables.columnaccess(x) == true. Note that Tables.columns has generic fallbacks allowing it to produces AbstractColumns objects, even if the input doesn't define columnaccess. However, this
  generic fallback may copy the data from input table x. Also note that just because an object defines columnaccess doesn't mean a user should call Tables.columns on it; Tables.rows will also work, providing
  a valid AbstractRow iterator. Hence, users should call Tables.rows or Tables.columns depending on what is most natural for them to consume instead of worrying about what and how the input is oriented.

  It is recommended that for users implementing MyType, they define only columnaccess(::Type{MyType}). columnaccess(::MyType) will then automatically delegate to this method.

Which for me clearly state that the condition is "not copying data" in both cases (so if data would be copied false should be returned).

Let us wait for what @quinnj says.

ronisbr commented 10 months ago

Oh! I haven't seen those docstrings. From this point of view, you are right. So, I will wait @quinnj to verify how I should modify the current PrettyTables.jl interface with Tables.jl.

quinnj commented 10 months ago

@ronisbr, I agree that anyone implementing the interface should define rowaccess or columnaccess along with their definition of rows/columns. Note that when using the interface, rowaccess is not recommended to be checked/used externally, but is only used internally by other tables interface functions. The only thing users should be using from the interface is Tables.rows or Tables.columns.

There are potential advanced use-cases where using rowaccess/columnaccess externally may make sense (for example, the TableOperations.jl package does some tricks around using this, but mainly for passing it through to downstream functions: https://github.com/JuliaData/TableOperations.jl/blob/fab3e72e53d1faf976120de360a5e492fe1e746b/src/TableOperations.jl#L92). But I would say those use-cases should be rare and carefully constructed.

Hopefully that helps clarify?

ronisbr commented 10 months ago

Hi @quinnj !

Thank you very much for the detailed information. I think that is precisely the case with PrettyTables.jl. If we check for row or column access, we can print those tables way faster than just using one type of iterator by default, which can lead to copying a lot of data without necessity.

So, I will use the following algorithm to get the [i, j] element:

1. If we have columns access:
    1. Get the column name related to the j-th column using `Tables.columnnames`.
    2. Get the data using `Tables.getcolumn(table, name)[i]`.
2. If we have row access:
    1. Get the column name related to the j-th column using `Tables.columnnames`.
    2. Try using `subset` to obtain the i-th row. If it fails, fall back to iterating `Tables.rows` for i times.
    3. Get the element using `row[column_name]`.

Does it make sense?

quinnj commented 10 months ago

Yes, that sounds roughly correct. One detail I'd mention is that I think it's fine to check Tables.columnaccess but if false, I wouldn't then check Tables.rowaccess, but just assume it's row-oriented and call Tables.subset, then fallback to Tables.rows if that fails. This would allow, for example, a generator of NamedTuples to still succeed (even though it returns false for Tables.rowaccess).

ronisbr commented 10 months ago

Perfect! Thanks @quinnj !

ronisbr commented 8 months ago

Using the approach proposed by @quinnj , now we can print this type:

julia> pretty_table([(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])
┌───┬───┐
│ a │ b │
├───┼───┤
│ 1 │ 2 │
│ 5 │ 6 │
│ 3 │ 4 │
│ 7 │ 8 │
└───┴───┘
ronisbr commented 8 months ago

Side note: you are not seeing what you would expect because this type follows Tables.jl API. If you want to see the matrix of named tuples, you can do:

julia> pretty_table(Any[(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])
┌────────────────┬────────────────┐
│         Col. 1 │         Col. 2 │
├────────────────┼────────────────┤
│ (a = 1, b = 2) │ (a = 3, b = 4) │
│ (a = 5, b = 6) │ (a = 7, b = 8) │
└────────────────┴────────────────┘
bkamins commented 8 months ago

What would a lazy version produce:

(x for x in [(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])
ronisbr commented 8 months ago

Hi @bkamins !

Sorry for the delay, I missed the notification.

This lazy version will not work since PrettyTables.jl does not know how to handler generators:

julia> a = (x for x in [(a=1,b=2) (a=3,b=4); (a=5,b=6) (a=7,b=8)])
Base.Generator{Matrix{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}, typeof(identity)}(identity, NamedTuple{(:a, :b), Tuple{Int64, Int64}}[(a = 1, b = 2) (a = 3, b = 4); (a = 5, b = 6) (a = 7, b = 8)])

julia> pretty_table(a)
ERROR: The type Base.Generator{Matrix{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}, typeof(identity)} is not supported.