posit-dev / positron

Positron, a next-generation data science IDE

Epic: Data Explorer Summary Panel statistics #2161

Open jthomasmock opened 7 months ago

jthomasmock commented 7 months ago

When the Summary Panel is expanded, it will dynamically calculate and then reveal additional summary statistics for that specific column. This is a lazy operation in the backend as it would otherwise be costly for long/wide datasets.
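A minimal sketch of the lazy behavior described above, assuming the stats come from an asynchronous backend call. `LazyColumnStats`, `ColumnSummaryStats`, and `StatsFetcher` are hypothetical names for illustration, not Positron's actual API. Stats are requested only when a column is first expanded, and the pending result is cached so that repeated expansion does not trigger another backend round trip.

```typescript
// Hypothetical shape of the statistics returned for one column.
interface ColumnSummaryStats {
    nullCount: number;
    mean?: number;
    median?: number;
}

// Hypothetical backend call; in practice this would be a request to the backend.
type StatsFetcher = (columnIndex: number) => Promise<ColumnSummaryStats>;

class LazyColumnStats {
    private readonly cache = new Map<number, Promise<ColumnSummaryStats>>();
    private readonly fetchStats: StatsFetcher;

    constructor(fetchStats: StatsFetcher) {
        this.fetchStats = fetchStats;
    }

    // Called when a column's summary row is expanded.
    // Nothing is computed for columns that are never expanded.
    expand(columnIndex: number): Promise<ColumnSummaryStats> {
        let pending = this.cache.get(columnIndex);
        if (pending === undefined) {
            pending = this.fetchStats(columnIndex);
            this.cache.set(columnIndex, pending);
        }
        return pending;
    }

    // Drop cached results, for example after the underlying data changes.
    invalidate(): void {
        this.cache.clear();
    }
}
```

Caching the promise rather than the resolved value means two quick expansions of the same column share one in-flight request.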

Summary stats will be aligned at the decimal point, so the integer parts are right-aligned:

NA:           15
Median:       14
Mean:         15.7
SD:            2.1
Min:           1.2
Max:          20.3
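One way to produce this layout is sketched below, assuming the values have already been formatted as strings. `alignAtDecimal` is a hypothetical helper, not an existing Positron function. It right-aligns the integer part of each value and appends the fractional part, so the decimal points fall in the same column.

```typescript
// Hypothetical helper: align label/value pairs so the decimal points line up.
function alignAtDecimal(rows: Array<[string, string]>): string[] {
    // Split each pre-formatted value into its integer and fractional parts.
    const parts = rows.map(([label, value]) => {
        const dot = value.indexOf('.');
        return {
            label: `${label}:`,
            intPart: dot === -1 ? value : value.slice(0, dot),
            fracPart: dot === -1 ? '' : value.slice(dot),
        };
    });

    // Column widths are driven by the longest label and the longest integer part.
    const labelWidth = Math.max(...parts.map(p => p.label.length));
    const intWidth = Math.max(...parts.map(p => p.intPart.length));

    // Left-align labels, right-align integer parts, then append the fraction.
    return parts.map(p =>
        p.label.padEnd(labelWidth + 1) + p.intPart.padStart(intWidth) + p.fracPart
    );
}

const lines = alignAtDecimal([
    ['NA', '15'],
    ['Median', '14'],
    ['Mean', '15.7'],
    ['SD', '2.1'],
    ['Min', '1.2'],
    ['Max', '20.3'],
]);
console.log(lines.join('\n'));
```

The alignment only holds in a monospaced font; a proportional font would need a different approach, such as separate right- and left-aligned cells.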

Completed:

Parent Categorical: https://github.com/posit-dev/positron/issues/3417

Number

Boolean

String

String sub-category: Categorical/Factor

Date, Datetime, or Time

Array -- holding off for now

Struct -- holding off for now

Unknown -- holding off for now

softwarenerd commented 6 months ago

/**
 * Possible values for TypeDisplay in ColumnSchema
 */
export enum ColumnSchemaTypeDisplay {
    Number = 'number',
    Boolean = 'boolean',
    String = 'string',
    Date = 'date',
    Datetime = 'datetime',
    Time = 'time',
    Array = 'array',
    Struct = 'struct',
    Unknown = 'unknown'
}

jthomasmock commented 6 months ago

https://github.com/posit-dev/positron/blob/5143bd25007edccad12c8db7c69745b43593b38b/positron/comms/data_explorer-backend-openrpc.json#L333C7-L347C8

jthomasmock commented 6 months ago

@softwarenerd -- I've converted the headers above to the type_display enum values.

wesm commented 5 months ago

I'm working on improvements in the backend protocol to better support these statistics right now.

I'm not sure it makes sense to compute the number of unique values for arrays and structs for now -- there are varying degrees of ease of computing this in different backends, so I'll punt on that and we can address it later, once we can investigate how to compute it consistently.

jthomasmock commented 5 months ago

Sounds good! I also think it'd be interesting to hear from users about which metrics they'd like. I've indicated above that we're holding off on array, struct, and unknown columns for now.
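As a rough sketch of how a frontend could encode that decision, the helper below treats array, struct, and unknown columns as deferred. The `TypeDisplay` union mirrors the string values of the `ColumnSchemaTypeDisplay` enum quoted earlier in this thread; `DEFERRED_TYPES` and `supportsExpandedStats` are hypothetical names, not part of Positron.

```typescript
// String values mirroring the ColumnSchemaTypeDisplay enum quoted earlier in this thread.
type TypeDisplay =
    | 'number' | 'boolean' | 'string'
    | 'date' | 'datetime' | 'time'
    | 'array' | 'struct' | 'unknown';

// Types whose expanded statistics are on hold for now, per the discussion above.
const DEFERRED_TYPES: ReadonlySet<TypeDisplay> =
    new Set<TypeDisplay>(['array', 'struct', 'unknown']);

// Hypothetical helper: should the Summary Panel request expanded stats for this column?
function supportsExpandedStats(typeDisplay: TypeDisplay): boolean {
    return !DEFERRED_TYPES.has(typeDisplay);
}
```

Keeping the deferred types in one set means that enabling, say, array statistics later is a one-line change.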

jthomasmock commented 4 months ago

We can close this once #3021 is merged and validated.

petetronic commented 3 months ago

@jthomasmock do we have a good test data set that exercises all of the types, and thus the column summary statistics (including precision, null, empty, various types, etc.)?

We'd want QA to exhaustively cover these statistics to check their validity for the data set.

jthomasmock commented 3 months ago

> @jthomasmock do we have a good test data set that exercises all of the types, and thus the column summary statistics (including precision, null, empty, various types, etc.)?
>
> We'd want QA to exhaustively cover these statistics to check their validity for the data set.

I can work on this.

There are some example tests at: https://github.com/r-lib/pillar/blob/main/tests/testthat/test-format_decimal.R
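A fixture along the lines requested above might look like the following sketch. The column names, the sample values, and the `nullCount` and `emptyCount` helpers are all made up for illustration. The idea is one column per type, each with nulls and a few edge cases, plus simple reference calculations that a test can compare against what the Summary Panel reports.

```typescript
// Hypothetical fixture: one column per type, each including nulls and edge cases.
type Cell = number | boolean | string | null;

const fixture: Record<string, Cell[]> = {
    // Mixed magnitudes and precision, a negative value, and a null.
    number: [1.2, 20.3, 1e-10, 123456789.123, 0, -0.5, null],
    boolean: [true, false, null],
    // Includes an empty string and a whitespace-padded string.
    string: ['a', '', '  padded  ', null],
    // Dates kept as ISO 8601 strings in this sketch.
    date: ['2024-05-29', '1970-01-01', null],
};

// Reference null count to compare against the panel's NA value.
function nullCount(values: Cell[]): number {
    return values.filter(v => v === null).length;
}

// Reference count of empty strings.
function emptyCount(values: Cell[]): number {
    return values.filter(v => v === '').length;
}
```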

jmcphers commented 3 months ago

@jthomasmock is there still work to do for Beta on this now that #3021 is merged and validated? (we do need tests but we can close this without them)

jthomasmock commented 3 months ago

@jmcphers I think we are still missing date/datetime stats in: Positron Version: 2024.05.0 (Universal) build 1307

dfalbel commented 3 months ago

I could pick up the backend side for those. I'm assuming @wesm is not working on it yet, right?

wesm commented 3 months ago

I'm working on the float formatting as we speak, so feel free to pick this up.

jthomasmock commented 3 months ago

The checkboxes above are the missing stats as of 2024-05-29: Boolean, date, datetime, factor/categorical, and unknown.