posit-dev / positron

Positron, a next-generation data science IDE

Data Explorer: Summary statistics heuristics for precision #2339

Open jthomasmock opened 4 months ago

jthomasmock commented 4 months ago

Problem Space: How to handle decimal precision across extremely broad ranges of possible data.

Guiding principles:

Tasks

Very large data:

Very small data (<1):

5.  e-2 
5.67e-2 
2.7 e-8

Alternatively, we could go with only necessary scientific notation, but I think that the consistent scientific notation is a bit cleaner.

0.05 
0.0567
2.7 e-8
0.05
0.0567
0.0113
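A rough sketch of the "consistent scientific notation" variant, written in Python only for illustration. The function name `sci_column` and the `sig_digits` parameter are hypothetical and are not part of tibble, pillar, or Positron. It assumes finite, non-negative values and pads each mantissa to a common width so the exponents line up.

```python
def sci_column(values, sig_digits=3):
    """Format values in scientific notation with mantissas padded to one width.

    Hypothetical helper: assumes finite, non-negative floats.
    """
    parts = []
    for v in values:
        # "5.00e-02" -> mantissa "5.00", exponent "-02"
        mantissa, exponent = f"{v:.{sig_digits - 1}e}".split("e")
        # Trim trailing zeros but keep the decimal point: "5.00" -> "5."
        mantissa = mantissa.rstrip("0")
        # int() drops the zero padding of the exponent: "-02" -> -2
        parts.append((mantissa, int(exponent)))
    width = max(len(m) for m, _ in parts)
    return [f"{m.ljust(width)}e{e}" for m, e in parts]


for line in sci_column([0.05, 0.0567, 2.7e-8]):
    print(line)
```

With the three values from the example above, this reproduces the padded layout shown there.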

I think it would be useful to coordinate some of the existing logic/heuristics that tibble and pillar use:

We should be able to apply extremely similar numerical handling for sane defaults.


Backend

We may need to handle rounding or even display on the backend.

wesm commented 2 months ago

The summary statistics are pretty ugly now. Part of the fix will be returning the unformatted numbers from the backend and handling the formatting in the UI.

wesm commented 1 month ago

We should have a thin space between each three-digit group, i.e. 1000000. becomes 1 000 000. with thinner spaces. We'd still need to be careful that values stay aligned at the decimal place across rows.

This avoids the major locale problem of , meaning a decimal point in much of Europe.

Since we moved to fixed-width fonts, this thin-space solution won't work anymore. My first-principles approach would be to add formatting options to the get_data_values request (e.g. pass the thousands separator and decimal point that you want based on the application locale). Thoughts? cc @jmcphers
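To make the formatting-options idea concrete, here is a minimal sketch in Python. `FormatOptions` and `format_value` are hypothetical names invented for illustration; they are not the actual get_data_values schema, and the `max_decimals` field is an extra assumption. The caller passes the separators it wants and the formatter applies them in a single pass, so swapping "," and "." cannot collide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatOptions:
    """Hypothetical per-request options; not the real get_data_values schema."""
    thousands_sep: str = ","
    decimal_point: str = "."
    max_decimals: int = 2


def format_value(value: float, opts: FormatOptions) -> str:
    # Format with Python's built-in "," grouping and "." decimal point first.
    text = f"{value:,.{opts.max_decimals}f}"
    # translate() maps each character independently in one pass, so
    # replacing "," with "." and "." with "," at the same time is safe.
    table = str.maketrans({",": opts.thousands_sep, ".": opts.decimal_point})
    return text.translate(table)


print(format_value(1234567.891, FormatOptions()))
print(format_value(1234567.891, FormatOptions(thousands_sep=".", decimal_point=",")))
```

Passing `thousands_sep="_"` would give the pillar-style underscore grouping with the same code path.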

jthomasmock commented 1 month ago

Yah, eventually we may want to make it configurable, or take an approach like pillar's with underscores instead of spaces. That is tricky for copy-paste out, though.


jmcphers commented 1 month ago

How about adding thin spaces using spans with padding but no contents? Those will copy out cleanly but will also let us format things nicely.

1<span class="tinyspace"></span>000<span class="tinyspace"></span>000

We'd obviously need to add these on the frontend. That's probably fine as long as we know that the column type is numeric and the value is parseable as such.
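A sketch of the empty-span grouping, in Python for brevity even though the real work would happen in the frontend. The function name `group_with_spans` and the "plainly numeric" regular expression are assumptions for illustration; only the `tinyspace` class name comes from the comment above, and its styling is left to the stylesheet.

```python
import re

TINY = '<span class="tinyspace"></span>'
# Assumption: only plain decimal strings such as "-1234.5" are grouped.
NUMERIC = re.compile(r"-?\d+(\.\d+)?")
# Zero-width positions preceded by a digit and followed by whole groups
# of three digits up to the end of the integer part.
GROUP_BOUNDARY = re.compile(r"(?<=\d)(?=(?:\d{3})+$)")


def group_with_spans(text: str) -> str:
    """Insert an empty spacer span between three-digit groups."""
    if not NUMERIC.fullmatch(text):
        return text  # not parseable as a plain number: leave untouched
    sign = "-" if text.startswith("-") else ""
    integer, dot, fraction = text.lstrip("-").partition(".")
    grouped = GROUP_BOUNDARY.sub(TINY, integer)
    return f"{sign}{grouped}{dot}{fraction}"


print(group_with_spans("1000000"))
```

Because the spans have no text content, copying the rendered cell should yield the original digits, which is the property the suggestion relies on.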

wesm commented 1 month ago

We would have to put some kind of placeholder Unicode character in the output on the backend so that the frontend can reliably replace it with the HTML display formatting that we want.
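One way the placeholder handoff could look, sketched in Python. The thread does not name a character, so U+2063 (INVISIBLE SEPARATOR) is picked here purely as an assumption, and `backend_format`, `frontend_render`, and `plain_text` are hypothetical names.

```python
# Assumption: U+2063 INVISIBLE SEPARATOR as the grouping placeholder.
GROUP_MARK = "\u2063"
TINY = '<span class="tinyspace"></span>'


def backend_format(value: int) -> str:
    """Backend side: group digits, marking boundaries with the placeholder."""
    return f"{value:,}".replace(",", GROUP_MARK)


def frontend_render(text: str) -> str:
    """Frontend side: swap each placeholder for the display markup."""
    return text.replace(GROUP_MARK, TINY)


def plain_text(text: str) -> str:
    """Copy/paste side: drop the placeholders to recover the bare digits."""
    return text.replace(GROUP_MARK, "")


wire = backend_format(1000000)
print(frontend_render(wire))
print(plain_text(wire))
```

The design point is that the frontend no longer has to parse numbers itself; it only needs to recognize one character that cannot occur in ordinary numeric output.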

wesm commented 1 month ago

What else do we want to try to do this week from where things are now?

jthomasmock commented 1 month ago

> What else do we want to try to do this week from where things are now?

The Python formatting in summary stats looks remarkably good -- thanks for all the PRs!

The only thing I see missing is support for missing values and categorical types, which I think is captured in #2161.

jthomasmock commented 1 month ago

@wesm it does stick out to me a bit that we're adding a lot of significant figures in the decimals. I think it'd be nice if, for values > 1, we avoid printing more than 2 decimal places, i.e. 1.23 or 1.00 is ok but 1.23456 or 1.00000 is a bit much.
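A minimal sketch of that rule in Python. `format_stat` and its parameters are hypothetical, the cutoff is read as "absolute value of at least 1", and the use of three significant digits for smaller values is an assumption rather than something settled in this thread.

```python
def format_stat(value: float, max_decimals: int = 2, sig_digits: int = 3) -> str:
    """Cap decimals for large values; keep significant digits for small ones."""
    if abs(value) >= 1 or value == 0:
        # At or above 1 in magnitude: fixed number of decimal places.
        return f"{value:.{max_decimals}f}"
    # Between -1 and 1: keep a fixed number of significant digits instead,
    # so small values such as 0.0567 are not rounded away.
    return f"{value:.{sig_digits}g}"


for x in (1.23456, 1.0, 0.0567):
    print(format_stat(x))
```

Note that this alone does not give decimal alignment between the two branches; it only limits the noise in the digits.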


wesm commented 1 month ago

> I think it'd be nice if > 1, then avoid printing more than 2 decimal places, ie 1.23 or 1.00 is ok but 1.23456 or 1.00000 is a bit much.

Right -- we discussed this a bit in the past. If we want decimal alignment with numbers > 1 and small numbers between 1 and -1, we either:

I will go ahead and do option 2 until we are ready to implement #3325.

I am also not sure why the numbers are left-aligned all of a sudden; that looks like a bug to me. cc @softwarenerd. I see this is #3376.