pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistent Round behavior #15898

Open ek-ex opened 3 months ago

ek-ex commented 3 months ago

Checks

Reproducible example


```python
import polars as pl

df = pl.read_csv(
    "AAPL.csv",
    has_header=False,
    try_parse_dates=True,
    new_columns=["timestamp", "open", "high", "low", "close", "volume"],
    dtypes={"open": pl.Float32, "high": pl.Float32, "low": pl.Float32, "close": pl.Float32, "volume": pl.Float32},
)
df
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  0.9979      0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.9903      0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  0.9995      0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0003      1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0012      1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.430099  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
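
The CSV itself is not attached to the issue; as a reference only, a self-contained frame built from a few of the values printed above (not from the original file) shows the same display behaviour:

```python
import polars as pl

# Hypothetical miniature of the frame above, using values copied from the printed output.
df_small = pl.DataFrame(
    {"open": [0.9979, 1.0012, 164.430099, 164.479996]},
    schema={"open": pl.Float32},
)

df_small.with_columns(pl.col("open").round(3))
```
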
```python
df.with_columns(
    pl.col("open").round(3)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  0.998       0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.99        0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  0.999       0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.001       1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.429993  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
```python
df.with_columns(
    pl.col("open").round(1)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  1.0         0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  1.0         0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  1.0         0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0         1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.399994  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.399994  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.5       164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.5       164.479996  164.429993  164.449997  600.0
```

```python
df.with_columns(
    pl.col("open").round(2)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  1.0         0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.99        0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  1.0         0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0         1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.429993  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
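
The pattern in the 2024 rows follows from what Float32 can actually store: the decimal 164.43 is not representable, so `round(3)` yields the nearest Float32, which the default formatting then prints with more digits. A small round-trip sketch using only the Python standard library (not Polars) illustrates the stored values:

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip a Python float through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_f32(0.998))   # 0.9980000257492065  -> displayed by Polars as 0.998
print(to_f32(164.43))  # 164.42999267578125  -> displayed by Polars as 164.429993
```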

Log output

No response

Issue description

I'm using the round function on a column of floats with many decimals. However, sometimes round works as expected and sometimes it doesn't.

Expected behavior

The round function should be applied consistently to all rows.

Installed versions

```
--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
Julian-J-S commented 3 months ago

Hi @ek-ex,

I can reproduce this.

The problem, though, is not the `round` function but rather the display/formatting of f32 data.

```python
pl.Config.set_fmt_str_lengths(100)

DATA = [1.0, 1.2, 1.3, 1.4, 1.5, 100.1, 100.2, 100.3, 100.4, 100.5]

pl.DataFrame(
    {"f32": DATA, "f64": DATA},
    schema={"f32": pl.Float32, "f64": pl.Float64},
).with_columns(
    f32_decimals=pl.col("f32").map_elements(lambda x: f"{x:.20f}"),
    f64_decimals=pl.col("f64").map_elements(lambda x: f"{x:.20f}"),
)

# shape: (10, 4)
# ┌────────────┬───────┬──────────────────────────┬──────────────────────────┐
# │ f32        ┆ f64   ┆ f32_decimals             ┆ f64_decimals             │
# │ ---        ┆ ---   ┆ ---                      ┆ ---                      │
# │ f32        ┆ f64   ┆ str                      ┆ str                      │
# ╞════════════╪═══════╪══════════════════════════╪══════════════════════════╡
# │ 1.0        ┆ 1.0   ┆ 1.00000000000000000000   ┆ 1.00000000000000000000   │
# │ 1.2        ┆ 1.2   ┆ 1.20000004768371582031   ┆ 1.19999999999999995559   │
# │ 1.3        ┆ 1.3   ┆ 1.29999995231628417969   ┆ 1.30000000000000004441   │
# │ 1.4        ┆ 1.4   ┆ 1.39999997615814208984   ┆ 1.39999999999999991118   │
# │ 1.5        ┆ 1.5   ┆ 1.50000000000000000000   ┆ 1.50000000000000000000   │
# │ 100.099998 ┆ 100.1 ┆ 100.09999847412109375000 ┆ 100.09999999999999431566 │
# │ 100.199997 ┆ 100.2 ┆ 100.19999694824218750000 ┆ 100.20000000000000284217 │
# │ 100.300003 ┆ 100.3 ┆ 100.30000305175781250000 ┆ 100.29999999999999715783 │
# │ 100.400002 ┆ 100.4 ┆ 100.40000152587890625000 ┆ 100.40000000000000568434 │
# │ 100.5      ┆ 100.5 ┆ 100.50000000000000000000 ┆ 100.50000000000000000000 │
# └────────────┴───────┴──────────────────────────┴──────────────────────────┘
```

💡 Important Concept

Float32 (like Float64) cannot represent most decimal fractions exactly: what is stored is the nearest binary float, and what the table shows is only a (possibly shortened) textual rendering of that stored value. `round` itself behaves correctly; the rounded result simply cannot always be stored exactly either.

Solution

If you need exact decimal values, use the Decimal dtype instead of Float32/Float64.
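
A minimal sketch of that approach (the Decimal dtype is still marked as unstable in Polars, and the `scale` chosen here is just for illustration, not part of the original comment):

```python
import polars as pl

df = pl.DataFrame({"open": [0.9979, 164.430099]})

# Cast to Decimal so the stored value is an exact decimal rather than a binary float.
df.with_columns(
    open_dec=pl.col("open").cast(pl.Decimal(scale=3)),
)
```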

ritchie46 commented 3 months ago

I am not sure we need to take action on this. A DataFrame string representation is meant to show concise information. If you require more control over how it is visualized, you can set the float formatting options.
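
A sketch of what those options look like (assuming a Polars version where the `fmt_float` and `float_precision` Config settings are available; this snippet is not part of the original comment):

```python
import polars as pl

s = pl.Series("open", [164.430099, 0.9979], dtype=pl.Float32)

# Show the full, non-shortened float representation in table output.
with pl.Config(fmt_float="full"):
    print(s.to_frame())

# Or pin the number of decimals shown.
with pl.Config(float_precision=3):
    print(s.to_frame())
```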

yusufuyanik1 commented 3 months ago

I am not sure if this is the same issue, but the results look very inconsistent when I access them:

df = pl.DataFrame({"index": [1,2,3,4,5]})
df = df.with_columns(progress = pl.col("index") / pl.len())
df.get_column("progress").to_list()

This returns `[0.2, 0.4, 0.6000000000000001, 0.8, 1.0]`.
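
For context, that third value is ordinary IEEE-754 double-precision behaviour rather than anything specific to Polars; plain Python produces the same artefact:

```python
# 0.2 has no exact binary representation, so small errors surface in the repr.
print(3 * 0.2)        # 0.6000000000000001
print(0.1 + 0.2)      # 0.30000000000000004
print(f"{0.6:.20f}")  # 0.59999999999999997780
```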