pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistent Round behavior #15898

Open ek-ex opened 3 months ago

ek-ex commented 3 months ago

Checks

Reproducible example


```python
import polars as pl

df = pl.read_csv(
    "AAPL.csv",
    has_header=False,
    try_parse_dates=True,
    new_columns=["timestamp", "open", "high", "low", "close", "volume"],
    dtypes={"open": pl.Float32, "high": pl.Float32, "low": pl.Float32, "close": pl.Float32, "volume": pl.Float32},
)
df
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  0.9979      0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.9903      0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  0.9995      0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0003      1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0012      1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.430099  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
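
The CSV itself is not attached to the issue; as a reference only, a self-contained frame built from a few of the values printed above (not from the original file) shows the same display behaviour:

```python
import polars as pl

# Hypothetical miniature of the frame above, using values copied from the printed output.
df_small = pl.DataFrame(
    {"open": [0.9979, 1.0012, 164.430099, 164.479996]},
    schema={"open": pl.Float32},
)

df_small.with_columns(pl.col("open").round(3))
```
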
```python
df.with_columns(
    pl.col("open").round(3)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  0.998       0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.99        0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  0.999       0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.001       1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.429993  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
```python
df.with_columns(
    pl.col("open").round(1)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  1.0         0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  1.0         0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  1.0         0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0         1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.399994  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.399994  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.5       164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.5       164.479996  164.429993  164.449997  600.0
```

```python
df.with_columns(
    pl.col("open").round(2)
)
```
```
timestamp            open        high        low         close       volume
datetime[μs]         f32         f32         f32         f32         f32
2005-01-03 08:00:00  1.0         0.9984      0.9979      0.9984      45594.0
2005-01-03 08:02:00  0.99        0.9903      0.9903      0.9903      354001.0
2005-01-03 08:03:00  1.0         0.9996      0.9995      0.9996      19540.0
2005-01-03 08:04:00  1.0         1.0026      1.0003      1.0026      187845.0
2005-01-03 08:07:00  1.0         1.0012      1.001       1.001       58620.0
…                    …           …           …           …           …
2024-04-19 19:40:00  164.399994  164.399994  164.399994  164.399994  100.0
2024-04-19 19:43:00  164.429993  164.430099  164.430099  164.430099  600.0
2024-04-19 19:44:00  164.429993  164.440002  164.429993  164.440002  383.0
2024-04-19 19:47:00  164.479996  164.479996  164.479996  164.479996  445.0
2024-04-19 19:48:00  164.479996  164.479996  164.429993  164.449997  600.0
```
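
The pattern in the 2024 rows follows from what Float32 can actually store: the decimal 164.43 is not representable, so `round(3)` yields the nearest Float32, which the default formatting then prints with more digits. A small round-trip sketch using only the Python standard library (not Polars) illustrates the stored values:

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip a Python float through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_f32(0.998))   # 0.9980000257492065  -> displayed by Polars as 0.998
print(to_f32(164.43))  # 164.42999267578125  -> displayed by Polars as 164.429993
```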

Log output

No response

Issue description

I'm using the round function on a column of floats with many decimals. However, sometimes round works as expected and sometimes it doesn't.

Expected behavior

The round function should be applied consistently to all rows.

Installed versions

```
--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
Julian-J-S commented 3 months ago

Hi @ek-ex,

I can reproduce this.

The problem, though, is not the `round` function but rather the display/formatting of f32 data.

```python
pl.Config.set_fmt_str_lengths(100)

DATA = [1.0, 1.2, 1.3, 1.4, 1.5, 100.1, 100.2, 100.3, 100.4, 100.5]

pl.DataFrame(
    {"f32": DATA, "f64": DATA},
    schema={"f32": pl.Float32, "f64": pl.Float64},
).with_columns(
    f32_decimals=pl.col("f32").map_elements(lambda x: f"{x:.20f}"),
    f64_decimals=pl.col("f64").map_elements(lambda x: f"{x:.20f}"),
)

# shape: (10, 4)
# ┌────────────┬───────┬──────────────────────────┬──────────────────────────┐
# │ f32        ┆ f64   ┆ f32_decimals             ┆ f64_decimals             │
# │ ---        ┆ ---   ┆ ---                      ┆ ---                      │
# │ f32        ┆ f64   ┆ str                      ┆ str                      │
# ╞════════════╪═══════╪══════════════════════════╪══════════════════════════╡
# │ 1.0        ┆ 1.0   ┆ 1.00000000000000000000   ┆ 1.00000000000000000000   │
# │ 1.2        ┆ 1.2   ┆ 1.20000004768371582031   ┆ 1.19999999999999995559   │
# │ 1.3        ┆ 1.3   ┆ 1.29999995231628417969   ┆ 1.30000000000000004441   │
# │ 1.4        ┆ 1.4   ┆ 1.39999997615814208984   ┆ 1.39999999999999991118   │
# │ 1.5        ┆ 1.5   ┆ 1.50000000000000000000   ┆ 1.50000000000000000000   │
# │ 100.099998 ┆ 100.1 ┆ 100.09999847412109375000 ┆ 100.09999999999999431566 │
# │ 100.199997 ┆ 100.2 ┆ 100.19999694824218750000 ┆ 100.20000000000000284217 │
# │ 100.300003 ┆ 100.3 ┆ 100.30000305175781250000 ┆ 100.29999999999999715783 │
# │ 100.400002 ┆ 100.4 ┆ 100.40000152587890625000 ┆ 100.40000000000000568434 │
# │ 100.5      ┆ 100.5 ┆ 100.50000000000000000000 ┆ 100.50000000000000000000 │
# └────────────┴───────┴──────────────────────────┴──────────────────────────┘
```

💡 Important Concept

Float32 (like Float64) cannot represent most decimal fractions exactly: what is stored is the nearest binary float, and what the table shows is only a (possibly shortened) textual rendering of that stored value. `round` itself behaves correctly; the rounded result simply cannot always be stored exactly either.

Solution

If you need exact decimal values, use the Decimal dtype instead of Float32/Float64.
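
A minimal sketch of that approach (the Decimal dtype is still marked as unstable in Polars, and the `scale` chosen here is just for illustration, not part of the original comment):

```python
import polars as pl

df = pl.DataFrame({"open": [0.9979, 164.430099]})

# Cast to Decimal so the stored value is an exact decimal rather than a binary float.
df.with_columns(
    open_dec=pl.col("open").cast(pl.Decimal(scale=3)),
)
```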

ritchie46 commented 3 months ago

I am not sure we need to take action on this. A DataFrame string representation is meant to show concise information. If you require more control over how it is visualized, you can set the float formatting options.
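
A sketch of what those options look like (assuming a Polars version where the `fmt_float` and `float_precision` Config settings are available; this snippet is not part of the original comment):

```python
import polars as pl

s = pl.Series("open", [164.430099, 0.9979], dtype=pl.Float32)

# Show the full, non-shortened float representation in table output.
with pl.Config(fmt_float="full"):
    print(s.to_frame())

# Or pin the number of decimals shown.
with pl.Config(float_precision=3):
    print(s.to_frame())
```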

yusufuyanik1 commented 3 months ago

I am not sure if this is the same issue, but the results look very inconsistent when I access them:

df = pl.DataFrame({"index": [1,2,3,4,5]})
df = df.with_columns(progress = pl.col("index") / pl.len())
df.get_column("progress").to_list()

This returns `[0.2, 0.4, 0.6000000000000001, 0.8, 1.0]`.
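
For context, that third value is ordinary IEEE-754 double-precision behaviour rather than anything specific to Polars; plain Python produces the same artefact:

```python
# 0.2 has no exact binary representation, so small errors surface in the repr.
print(3 * 0.2)        # 0.6000000000000001
print(0.1 + 0.2)      # 0.30000000000000004
print(f"{0.6:.20f}")  # 0.59999999999999997780
```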