open-meteo / python-requests

Open-Meteo Python Library using `requests`
MIT License

Polars example #36

Closed MantraMedia closed 4 months ago

MantraMedia commented 11 months ago

Polars is an insanely fast DataFrame library and well suited to analysing Open-Meteo data.

If you are interested, I can create a merge request with an example like this, showing how to use it.

import polars as pl
from datetime import datetime, timezone

...

# Shift the epoch timestamps into the requested timezone
utc_offset_seconds = response.UtcOffsetSeconds()
start_time = hourly.Time() + utc_offset_seconds
end_time = hourly.TimeEnd() + utc_offset_seconds
interval_seconds = hourly.Interval()

# datetime.utcfromtimestamp() is deprecated since Python 3.12;
# build naive UTC datetimes from an aware datetime instead
start_datetime = datetime.fromtimestamp(start_time, tz=timezone.utc).replace(tzinfo=None)
end_datetime = datetime.fromtimestamp(end_time, tz=timezone.utc).replace(tzinfo=None)

interval_str = f"{interval_seconds}s"

# One row per interval; left-closed, so the end timestamp is excluded
df = pl.DataFrame({'date': pl.datetime_range(
    start=start_datetime,
    end=end_datetime,
    interval=interval_str,
    closed='left',
    eager=True,
)}).lazy()

# Attach one column per requested hourly variable
for i, variable in enumerate(params["hourly"]):
    values = response.Hourly().Variables(i).ValuesAsNumpy()
    df = df.with_columns(pl.Series(values).alias(variable))

df = df.collect()
print(df)
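For reference, the left-closed datetime range built above can be reproduced with the standard library alone. This is a sketch with dummy epoch values standing in for the API response, just to make the `closed='left'` semantics concrete:

```python
from datetime import datetime, timezone, timedelta

# Dummy values standing in for response.UtcOffsetSeconds(), hourly.Time(), etc.
utc_offset_seconds = 0
start_time = 1704067200 + utc_offset_seconds   # 2024-01-01 00:00 UTC
end_time = 1704078000 + utc_offset_seconds     # 2024-01-01 03:00 UTC
interval_seconds = 3600

start_datetime = datetime.fromtimestamp(start_time, tz=timezone.utc).replace(tzinfo=None)
end_datetime = datetime.fromtimestamp(end_time, tz=timezone.utc).replace(tzinfo=None)

# Left-closed: include the start, exclude the end (matches closed='left')
dates = []
current = start_datetime
while current < end_datetime:
    dates.append(current)
    current += timedelta(seconds=interval_seconds)

print(dates)  # three hourly timestamps: 00:00, 01:00, 02:00
```

Each row of the resulting DataFrame then lines up with one entry of `ValuesAsNumpy()`.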
patrick-zippenfenig commented 10 months ago

Hi, sorry for the late reply. Sure, a code example on how to use it in combination with Polars would be a great addition!

It looks very close to Pandas; could you quickly highlight the differences to Pandas? Thanks!

MantraMedia commented 10 months ago

Polars is written in Rust and exposes a Python interface.

It is highly optimized for performance, with full multi-core utilization and careful memory management.

One of the most important parts is lazy evaluation, which runs a query optimizer over the whole pipeline before producing the result.

In the example above you can see that I make the DataFrame lazy and only collect it at the end.

An example with silly stats for Aspen, Colorado for the past 50 years (this has to come before the `df.collect()` above, so that `df` is still in its lazy state):

aspen_stats = (
    df
    # Filter rows where temperature is below 0 and it's snowing
    .filter((pl.col("temperature_2m") < 0) & (pl.col("snowfall") > 0))
    # Add a new column for wind chill factor and extract the week
    .with_columns([
        # Wind chill index (temperature in degrees C, wind speed in km/h)
        (13.12
         + 0.6215 * pl.col("temperature_2m")
         - 11.37 * pl.col("wind_speed_10m").pow(0.16)
         + 0.3965 * pl.col("temperature_2m") * pl.col("wind_speed_10m").pow(0.16)
         ).alias("wind_chill"),
        pl.col("date").dt.week().alias("WeatherWeek")
    ])
    # Group by the extracted week
    .group_by("WeatherWeek")
    .agg([
        pl.col("temperature_2m").mean().alias("avg_temperature"),
        pl.col("precipitation").sum().alias("total_precipitation"),
        pl.col("wind_chill").min().alias("min_wind_chill")
    ])
    # Sort by the extracted week
    .sort("WeatherWeek")
)

print(aspen_stats.collect())
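The wind-chill expression above is the standard North American wind chill index (temperature in degrees C, wind speed in km/h). As a plain-Python sanity check of the same formula, with made-up sample values:

```python
def wind_chill(temp_c: float, wind_kmh: float) -> float:
    """Wind chill index; valid roughly for temp <= 10 degrees C and wind > 4.8 km/h."""
    return (13.12
            + 0.6215 * temp_c
            - 11.37 * wind_kmh ** 0.16
            + 0.3965 * temp_c * wind_kmh ** 0.16)

# A cold, windy Aspen morning: -5 degrees C with a 20 km/h wind feels noticeably colder
felt = wind_chill(-5.0, 20.0)
print(round(felt, 1))
```

The scalar arithmetic is identical to the Polars expression; Polars just evaluates it vectorized over the whole column.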

The execution time is 3.46 milliseconds on a stock Hetzner Ryzen server.

I ditched Pandas hours after I first tried it, so I cannot say a lot, but I let ChatGPT do the comparison:

| Feature                | Polars                                           | Pandas                                           |
|------------------------|--------------------------------------------------|--------------------------------------------------|
| Execution Model        | Supports both eager and lazy execution           | Eager execution                                  |
| Memory Management      | Memory-efficient with zero-copy operations       | Can be memory-intensive                          |
| Performance            | Generally faster, optimized for large datasets   | Slower with large datasets                       |
| API Design             | Streamlined API focused on performance           | Rich API, wide range of use cases                |
| Data Types             | Strong typing with Apache Arrow's data types     | Wide variety of data types                       |
| String Manipulation    | Highly optimized and vectorized string functions | Not vectorized by default                        |
| Multi-threading        | Multi-threaded, utilizes all CPU cores           | Single-threaded                                  |
| File I/O               | Fast I/O for large datasets, efficient with Parquet | Supports various formats, less efficient for big data |
| Window Functions & Group-By | Powerful and efficient window functions and group-by | Supports but can be less efficient            |
| Join Operations        | Fast join operations with SQL-like syntax        | SQL-like join operations                         |
| Query Optimizer        | Query optimizer in lazy mode                     | No explicit query optimizer                     |
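The eager-vs-lazy row is the same idea as lists vs generators in plain Python: a lazy pipeline only describes the computation, and nothing runs until you collect, which is what gives the optimizer room to rearrange the plan first. A toy stdlib analogy (not Polars code):

```python
from itertools import islice

data = range(1_000_000)

# Eager: each step materializes a full intermediate list (Pandas-style)
eager = [x * 2 for x in data]
eager = [x for x in eager if x % 3 == 0]
eager_result = sum(eager[:10])

# Lazy: generators only chain descriptions; work happens at the end,
# analogous to building up a LazyFrame and calling .collect()
doubled = (x * 2 for x in data)
filtered = (x for x in doubled if x % 3 == 0)
lazy_result = sum(islice(filtered, 10))

print(eager_result == lazy_result)  # same answer, far less work done lazily
```

The lazy version never builds the million-element intermediates; Polars goes further and also reorders and fuses the steps before executing.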
github-actions[bot] commented 4 months ago

:tada: This issue has been resolved in version 1.2.1 :tada:

The release is available on:

Your semantic-release bot :package::rocket: