pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Can Polars implement an OHLC feature? #381

Closed: rts-gordon closed this issue 3 years ago

rts-gordon commented 3 years ago

Hi there, our team has used Python pandas to calculate OHLC in a trading system for a long time. Now we want to rebuild the system in Rust for higher performance, which is how we found Polars. But there are no OHLC functions in Polars. Would you be willing to implement OHLC in Polars, like the ohlc function in pandas?

Thank you very much.

ritchie46 commented 3 years ago

Forgive me if I am wrong, but isn't OHLC an aggregation of `first`, `max`, `min`, and `last`?

If so, you can do:

fn example(df: &DataFrame) -> Result<DataFrame> {
    df.downsample("datetime", SampleRule::Minute(6))?
        .agg(&[("foo_ohlm", &["first", "max", "min", "last"])])?
        .sort("datetime", false)
}

Or in Python:

def example(df):
    return df.downsample("a", rule="minute", n=5).agg({"b": ["first", "min", "max", "last"]})

rts-gordon commented 3 years ago

@ritchie46 Thanks for the quick reply. Yes, OHLC means first/max/min/last. But I am wondering whether SampleRule in Polars supports all time periods, such as 1min/5min/15min/30min/1hour/2hour/4hour/1week/1month. For example, when processing a tick at 2021-03-03 10:43:15, the time periods would start at:

1min: 2021-03-03 10:43:00
5min: 2021-03-03 10:40:00
15min: 2021-03-03 10:30:00
30min: 2021-03-03 10:30:00
1hour: 2021-03-03 10:00:00
2hour: 2021-03-03 10:00:00
4hour: 2021-03-03 08:00:00
1week: 2021-03-01 00:00:00
1month: 2021-03-01 00:00:00

Regards CHCP

ritchie46 commented 3 years ago

All the time periods you mention can be composed with SampleRule. For example, a week would be SampleRule::Day(7).
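
To make the intraday bucket boundaries from the question concrete, here is a minimal std-only Rust sketch that floors a time of day to the start of its period. It is not how Polars implements SampleRule; `floor_to_bucket` and `hms` are hypothetical helpers written only for this illustration, and the week and month periods are left out because they need calendar logic rather than a fixed width.

```rust
/// Floor a time (seconds since midnight) to the start of its bucket.
/// Hypothetical helper for illustration, not a Polars function.
fn floor_to_bucket(secs: i64, bucket_secs: i64) -> i64 {
    secs - secs.rem_euclid(bucket_secs)
}

/// Render seconds since midnight as HH:MM:SS.
fn hms(secs: i64) -> String {
    format!("{:02}:{:02}:{:02}", secs / 3600, (secs / 60) % 60, secs % 60)
}

fn main() {
    // The tick from the question: 10:43:15 on 2021-03-03.
    let tick: i64 = 10 * 3600 + 43 * 60 + 15;

    // Fixed-width periods, expressed in seconds.
    let periods: [(&str, i64); 7] = [
        ("1min", 60),
        ("5min", 5 * 60),
        ("15min", 15 * 60),
        ("30min", 30 * 60),
        ("1hour", 3600),
        ("2hour", 2 * 3600),
        ("4hour", 4 * 3600),
    ];

    for &(name, width) in periods.iter() {
        println!("{:>6}: {}", name, hms(floor_to_bucket(tick, width)));
    }
}
```

Because every width here divides a day evenly, flooring the time of day gives the same boundaries as flooring a full epoch timestamp in UTC.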

rts-gordon commented 3 years ago

Thanks @ritchie46, I will test SampleRule.

rts-gordon commented 3 years ago

Hi @ritchie46,

I have a CSV file like this:

AUDCAD,20201001 23:58:49.724418,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.780350,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.826159,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.860344,0.95449,0.95476,1
AUDCAD,20201001 23:58:50.163641,0.95449,0.9547,1
AUDCAD,20201001 23:58:50.186391,0.95447,0.95469,10
AUDCAD,20201001 23:58:50.238856,0.95449,0.95472,1

When I use CsvReader to read this file, how should I define the datetime string column in the schema? I tried Time64 and Date64, but neither works. Thanks a lot.

fn get_schema() -> Schema {
    Schema::new(vec![
        Field::new("s", DataType::Utf8),
        //Field::new("u", DataType::Utf8),
        //Field::new("u", DataType::Time64(TimeUnit::Millisecond)),
        Field::new("u", DataType::Date64),
        Field::new("c", DataType::Float64),
        Field::new("a", DataType::Float64),
        Field::new("v", DataType::UInt64),
    ])
}

pub async fn example() -> PolarResult<DataFrame> {
    let schema = get_schema();

    let df = CsvReader::from_path("./data/20201001.csv")?
        .with_schema(Arc::new(schema))
        .has_header(false)
        .finish()?;
    debug!("df ==== {:?}", df);

    let res = df.downsample("datetime", SampleRule::Minute(1))?
        .agg(&[("c", &["first", "max", "min", "last"])])?
        .sort("datetime", false);
    debug!("res === {:?}", res);

    res
}

There are some errors:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Other("Unsupported data type Date64 when reading a csv")', C:\Users\Gordon\.cargo\registry\src\github.com-1ecc6299db9ec823\polars-io-0.12.1\src\csv_core\csv.rs:168:90

ritchie46 commented 3 years ago

Yes, at the moment you first have to read the Date64 fields as Utf8. Afterwards you can cast them to Date64 with the format (fmt) you need.
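
As a rough illustration of what such a format has to capture for the timestamps in the CSV above, here is a std-only Rust sketch that parses one of them by hand into milliseconds since the Unix epoch, which is the unit Date64 stores. The layout `YYYYMMDD HH:MM:SS.ffffff` is an assumption about the data, and `parse_tick_ms` and `days_from_civil` are hypothetical helpers, not Polars functions. Field ranges are not validated.

```rust
/// Days since 1970-01-01 for a Gregorian date
/// (Howard Hinnant's `days_from_civil` algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parse "YYYYMMDD HH:MM:SS.ffffff" (assumed layout) into milliseconds
/// since the Unix epoch, truncating the microseconds.
fn parse_tick_ms(s: &str) -> Option<i64> {
    let (date, time) = s.split_once(' ')?;
    if date.len() != 8 || !date.is_ascii() {
        return None;
    }
    let year: i64 = date[0..4].parse().ok()?;
    let month: i64 = date[4..6].parse().ok()?;
    let day: i64 = date[6..8].parse().ok()?;

    let (clock, frac) = time.split_once('.')?;
    let mut parts = clock.split(':');
    let hour: i64 = parts.next()?.parse().ok()?;
    let minute: i64 = parts.next()?.parse().ok()?;
    let second: i64 = parts.next()?.parse().ok()?;
    if frac.len() != 6 {
        return None;
    }
    let micros: i64 = frac.parse().ok()?;

    let secs = days_from_civil(year, month, day) * 86_400 + hour * 3600 + minute * 60 + second;
    Some(secs * 1000 + micros / 1000)
}

fn main() {
    let raw = "20201001 23:58:49.724418";
    println!("{} -> {:?}", raw, parse_tick_ms(raw));
}
```

Under that assumed layout, the sample value is read as 2020-10-01 23:58:49.724 with the microseconds truncated to milliseconds.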

rts-gordon commented 3 years ago

Hi @ritchie46, I tried to use Utf8 for column 'u' (the datetime string), but the following code doesn't work. Can you please give me an example? Thank you.

    let res = df.downsample("datetime", SampleRule::Minute(1))?
        .agg(&[("c", &["first", "max", "min", "last"])])?
        .sort("datetime", false);

ritchie46 commented 3 years ago

Hi, I would like to, but I really do not understand how to parse your date column. :confused:

What date is 20201001 23:58:49.724418 supposed to be?

ritchie46 commented 3 years ago

I have assumed a parsing fmt for convenience.

This is an OHLC downsample to seconds. I've used seconds here because that gives a slightly more interesting result.

use polars::frame::resample::SampleRule;
use polars::prelude::*;
use std::io::Cursor;

fn get_schema() -> Schema {
    Schema::new(vec![
        Field::new("s", DataType::Utf8),
        Field::new("u", DataType::Utf8),
        Field::new("c", DataType::Float64),
        Field::new("a", DataType::Float64),
        Field::new("v", DataType::UInt64),
    ])
}

fn run() -> Result<DataFrame> {
    let data = r#"AUDCAD,20201001 23:58:49.724418,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.780350,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.826159,0.9545,0.95476,1
AUDCAD,20201001 23:58:49.860344,0.95449,0.95476,1
AUDCAD,20201001 23:58:50.163641,0.95449,0.9547,1
AUDCAD,20201001 23:58:50.186391,0.95447,0.95469,10
AUDCAD,20201001 23:58:50.238856,0.95449,0.95472,1
"#;
    let file = Cursor::new(data);

    let schema = get_schema();

    let mut df = CsvReader::new(file)
        .with_schema(Arc::new(schema))
        .has_header(false)
        .finish()?;

    // cast column 'u' from utf8 to date64
    // parse fmt for datetime
    let cast_fmt = Some("%Y%m%d %H:%M:%S%.6f");
    df.may_apply("u", |s| s.utf8()?.as_date64(cast_fmt))?;

    dbg!(&df);

    let res = df
        .downsample("u", SampleRule::Second(1))?
        .agg(&[("c", &["first", "max", "min", "last"])])?
        .sort("u", false)?;

    dbg!(&res);

    Ok(res)
}

pub fn main() {
    run().expect("failed");
}

This outputs:

[src/main.rs:38] &df = shape: (7, 5)
╭──────────┬─────────────────────────┬───────┬───────┬─────╮
│ s        ┆ u                       ┆ c     ┆ a     ┆ v   │
│ ---      ┆ ---                     ┆ ---   ┆ ---   ┆ --- │
│ str      ┆ date64(ms)              ┆ f64   ┆ f64   ┆ u64 │
╞══════════╪═════════════════════════╪═══════╪═══════╪═════╡
│ "AUDCAD" ┆ 2020-10-01 23:58:49.724 ┆ 0.955 ┆ 0.955 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:49.780 ┆ 0.955 ┆ 0.955 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:49.826 ┆ 0.955 ┆ 0.955 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:49.860 ┆ 0.954 ┆ 0.955 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:50.163 ┆ 0.954 ┆ 0.955 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:50.186 ┆ 0.954 ┆ 0.955 ┆ 10  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ "AUDCAD" ┆ 2020-10-01 23:58:50.238 ┆ 0.954 ┆ 0.955 ┆ 1   │
╰──────────┴─────────────────────────┴───────┴───────┴─────╯
[src/main.rs:45] &res = shape: (2, 5)
╭─────────────────────┬─────────┬───────┬───────┬────────╮
│ u                   ┆ c_first ┆ c_max ┆ c_min ┆ c_last │
│ ---                 ┆ ---     ┆ ---   ┆ ---   ┆ ---    │
│ date64(ms)          ┆ f64     ┆ f64   ┆ f64   ┆ f64    │
╞═════════════════════╪═════════╪═══════╪═══════╪════════╡
│ 2020-10-01 23:58:49 ┆ 0.955   ┆ 0.955 ┆ 0.954 ┆ 0.954  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-10-01 23:58:50 ┆ 0.954   ┆ 0.954 ┆ 0.954 ┆ 0.954  │
╰─────────────────────┴─────────┴───────┴───────┴────────╯
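
For readers who want to cross-check the two result rows without Polars, the same first/max/min/last aggregation can be sketched in plain Rust. This is only an illustration under stated assumptions: `Ohlc`, `ohlc`, `sample_ticks`, and `BASE_MS` are hypothetical names, the timestamps are the seven ticks above expressed as milliseconds since the Unix epoch, and the ticks are assumed to arrive in time order.

```rust
use std::collections::BTreeMap;

/// One bar: first, max, min, and last price seen in a bucket.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ohlc {
    open: f64,
    high: f64,
    low: f64,
    close: f64,
}

/// 2020-10-01 23:58:49.000 UTC in milliseconds since the Unix epoch.
const BASE_MS: i64 = 1_601_596_729_000;

/// The (timestamp_ms, "c" column) pairs from the CSV sample.
fn sample_ticks() -> Vec<(i64, f64)> {
    vec![
        (BASE_MS + 724, 0.9545),
        (BASE_MS + 780, 0.9545),
        (BASE_MS + 826, 0.9545),
        (BASE_MS + 860, 0.95449),
        (BASE_MS + 1163, 0.95449),
        (BASE_MS + 1186, 0.95447),
        (BASE_MS + 1238, 0.95449),
    ]
}

/// Group time-ordered ticks into fixed-width buckets keyed by bucket start.
fn ohlc(ticks: &[(i64, f64)], bucket_ms: i64) -> BTreeMap<i64, Ohlc> {
    let mut bars: BTreeMap<i64, Ohlc> = BTreeMap::new();
    for &(ts, price) in ticks {
        let start = ts - ts.rem_euclid(bucket_ms);
        bars.entry(start)
            .and_modify(|bar| {
                bar.high = bar.high.max(price);
                bar.low = bar.low.min(price);
                bar.close = price; // last tick seen so far
            })
            .or_insert(Ohlc { open: price, high: price, low: price, close: price });
    }
    bars
}

fn main() {
    // One-second buckets, matching SampleRule::Second(1) in the example.
    for (start, bar) in ohlc(&sample_ticks(), 1_000) {
        println!("{} {:?}", start, bar);
    }
}
```

The table above rounds prices to three decimals, so the unrounded values here (for example a low of 0.95447 in the second bucket) appear there as 0.954.
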
rts-gordon commented 3 years ago

It works. Many thanks, @ritchie46. I will study your code carefully. Polars is a very powerful project, and I need more time to learn it.

ritchie46 commented 3 years ago

Great. If you've got any more questions let me know.