legion-labs / legion

Legion monorepo: contains the Legion engine, the tools, source control, the build system, etc.
https://book.legionengine.com

investigate deltalake #1755

Open mad-legion opened 2 years ago

mad-legion commented 2 years ago

https://crates.io/crates/deltalake

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

mad-legion commented 2 years ago

```
In [1]: import deltalake
   ...: import pandas as pd
   ...:
   ...: path_to_table = "d:/temp/delta/mytable"
   ...:
   ...: # First write creates the table.
   ...: df = pd.DataFrame({'x': [1, 2, 3]})
   ...: deltalake.writer.write_deltalake(path_to_table, df)
   ...:
   ...: # Second write appends to the existing table.
   ...: df = pd.DataFrame({'x': [4, 5, 6]})
   ...: deltalake.writer.write_deltalake(path_to_table, df, mode='append')
   ...:
   ...: dt = deltalake.DeltaTable(path_to_table)
   ...:

In [2]: df = dt.to_pandas()

In [3]: df
Out[3]:
   x
0  1
1  2
2  3
3  4
4  5
5  6
```
mad-legion commented 2 years ago

This generated two Parquet files, one per write, alongside the `_delta_log` directory:

```
05/24/2022  09:12 AM    <DIR>          .
05/24/2022  09:12 AM    <DIR>          ..
05/24/2022  09:12 AM             1,652 0-11e8473d-ac00-426a-b149-dce5445597bb-0.parquet
05/24/2022  09:12 AM             1,652 1-d06d5dbd-c0be-475c-b19a-641c59e4cc93-0.parquet
05/24/2022  09:12 AM    <DIR>          _delta_log
```
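
The `_delta_log` directory is what ties the Parquet files together. Per the Delta protocol linked above, each commit is a JSON file named by its zero-padded version number, with one action per line (`add`, `remove`, `metaData`, and so on), and the current table is whatever survives replaying those actions in order. Here is a minimal sketch of that replay using only the Python standard library. The function name `active_files` is hypothetical, and the sketch deliberately ignores checkpoints and every action other than `add` and `remove`:

```python
import json
from pathlib import Path


def active_files(table_path):
    """Replay _delta_log commits in order and return the data files still in the table.

    Simplified on purpose: checkpoint files are ignored and only the
    `add` and `remove` actions are interpreted.
    """
    log_dir = Path(table_path) / "_delta_log"
    files = set()
    # Commit files are named by zero-padded version
    # (e.g. 00000000000000000000.json), so a lexicographic sort
    # is also a version sort.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)
```

For the table written above, `active_files("d:/temp/delta/mytable")` would be expected to list both Parquet files, since each of the two writes commits one `add` action.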
mad-legion commented 2 years ago
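```rust
use std::sync::Arc;

use anyhow::Result; // assumption: the original `Result` alias is not shown
use datafusion::prelude::SessionContext;
// `TelemetryGuard` comes from the Legion telemetry crates (import not shown in the original).

#[tokio::test] // assumption: the async test needs a runtime attribute such as tokio's
async fn test_lakehouse_query() -> Result<()> {
    // Hold the telemetry guard for the duration of the test.
    let _telemetry_guard = TelemetryGuard::default().unwrap();
    let table_path = "d:/temp/cache/tables/3F5F22FF-445B-2156-96F6-3F8CA984968E/spans";
    // Open the Delta table and register it with DataFusion so it can be queried with SQL.
    let table = deltalake::open_table(table_path).await?;
    let ctx = SessionContext::new();
    ctx.register_table("spans", Arc::new(table))?;
    let batches = ctx
        .sql("SELECT count(*) FROM spans WHERE begin_ms > 5000")
        .await?
        .collect()
        .await?;
    dbg!(batches);
    Ok(())
}
```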

The table directory contains 1874 files totaling 4.5 GB. The test executes in 0.53 seconds, which is not bad (the query returns 149341791).
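
To put those figures in perspective, here is a back-of-the-envelope calculation. It assumes the 4.5 figure means decimal gigabytes, and the variable names are illustrative only:

```python
files = 1874
total_bytes = 4.5e9          # assumption: 4.5 decimal gigabytes
elapsed_s = 0.53
matching_rows = 149_341_791  # rows with begin_ms > 5000, per the query result

avg_file_mb = total_bytes / files / 1e6
effective_gb_per_s = total_bytes / elapsed_s / 1e9
matching_rows_per_s = matching_rows / elapsed_s

print(f"average file size:    {avg_file_mb:.1f} MB")
print(f"effective scan rate:  {effective_gb_per_s:.1f} GB/s")
print(f"matching rows/second: {matching_rows_per_s / 1e6:.1f} million")
```

That works out to roughly 2.4 MB per file and an effective rate of about 8.5 GB/s over the whole directory. Because Parquet is columnar, the query presumably only had to touch the `begin_ms` column, so the bytes actually read were likely well below the 4.5 GB total.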