jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0
316 stars 25 forks source link

Example: write Polars Dataframe to Excel #39

Closed jmcnamara closed 1 year ago

jmcnamara commented 1 year ago

Sample code

Here is a simple example of writing a Polars rust dataframe to Excel using rust_xlsxwriter >= 0.40.0:

screenshot

And here is the code:

use chrono::prelude::*;
use polars::export::arrow::temporal_conversions::*;
use polars::prelude::*;
use rust_xlsxwriter::{Format, Table, TableColumn, Workbook, XlsxError};

fn main() {
    let df: DataFrame = df!(
        "String" => &["North", "South", "East", "West", "All"],
        "Integer" => &[1, 2, 3, 4, 5],
        "Datetime" => &[
            NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(1, 0, 0).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 2).unwrap().and_hms_opt(2, 0, 0).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(3, 0, 0).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 4).unwrap().and_hms_opt(4, 0, 0).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 5).unwrap().and_hms_opt(5, 0, 0).unwrap(),
        ],
        "Date" => &[
            NaiveDate::from_ymd_opt(2022, 1, 1).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 2).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 3).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 4).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 5).unwrap(),
        ],
        "Time" => &[
            NaiveTime::from_hms_milli_opt(2, 59, 3, 456).unwrap(),
            NaiveTime::from_hms_milli_opt(2, 59, 3, 456).unwrap(),
            NaiveTime::from_hms_milli_opt(2, 59, 3, 456).unwrap(),
            NaiveTime::from_hms_milli_opt(2, 59, 3, 456).unwrap(),
            NaiveTime::from_hms_milli_opt(2, 59, 3, 456).unwrap(),
        ],
        "Float" => &[4.0, 5.0, 6.0, 7.0, 8.0],
    )
    .expect("should not fail");

    write_xlsx(df, "dataframe.xlsx").unwrap();
}

// Simplified Polars dataframe to Excel file converter. See the actual
// `write_xlsx()` method in Polars for a much more complete and flexible
// interface.
//
// https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_excel.html#polars.DataFrame.write_excel
//
fn write_xlsx(df: DataFrame, filename: &str) -> Result<(), XlsxError> {
    // Create a new Excel file object.
    let mut workbook = Workbook::new();
    let mut headers = vec![];

    // Create some formats for the dataframe.
    let datetime_format = Format::new().set_num_format("yyyy\\-mm\\-dd\\ hh:mm:ss");
    let date_format = Format::new().set_num_format("yyyy\\-mm\\-dd;@");
    let time_format = Format::new().set_num_format("hh:mm:ss;@");

    // Add a worksheet to the workbook.
    let worksheet = workbook.add_worksheet();

    // Iterate through the dataframe column by column.
    for (col_num, column) in df.get_columns().iter().enumerate() {
        let col_num = col_num as u16;

        // Store the column names for use as table headers.
        headers.push(column.name().to_string());

        // Write the row data for each column/type.
        for (row_num, data) in column.iter().enumerate() {
            let row_num = 1 + row_num as u32;

            // Map the Polars Series AnyValue types to Excel/rust_xlsxwriter
            // types.
            match data {
                AnyValue::Int8(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::UInt8(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::Int16(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::UInt16(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::Int32(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::UInt32(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::Float32(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::Float64(value) => {
                    worksheet.write_number(row_num, col_num, value)?;
                }
                AnyValue::Utf8(value) => {
                    worksheet.write_string(row_num, col_num, value)?;
                }
                AnyValue::Boolean(value) => {
                    worksheet.write_boolean(row_num, col_num, value)?;
                }
                AnyValue::Null => {
                    // Treat Null as blank for now.
                }
                AnyValue::Datetime(value, time_units, _) => {
                    let datetime = match time_units {
                        TimeUnit::Nanoseconds => timestamp_ns_to_datetime(value),
                        TimeUnit::Microseconds => timestamp_us_to_datetime(value),
                        TimeUnit::Milliseconds => timestamp_ms_to_datetime(value),
                    };
                    worksheet.write_datetime(row_num, col_num, &datetime, &datetime_format)?;
                    worksheet.set_column_width(col_num, 18)?;
                }
                AnyValue::Date(value) => {
                    let date = date32_to_date(value);
                    worksheet.write_date(row_num, col_num, &date, &date_format)?;
                    worksheet.set_column_width(col_num, 10)?;
                }
                AnyValue::Time(value) => {
                    let time = time64ns_to_time(value);
                    worksheet.write_time(row_num, col_num, &time, &time_format)?;
                }
                _ => {
                    println!(
                        "WARNING: AnyValue data type '{}' is not supported",
                        data.dtype()
                    );
                    break;
                }
            }
        }
    }

    // Create a table for the dataframe range.
    let (max_row, max_col) = df.shape();
    let mut table = Table::new();
    let columns: Vec<TableColumn> = headers
        .into_iter()
        .map(|x| TableColumn::new().set_header(x))
        .collect();
    table.set_columns(&columns);

    // Add the table to the worksheet.
    worksheet.add_table(0, 0, max_row as u32, max_col as u16 - 1, &table)?;

    // Autofit the columns.
    worksheet.autofit();

    // Save the file to disk.
    workbook.save(filename)?;

    Ok(())
}

The Cargo.toml file for this is:

[package]
name = "dataframe_test"
version = "0.1.0"
edition = "2021"

[dependencies]
chrono = "0.4.24"
polars = { version = "0.29.0", features = ["lazy"] }
rust_xlsxwriter = "0.40.0"

Which was set up as follows:

cargo add polars -F lazy
cargo add chrono
cargo add rust_xlsxwriter

Note, this is for demonstration/testing purposes only. Polars already has a really nice and very configurable dataframe.write_excel() API that uses the Python based XlsxWriter. I even wrote some docs on Working with Polars and XlsxWriter.

jmcnamara commented 1 year ago

For comparison I used the following Polars/Pandas write_xlsx benchmark from @alexander-beedie at https://github.com/pola-rs/polars/issues/5568#issuecomment-1526316286

from codetiming import Timer
from datetime import date
import polars as pl

# quickly spin-up a 1,000,000 element DataFrame
df = pl.DataFrame({
    "idx": range(250_000),
    "x": 123.456789, 
    "y": date.today(), 
    "z":"testing,1.2.3.4."}
)

# export to Excel from polars
with Timer():
    df.write_excel( "dataframe_pl.xlsx" )

# export to Excel from pandas
pf = df.to_pandas()
with Timer():
    pf.to_excel( "dataframe_pd.xlsx", index=False )

And simulated it in Rust as close as possible with this:

use chrono::prelude::*;
use polars::export::arrow::temporal_conversions::*;
use polars::prelude::*;
use rust_xlsxwriter::{Format, FormatBorder, Workbook, XlsxError};

use std::time::Instant;

const DATA_SIZE: usize = 250_000;

fn main() {
    let df: DataFrame = df!(
        "idx" => &[1; DATA_SIZE],
        "x" => &[123.456789; DATA_SIZE],
        "y" => &[NaiveDate::from_ymd_opt(2023, 1, 1).unwrap(); DATA_SIZE],
        "z" => &["testing,1.2.3.4."; DATA_SIZE],
    )
    .expect("should not fail");

    let before = Instant::now();
    write_excel(df, "dataframe.xlsx").unwrap();
    println!("Elapsed time: {:.2?}", before.elapsed());
}

// `write_excel()` function same as above.
// fn write_excel(df: DataFrame, filename: &str) -> Result<(), XlsxError> {...}

My test machine is slower than the other benchmark machine but these are the results I got:

library time taken (s)
polars 8.6
pandas 15.2
rust_xlsxwriter 1.2
rust_xlsxwriter + zlib 0.98

This probably isn't a completely fair comparison since Polars is doing more checking/work but it is a reasonable ballpark indicator. Also, I'll restate my note from above that the dataframe.write_xlsx() API is much more fully featured and you should continue to use that. The code above is for example purposes only.

jmcnamara commented 1 year ago

This method is no longer required.

I've uploaded a new crate called polars_excel_writer for serializing Polars dataframes into Excel Xlsx files using rust_xlsxwriter as a backend engine.

It provides two interfaces for writing a dataframe to an Excel Xlsx file:

alexander-beedie commented 1 year ago

Was thinking about this again the other day; it occurs to me that it might be something we can integrate into the Polars CLI, where we are a little less sensitive about dependencies and it could make for a useful addition 🤔

jmcnamara commented 1 year ago

@alexander-beedie Thanks for considering this, that would be great. I still need to do some work to add more configurability and hopefully get some more real world testing. I'll circle back in a month or so when it is a bit further on.

leonkosak commented 6 months ago

This method is no longer required.

I've uploaded a new crate called polars_excel_writer for serializing Polars dataframes into Excel Xlsx files using rust_xlsxwriter as a backend engine.

It provides two interfaces for writing a dataframe to an Excel Xlsx file:

  • ExcelWriter a simple Excel serializer that implements the Polars SerWriter trait to write a dataframe to an Excel Xlsx file. This is similar to the CsvWriter interface.
  • PolarsXlsxWriter a more configurable Excel serializer that more closely resembles the interface options provided by the Polars Python write_excel() dataframe method. There is still work in progress for this interface.

@jmcnamara and @alexander-beedie, how to use polars_excel_writer with polars.DataFrame.write_excel function in Python? I have some very large exports to xlsx with 200k+ rows and cca 290 columns, which produces ~90MB xlsx file in pandas xlsx export (exporting time is greater than 10 minutes). I noticed that just fetching data takes 80-85 seconds in pandas v1.5.3. With polars v0.20.10 it takes only 23-25 seconds, which is amazing. However, using polars.DataFrame.write_excel function is not much faster - around 20-25%, but output file is much larger (around 190MB). Based on this time measurements, I think that speedup limitation is "original" XlsxWriter. I also do not understand why xlsx file produced by polars is so much larger than pandas, even if the same engine (XlsxWriter) is used. Thank you for explanations.

jmcnamara commented 6 months ago

I also do not understand why xlsx file produced by polars is so much larger than pandas, even if the same engine (XlsxWriter) is used.

The Polars implementation uses some additional cell formats for floats and integers that the Pandas implementation doesn't. This adds an additional 6 byte xml attribute per-cell of data written which for a large dataframe can turn into a larger file size. This in turn probably adds a (probably small) performance overhead in compression and zipping the file into the target xlsx file.

However, the increase in size and the extra compression overhead would probably be unnoticed/insignificant for most users. The counter advantage is that the Excel output from Polars is more formatted and consistent.

leonkosak commented 6 months ago

Thank you @jmcnamara for your great explanation regarding xlsx file size! 👍

What about polars_excel_writer (using rust_xlsxwriter) implementation of original XlsxWriter? How can I replace engine implementation, which should be used inside polars.DataFrame.write_excel in Python? Currently, I don't need any specific functionality for exports from database do xlsx, but I would like to get additional enormous performance benefit while creating xlsx file (information provided here).

jmcnamara commented 6 months ago

How can I replace engine implementation, which should be used inside polars.DataFrame.write_excel in Python?

There isn't any way to do that currently. That would required a Python wrapper around rust_xlsxwriter.

leonkosak commented 6 months ago

Thank you @jmcnamara? As far as I understand, there should also be support in polars.DataFrame.write_excel to define other engines (like in pandas.to_excel() method)?

jmcnamara commented 6 months ago

As far as I understand, there should also be support in polars.DataFrame.write_excel to define other engines (like in pandas.to_excel() method)?

No. As far as I know that won’t be part of the interface.