jmcnamara / polars_excel_writer

A Polars extension to serialize dataframes to Excel xlsx files
Apache License 2.0
13 stars 4 forks source link

feature request: <map to supported integers> #4

Closed claudiofsr closed 8 months ago

claudiofsr commented 1 year ago

Feature Request

Good morning!!!

I Try to use

fn write_xlsx(df: &DataFrame) -> PolarsResult<()> {
    let mut xlsx_writer = PolarsXlsxWriter::new();

    xlsx_writer.write_dataframe(df)?;
    xlsx_writer.save("df_result.xlsx")?;

    Ok(())
}

And I get this Error:

Error: ComputeError(ErrString("Polars AnyValue data type 'i64' is not supported by Excel"))

Is it possible to map 'i64' and all unsupported integers to 'u32' supported by Excel?

jmcnamara commented 1 year ago

I can probably add a i32::try_from() and then only panic/warn if the conversation fails. I’ll look into it.

claudiofsr commented 1 year ago

I try this:

fn write_xlsx(df: &DataFrame, sheet_name: &str) -> Result<(), Box<dyn Error>> {
    let mut xlsx_writer = PolarsXlsxWriter::new();

    let df_formatted: DataFrame = to_supported_integers(df)?;

    xlsx_writer.write_dataframe(&df_formatted)?;
    xlsx_writer.set_worksheet_name(sheet_name)?;
    xlsx_writer.save("df_output.xlsx")?;

    Ok(())
}

/// Format 'u64' and 'i64' integers to 'u32' supported by Excel
fn to_supported_integers(dataframe: &DataFrame) -> Result<DataFrame, Box<dyn Error>> {
    Ok(
        dataframe
            .clone()
            .lazy()
            .with_columns([
                all()
                .apply(format_to_u32, GetOutput::same_type())
            ])
            .collect()?
    )
}

fn format_to_u32(series: Series) -> Result<Option<Series>, PolarsError> {
    if unsupported_integer(series.dtype()) {
        Ok(Some(series.cast(&DataType::UInt32)?))
    } else {
        Ok(Some(series))
    }
}

fn unsupported_integer(t: &DataType) -> bool {
    matches!(t, DataType::UInt64 | DataType::Int64)
}
jmcnamara commented 1 year ago

I try this:

That looks like a good workaround. For my information how did you end up with i/u64 in your dataframe? Did you create it it from CSV or similar. I've seen some Polars integer interpretations to i/u64 by default.

claudiofsr commented 1 year ago

I open two csv files to cross-reference data. join_with_assignments

I believe that by default Polars reads integers like Int64. But I could have initially changed Int64 to I32! I hadn't planned on the possibility of copying the DataFrame as an xlsx file.

As we know, Polars and parquet do not have the limits of Excel.

claudiofsr commented 12 months ago

So far I have added the truncate_series function.

use polars_excel_writer::PolarsXlsxWriter;
use rust_xlsxwriter::{Workbook, Format, FormatAlign};

pub fn write_xlsx(df: &DataFrame, basename: &str, sheet_name: &str) -> PolarsResult<()> {

    let mut filepath = PathBuf::from(basename);
    filepath.set_extension("xlsx");
    println!("Write DataFrame to {filepath:?}\n");

    let df_to_excel: DataFrame = format_to_excel(&df)?;

    let header_fmt: Format = Format::new()
        .set_align(FormatAlign::Center) // horizontally
        .set_align(FormatAlign::VerticalCenter)
        .set_text_wrap();
        //.set_font_size(FONT_SIZE);

    // Create a new workbook and worksheet using `rust_xlsxwriter`.
    let mut workbook = Workbook::new();
    let worksheet = workbook
        .add_worksheet()
        .set_name(sheet_name)?
        .set_row_height(0, 60)?
        .set_row_format(0, &header_fmt)?
        .set_freeze_panes(1, 0)?;

    // Write the dataframe to the worksheet using `PolarsXlsxWriter`.
    let mut xlsx_writer = PolarsXlsxWriter::new();
    xlsx_writer.write_dataframe_to_worksheet(&df_to_excel, worksheet, 0, 0)?;

    // Save the file to disk.
    workbook.save(filepath)?;

    Ok(())
}

/// Format data supported by Excel
fn format_to_excel(data_frame: &DataFrame) -> Result<DataFrame, PolarsError> {
    let df_formated: DataFrame = data_frame
        .clone()
        .lazy()
        .with_columns([
            all()
            .apply(format_data, GetOutput::same_type())
        ])
        .collect()?;

    Ok(df_formated)
}

/// Format DataType
fn format_data(series: Series) -> Result<Option<Series>, PolarsError> {
    match series.dtype() {
        DataType::Int64  => Ok(Some(series.cast(&DataType::Int32)?)),
        DataType::UInt64 => Ok(Some(series.cast(&DataType::UInt32)?)),
        DataType::Utf8   => truncate_series(series), // to_n_chars(series)
        _ => Ok(Some(series)),
    }
}

fn truncate_series(series: Series) -> Result<Option<Series>, PolarsError> {
    let new_series: Series = series
        .utf8()?
        .into_iter()
        .map(|option_str: Option<&str>|
            option_str.map(|s| truncate_string(s, 32767)) // 2 ^ 15 - 1
        )
        .collect::<Utf8Chunked>()
        .into_series();

    Ok(Some(new_series))
}

// https://stackoverflow.com/questions/38461429/how-can-i-truncate-a-string-to-have-at-most-n-characters
fn truncate_string(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((idx, _)) => &s[..idx],
        None => s,
    }
}
jmcnamara commented 8 months ago

I've dded support for writing u64 and i64 number within Excel's limitations. This implies a loss of precision outside Excel's integer range of +/- 999,999,999,999,999 (15 digits).

It is available in v0.5.0.