tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License
1.61k stars 155 forks source link

Error when reading a `#VALUE!` cell #317

Open lucatrv opened 1 year ago

lucatrv commented 1 year ago

I am using a sample Excel file with 1 at position (0, 0) and #VALUE! at position (1, 0).

Considering the following sample code:

use anyhow::{anyhow, Context, Result};
use calamine::{open_workbook, DataType, RangeDeserializerBuilder, Reader, Xlsx};

fn main() -> Result<()> {
    let mut workbook: Xlsx<_> = open_workbook("excel.xlsx")?;
    let range = workbook
        .worksheet_range_at(0)
        .expect("workbook should contain at least one worksheet")?;
    let mut iter = RangeDeserializerBuilder::new()
        .has_headers(false)
        .from_range(&range)?;
    if let Some(result) = iter.next() {
        let (value1, value2): (DataType, DataType) = result.context("failed to read values")?;
        println!("{:?} {:?}", value1, value2);
        Ok(())
    } else {
        Err(anyhow!("expected at least one record but got none"))
    }
}

I would expect to get value1 as DataType::Float and value1 as DataType::Error. Instead the code exits early at result.context("failed to read value")? with the following error:

Error: failed to read values

Caused by:
    Cell error at position '(1, 0)': #VALUE!

I tried both with xlsx and xls formats, the result is the same.

tafia commented 1 year ago

There is an automatic conversion from DataType::Error to DeError::CellError. It seems like a good default.

Are you sure you want to deserialize the ranges? If you're working with DataType, why not reading the cells directly?

for row in range.rows() {
    if let &[value1, value2] = &row {
        // ...
    }
}
lucatrv commented 1 year ago

Are you sure you want to deserialize the ranges? If you're working with DataType, why not reading the cells directly?

That's because in my real code I'm skipping some columns, so I'm using RangeDeserializerBuilder::with_headers. Moreover I'm deserializing some columns to String, while two of them may be either a date a string or a #VALUE!, so I'm deserializing them to DataType to match them later. Since DataType has an Error variant, I was expecting to get it for #VALUE! cells.

Would you suggest me to use range.rows() for my use case, or otherwise Serde's deserialize_with field attribute? Otherwise if you agree with my approach, there should not be automatic conversion from DataType::Error to DeError::CellError.

tafia commented 1 year ago

I would suggest rows, you won't get any much perf improvements by skipping a column because data is encoded by row anyway.

For the deserialization behavior, I suppose it could be an option of the RangeDeserializer. In general I don't think people deserialize to DataType and as a result failing on #VALUE makes sense.

lucatrv commented 1 year ago

I will change my code to rows(), thanks for your advice.

However IMHO if one deserializes to anything but DataType, then it is ok to fail on #VALUE, but if one chooses to deserialize to DataType (maybe only one column among many), then he expects to get the full range of DataType variants, including Error. Therefore IMHO an option is not needed, and the default behavior should be changed only when deserializing to DataType.