tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License
1.61k stars 155 forks source link

Improve date format parsing #303

Closed dimastbk closed 1 year ago

dimastbk commented 1 year ago

Improve date format parsing, added in #198. This logic ported from openpyxl, which is under MIT License. It also passes almost all tests from SheetJS, except one.

If you merge #299 and #296 before this PR, I can refactor this PR to support XLSX/XLSB.

tafia commented 1 year ago

A variant more imperative but probably faster (read the characters only once)

fn is_datetime(s: &str) -> bool {
    let mut escaped = false;
    let mut is_quote = false;
    let mut brackets = 0u8;
    let mut prev = ' ';
    let mut hms = false;
    for s in s.chars() {
        match (s, escaped, is_quote, brackets) {
            (_, true, ..) => escaped = false, // if escaped, ignore
            ('_' | '\\', ..) => escaped = true,
            ('"', _, true, _) => is_quote = false,
            (_, _, true, _) => (),
            ('"', _, _, _) => is_quote = true,
            (';', ..) => return false, // first format only
            ('[', ..) => brackets += 1,
            (']', .., 1) if hms => return true, // if closing
            (']', ..) => brackets = brackets.saturating_sub(1),
            ('d' | 'm' | 'h' | 'y' | 's' | 'D' | 'M' | 'H' | 'Y' | 'S', _, _, 0) => return true,
            _ => {
                if hms && s.eq_ignore_ascii_case(&prev) {
                    // ok ...
                } else {
                    hms = prev == '[' && matches!(s, 'm' | 'h' | 's' | 'M' | 'H' | 'S');
                }
            }
        }
        prev = s;
    }
    false
}
dimastbk commented 1 year ago

Thanks for review! Sorry for regex :(

test bench_regex  ... bench:      11,279 ns/iter (+/- 186)
test bench_first  ... bench:       3,873 ns/iter (+/- 51)
test bench_second ... bench:       1,357 ns/iter (+/- 32)

I have run all proposals with test_is_date_format (with small fix for second - I have added detecting of string with A/P (AM/PM) only). Should I replace is_custom_date_format with code above? This implementation passes all tests from sheetjs, except one (same that original).

pub fn is_custom_date_format(format: &str) -> bool {
    let mut escaped = false;
    let mut is_quote = false;
    let mut brackets = 0u8;
    let mut prev = ' ';
    let mut hms = false;
    let mut ap = false;
    for s in format.chars() {
        match (s, escaped, is_quote, ap, brackets) {
            (_, true, ..) => escaped = false, // if escaped, ignore
            ('_' | '\\', ..) => escaped = true,
            ('"', _, true, _, _) => is_quote = false,
            (_, _, true, _, _) => (),
            ('"', _, _, _, _) => is_quote = true,
            (';', ..) => return false, // first format only
            ('[', ..) => brackets += 1,
            (']', .., 1) if hms => return true, // if closing
            (']', ..) => brackets = brackets.saturating_sub(1),
            ('a' | 'A', _, _, false, 0) => ap = true,
            ('p' | 'm' | '/' | 'P' | 'M', _, _, true, 0) => return true,
            ('d' | 'm' | 'h' | 'y' | 's' | 'D' | 'M' | 'H' | 'Y' | 'S', _, _, false, 0) => {
                return true
            }
            _ => {
                if hms && s.eq_ignore_ascii_case(&prev) {
                    // ok ...
                } else {
                    hms = prev == '[' && matches!(s, 'm' | 'h' | 's' | 'M' | 'H' | 'S');
                }
            }
        }
        prev = s;
    }
    false
}
tafia commented 1 year ago

Yes please go ahead. Thanks a lot!

dimastbk commented 1 year ago

I updated this pr.

this module should be conditional to dates feature (#[cfg(feature = "dates")])

I think, it doesn't make sense with the new implementation.

tafia commented 1 year ago

Thanks

tafia commented 1 year ago

Thanks

fzumstein commented 1 year ago

Maybe worthwhile pointing out that this PR broke handling of date overflows (e.g., 1e+20 formatted as Date). For a test file: https://github.com/pandas-dev/pandas/blob/main/pandas/tests/io/data/excel/testdateoverflow.xlsx

dimastbk commented 1 year ago

Before this PR:

Range {
    start: (0, 0),
    end: (3, 1),
    inner: [
        String("DateColWithBigInt"),
        String("StringCol"),
        Float(42441.0),
        String("Marc Johnson"),
        Float(42445.0),
        String("Jack Black"),
        Float(1e20),
        String("Timothy Brown"),
    ],
}

After this PR:

Range {
    start: (0, 0),
    end: (3, 1),
    inner: [
        String("DateColWithBigInt"),
        String("StringCol"),
        DateTime(42441.0),
        String("Marc Johnson"),
        DateTime(42445.0),
        String("Jack Black"),
        DateTime(1e20),
        String("Timothy Brown"),
    ],
}

Hm, maybe add the ability to extract float/string value from datetime/duration, like below?

    /// Try getting float value
    pub fn get_float(&self) -> Option<f64> {
        if let DataType::Float(v) | DataType::DateTime(v) = self {
            Some(*v)
        } else {
            None
        }
    }
...
    /// Try getting string value
    pub fn get_string(&self) -> Option<&str> {
        if let DataType::String(v) | DataType::DateTimeIso(v) | DataType::DurationIso(v) = self {
            Some(&**v)
        } else {
            None
        }
    }
fzumstein commented 1 year ago

Right, thanks for pointing out the real change! in this case, I think all is good!