Closed dimastbk closed 1 year ago
A variant more imperative but probably faster (read the characters only once)
fn is_datetime(s: &str) -> bool {
let mut escaped = false;
let mut is_quote = false;
let mut brackets = 0u8;
let mut prev = ' ';
let mut hms = false;
for s in s.chars() {
match (s, escaped, is_quote, brackets) {
(_, true, ..) => escaped = false, // if escaped, ignore
('_' | '\\', ..) => escaped = true,
('"', _, true, _) => is_quote = false,
(_, _, true, _) => (),
('"', _, _, _) => is_quote = true,
(';', ..) => return false, // first format only
('[', ..) => brackets += 1,
(']', .., 1) if hms => return true, // if closing
(']', ..) => brackets = brackets.saturating_sub(1),
('d' | 'm' | 'h' | 'y' | 's' | 'D' | 'M' | 'H' | 'Y' | 'S', _, _, 0) => return true,
_ => {
if hms && s.eq_ignore_ascii_case(&prev) {
// ok ...
} else {
hms = prev == '[' && matches!(s, 'm' | 'h' | 's' | 'M' | 'H' | 'S');
}
}
}
prev = s;
}
false
}
Thanks for review! Sorry for regex :(
test bench_regex ... bench: 11,279 ns/iter (+/- 186)
test bench_first ... bench: 3,873 ns/iter (+/- 51)
test bench_second ... bench: 1,357 ns/iter (+/- 32)
I have run all proposals with test_is_date_format
(with small fix for second - I have added detecting of string with A/P (AM/PM) only). Should I replace is_custom_date_format
with code above? This implementation passes all tests from sheetjs, except one (same that original).
pub fn is_custom_date_format(format: &str) -> bool {
let mut escaped = false;
let mut is_quote = false;
let mut brackets = 0u8;
let mut prev = ' ';
let mut hms = false;
let mut ap = false;
for s in format.chars() {
match (s, escaped, is_quote, ap, brackets) {
(_, true, ..) => escaped = false, // if escaped, ignore
('_' | '\\', ..) => escaped = true,
('"', _, true, _, _) => is_quote = false,
(_, _, true, _, _) => (),
('"', _, _, _, _) => is_quote = true,
(';', ..) => return false, // first format only
('[', ..) => brackets += 1,
(']', .., 1) if hms => return true, // if closing
(']', ..) => brackets = brackets.saturating_sub(1),
('a' | 'A', _, _, false, 0) => ap = true,
('p' | 'm' | '/' | 'P' | 'M', _, _, true, 0) => return true,
('d' | 'm' | 'h' | 'y' | 's' | 'D' | 'M' | 'H' | 'Y' | 'S', _, _, false, 0) => {
return true
}
_ => {
if hms && s.eq_ignore_ascii_case(&prev) {
// ok ...
} else {
hms = prev == '[' && matches!(s, 'm' | 'h' | 's' | 'M' | 'H' | 'S');
}
}
}
prev = s;
}
false
}
Yes please go ahead. Thanks a lot!
I updated this pr.
this module should be conditional to dates feature (#[cfg(feature = "dates")])
I think, it doesn't make sense with the new implementation.
Thanks
Thanks
Maybe worthwhile pointing out that this PR broke handling of date overflows (e.g., 1e+20 formatted as Date). For a test file: https://github.com/pandas-dev/pandas/blob/main/pandas/tests/io/data/excel/testdateoverflow.xlsx
Before this PR:
Range {
start: (0, 0),
end: (3, 1),
inner: [
String("DateColWithBigInt"),
String("StringCol"),
Float(42441.0),
String("Marc Johnson"),
Float(42445.0),
String("Jack Black"),
Float(1e20),
String("Timothy Brown"),
],
}
After this PR:
Range {
start: (0, 0),
end: (3, 1),
inner: [
String("DateColWithBigInt"),
String("StringCol"),
DateTime(42441.0),
String("Marc Johnson"),
DateTime(42445.0),
String("Jack Black"),
DateTime(1e20),
String("Timothy Brown"),
],
}
Hm, maybe add the ability to extract float/string value from datetime/duration, like below?
/// Try getting float value
pub fn get_float(&self) -> Option<f64> {
if let DataType::Float(v) | DataType::DateTime(v) = self {
Some(*v)
} else {
None
}
}
...
/// Try getting string value
pub fn get_string(&self) -> Option<&str> {
if let DataType::String(v) | DataType::DateTimeIso(v) | DataType::DurationIso(v) = self {
Some(&**v)
} else {
None
}
}
Right, thanks for pointing out the real change! in this case, I think all is good!
Improve date format parsing, added in #198. This logic ported from openpyxl, which is under MIT License. It also passes almost all tests from SheetJS, except one.
If you merge #299 and #296 before this PR, I can refactor this PR to support XLSX/XLSB.