time-rs / time

The most used Rust library for date and time handling.
https://time-rs.github.io
Apache License 2.0
1.06k stars 261 forks source link

Doubt about format_description macro #584

Closed pxp9 closed 1 year ago

pxp9 commented 1 year ago

Dear maintainers,

first at all, this crate is really cool.

My doubt is about the usage of format_description! macro because I do not know how to parse this kind of dates

20220128

to get the year=2022 month=01 day=28.

The main issue I have found is when you define a format like this one

serde::format_description!(version = 2 , my_format, Date, "[year][month][day]");

Takes all the numbers in year variable as you can see in the following error:

(I think is doing this: year=20220128 )

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Deserialize { pos: Some(Position { byte: 89, line: 2, record: 1 }), err: DeserializeError { field: None, kind: Message("invalid type: integer `20220128`, expected a(n) `Date` in the format \"[year][month][day]\"") } })', src/main.rs:27:37
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The following format gives also the same error beacuse it takes every number in year variable.

serde::format_description!(version = 2 , my_format, Date, "[year]");

I have read the docs , but I did not find nothing about specifying there is no whitespaces or characters between components.

Could someone help me please ?

Thank you in advance.

jhpratt commented 1 year ago

The error is the key. You're providing an integer, not a string as the macro expects.

pxp9 commented 1 year ago

Ok , I already what is the error, I am using csv crate to parse a CSV file that has this form

concept_id_1 concept_id_2 relationship_id valid_start_date valid_end_date invalid_reason 3667577 40642538 Has status 20220128 20991231 no_reason 3667577 35631990 Has Module 20220128 20991231
46001351 40642538 Has status 20220128 20991231

The main problem is that the date is interpreted as integer by the csv crate.

Maybe I need to capture that as String and use parse function in order to get a Date.

Thank you very much, @jhpratt

jhpratt commented 1 year ago

The csv crate is correctly interpreting that as a number. If you want it to be a string, it needs to be quoted.

pxp9 commented 1 year ago

@jhpratt, I have done a trick, I have Deserialize the value as a String. So csv crate read it as integer but convert that to a String.


#[derive(Debug, Deserialize)]
struct Record {
    concept_id_1: i32,
    concept_id_2: i32,
    relationship_id: String,
    valid_start_date: String,
    valid_end_date: String,
    invalid_reason: Option<String>,
}

and after that

I can convert it to a Date.


let concept = Concept {
            concept_id_1: record.concept_id_1,
            concept_id_2: record.concept_id_2,
            relationship_id: record.relationship_id,
            valid_start_date: Date::parse(&record.valid_start_date, FORMAT).unwrap(),
            valid_end_date: Date::parse(&record.valid_end_date, FORMAT).unwrap(),
            invalid_reason: record.invalid_reason,
        };

Let me know if you know a better way to do this.

I need to be this really fast because it has to parse 46 million data.

Thank you @jhpratt

jhpratt commented 1 year ago

I'm not super familiar with csv as I've only used it a few times myself. But if that code works, then why not use it? :slightly_smiling_face:

pxp9 commented 1 year ago

That works fine, but I am not sure if can be faster enough in order to not die waiting for parsing 46 M records, maybe should I take other stategy ?

jhpratt commented 1 year ago

Have you tried running it? 46 million isn't that much in the grand scheme of things. There will naturally be some overhead to validate the data, but that's pretty much it.

pxp9 commented 1 year ago

Have you tried running it? 46 million isn't that much in the grand scheme of things. There will naturally be some overhead to validate the data, but that's pretty much it

I am running it, I think it will take 30 minutes to parse and print it on standard input, I will tell you when finishes.

Probably print it on standard input takes more time than actually parsing.

jhpratt commented 1 year ago

Are you using release mode? And perhaps try writing to a file instead of stdout — that will be quicker, as your terminal won't have to display you everything.

pxp9 commented 1 year ago

Are you using release mode? And perhaps try writing to a file instead of stdout — that will be quicker, as your terminal won't have to display you everything.

I dont want to write a File because original CSV file weights 2 GBs.

but thank you for the advice of the release mode, I think it will do some optimizations that will benefit performance

pxp9 commented 1 year ago

image

Just what I have estimated 27 minutes, pretty good :)

jhpratt commented 1 year ago

Yeah, if you're ever testing for speed, use release mode. Optimizations are huge.

pxp9 commented 1 year ago

Yeah, if you're ever testing for speed, use release mode. Optimizations are huge.

image

Tested with optimizations, I have gain 10 minutes.

(27 60+56) / (16 60 + 32 ) = 1.6895 , which is a 68.9 % speed increase.

@jhpratt thank you for your knowledge.