Closed pxp9 closed 1 year ago
The error is the key. You're providing an integer, not a string as the macro expects.
Ok , I already what is the error, I am using csv
crate to parse a CSV file that has this form
concept_id_1 concept_id_2 relationship_id valid_start_date valid_end_date invalid_reason
3667577 40642538 Has status 20220128 20991231 no_reason
3667577 35631990 Has Module 20220128 20991231
46001351 40642538 Has status 20220128 20991231
The main problem is that the date is interpreted as integer by the csv
crate.
Maybe I need to capture that as String and use parse function in order to get a Date.
Thank you very much, @jhpratt
The csv
crate is correctly interpreting that as a number. If you want it to be a string, it needs to be quoted.
@jhpratt, I have done a trick, I have Deserialize the value as a String. So csv
crate read it as integer but convert that to a String.
#[derive(Debug, Deserialize)]
struct Record {
concept_id_1: i32,
concept_id_2: i32,
relationship_id: String,
valid_start_date: String,
valid_end_date: String,
invalid_reason: Option<String>,
}
and after that
I can convert it to a Date.
let concept = Concept {
concept_id_1: record.concept_id_1,
concept_id_2: record.concept_id_2,
relationship_id: record.relationship_id,
valid_start_date: Date::parse(&record.valid_start_date, FORMAT).unwrap(),
valid_end_date: Date::parse(&record.valid_end_date, FORMAT).unwrap(),
invalid_reason: record.invalid_reason,
};
Let me know if you know a better way to do this.
I need to be this really fast because it has to parse 46 million data.
Thank you @jhpratt
I'm not super familiar with csv
as I've only used it a few times myself. But if that code works, then why not use it? :slightly_smiling_face:
That works fine, but I am not sure if can be faster enough in order to not die waiting for parsing 46 M records, maybe should I take other stategy ?
Have you tried running it? 46 million isn't that much in the grand scheme of things. There will naturally be some overhead to validate the data, but that's pretty much it.
Have you tried running it? 46 million isn't that much in the grand scheme of things. There will naturally be some overhead to validate the data, but that's pretty much it
I am running it, I think it will take 30 minutes to parse and print it on standard input, I will tell you when finishes.
Probably print it on standard input takes more time than actually parsing.
Are you using release mode? And perhaps try writing to a file instead of stdout — that will be quicker, as your terminal won't have to display you everything.
Are you using release mode? And perhaps try writing to a file instead of stdout — that will be quicker, as your terminal won't have to display you everything.
I dont want to write a File because original CSV file weights 2 GBs.
but thank you for the advice of the release mode, I think it will do some optimizations that will benefit performance
Just what I have estimated 27 minutes, pretty good :)
Yeah, if you're ever testing for speed, use release mode. Optimizations are huge.
Yeah, if you're ever testing for speed, use release mode. Optimizations are huge.
Tested with optimizations, I have gain 10 minutes.
(27 60+56) / (16 60 + 32 ) = 1.6895 , which is a 68.9 % speed increase.
@jhpratt thank you for your knowledge.
Dear maintainers,
first at all, this crate is really cool.
My doubt is about the usage of
format_description!
macro because I do not know how to parse this kind of dates20220128
to get the year=2022 month=01 day=28.
The main issue I have found is when you define a format like this one
Takes all the numbers in year variable as you can see in the following error:
(I think is doing this: year=20220128 )
The following format gives also the same error beacuse it takes every number in year variable.
I have read the docs , but I did not find nothing about specifying there is no whitespaces or characters between components.
Could someone help me please ?
Thank you in advance.