vincentlaucsb / csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.
MIT License

Improve reading performance #193

Closed cla93 closed 2 years ago

cla93 commented 2 years ago

Hello, I did not know where to ask this, so I hope it is not a problem to post it here.

Looking at the performance on the README of this parser I see 2.1 seconds for parsing 1.4GB of CSV data.

When I try it on a 750MB CSV with 9 columns and 1'118'481 rows, it takes almost 30 seconds to parse.

Here is the code I use:

CSVReader reader(path_stops);
vector<StopTimes_struct> stops_times;
for (auto it = reader.begin(); it != reader.end(); ++it) {
    StopTimes_struct tmp_stop_times;
    tmp_stop_times.trip_id = (*it)["trip_id"].get();
    tmp_stop_times.stop_id = (*it)["stop_id"].get();
    tmp_stop_times.arrival_time = (*it)["arrival_time"].get();
    tmp_stop_times.departure_time = (*it)["departure_time"].get();
    tmp_stop_times.stop_sequence = (*it)["stop_sequence"].get<int>();
    stops_times.emplace_back(tmp_stop_times);
}

I temporarily removed the part that stores the values in the struct and the vector, but there was no improvement.

So my question is: am I doing something wrong? Or is it because almost every field I read is a string and only one is a number?

Thank you

vincentlaucsb commented 2 years ago

You're using get(), which constructs a std::string for every field. That is an expensive operation.


get<int>() also forces the parser to parse the string to determine whether it is an integer.
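To make the two costs above concrete, here is a minimal standard-library sketch. It does not use csv-parser: `first_field_view`, `first_field_copy`, and `parse_int` are hypothetical helpers written only for illustration. It contrasts a non-owning `std::string_view` (no copy) with an owning `std::string` (copy, and possibly a heap allocation), and shows that integer conversion has to scan the characters of the field. As far as I recall, the library's README also lists `get<csv::string_view>()` as a non-owning alternative to `get()`; treat that as an assumption and check it against the version you use.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: view of the first comma-separated field of `line`.
// No characters are copied; the view points into the caller's buffer,
// so it is only valid while that buffer is alive.
std::string_view first_field_view(std::string_view line) {
    return line.substr(0, line.find(','));
}

// Same field as an owning std::string: the characters are copied, and
// longer values also require a heap allocation.
std::string first_field_copy(std::string_view line) {
    return std::string(first_field_view(line));
}

// Parse the whole field as an int without building a temporary string.
// Returns std::nullopt if the field is empty or contains non-digit text.
std::optional<int> parse_int(std::string_view field) {
    int value = 0;
    const char* first = field.data();
    const char* last = first + field.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last) {
        return std::nullopt;
    }
    return value;
}
```

In a loop over about a million rows with four string fields each, the copying variant performs millions of string constructions, while the view variant performs none. The trade-off is lifetime: a view must not outlive the buffer it points into, so values that need to be kept after the row is gone still have to be copied at some point.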