Hi Tristram,
Thanks for your continued efforts to maintain this crate!
Your approach does appear to work here: the change properly parallelizes `gtfs_reader::read_objs`. It takes the runtime from ±55 s to <40 s on my machine. Profiling confirms that it does what it's supposed to: during the time frame selected in the screenshot, `gtfs_reader::read_objs` is executing and is utilizing all cores.
Consider now this screenshot, where the next part of the runtime is selected:
During this time frame, execution is finishing up `gtfs_reader::read_objs`. (Not visible in the screenshot, but) most of the time here is spent in the `csv` crate's record `trim` methods and in calls to `free`. It might be possible to speed things up here, but I don't see any low-hanging fruit.
Finally, consider this screenshot of the last part of execution:
During this time frame, execution has progressed past the point where the 'raw' structure is constructed (and thus past the point where `gtfs_reader::read_objs` executes at all). Essentially, the latter half of `RawGtfs::new(gtfs).and_then(Gtfs::try_from)` is now executing. Speeding up `gtfs_reader::read_objs` won't help at all. Perhaps you know a way to improve `Gtfs::try_from`?
Out of curiosity, what are you using for your profiling/visualizing?
Thank you for your feedback and your analysis, which is quite right. I made more careful measurements on a server instead of my laptop to avoid interference from other apps, and I get consistent performance improvements, as you noticed.
I still managed to gain another 3-4% by rewriting `parse_time` (it's a bit ugly, and I got sidetracked a lot on that). I will now focus on `try_from`. It seems that the problem comes from building the trips. Maybe using binary trees instead of hash maps could help.
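To give an idea of the kind of hand-rolled parsing involved, here is a minimal sketch (not the crate's actual code, and names are mine) that turns a GTFS `HH:MM:SS` time into seconds without allocating intermediate strings:

```rust
/// Sketch only: parse a GTFS "HH:MM:SS" time (hours may exceed 24) into seconds.
fn parse_time(s: &str) -> Option<u32> {
    let bytes = s.as_bytes();
    // Locate the first ':' so hour fields longer than two digits still work.
    let first_colon = bytes.iter().position(|&b| b == b':')?;
    let (hours, rest) = bytes.split_at(first_colon);
    let rest = rest.get(1..)?; // skip the ':'
    if rest.len() != 5 || rest[2] != b':' {
        return None;
    }
    // Fold ASCII digits into a number; bail out on anything else.
    let to_num = |digits: &[u8]| {
        digits.iter().try_fold(0u32, |acc, &d| {
            d.is_ascii_digit().then(|| acc * 10 + u32::from(d - b'0'))
        })
    };
    Some(to_num(hours)? * 3600 + to_num(&rest[..2])? * 60 + to_num(&rest[3..])?)
}

// e.g. parse_time("26:05:00") == Some(93_900)
```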
Reading duration in ms, 10 runs each:
IDFM 69Mb
* better parsing 9632 9633 9664 9627 9687 9608 9645 9855 9689 9714
* parallel 9958 9922 10001 9930 9985 9998 9993 10023 9999 9977
* original 13166 13262 13209 13083 13136 13079 13166 13170 13134 13124
NL 225Mb
* better parsing 18949 18771 18770 18838 18746 18900 18789 18782 18688 18812
* parallel 19589 19636 19574 19526 19670 19526 19558 19573 19717 19595
* original 24450 24651 24493 24972 24562 24536 24433 24436 24573 24599
PID 33Mb
* better parsing 3293 3268 3278 3284 3308 3287 3258 3267 3293 3274
* parallel 3371 3378 3386 3372 3367 3361 3347 3360 3377 3386
* original 4213 4226 4195 4186 4203 4190 4214 4182 4215 4201
For the record, the following didn't work out: using `group_by` to iterate over the stops, to avoid re-looking up for every `Trip`.

> Out of curiosity, what are you using for your profiling/visualizing?
Out of habit, I'm using the Instruments.app profiler that comes with Xcode on macOS. In essence, a GUI for DTrace.
Looking at it a bit more (with higher-frequency profiling instead of the default time interval), I think trimming in general is a huge cost in the current version. Switching `.trim(csv::Trim::Fields)` to `.trim(csv::Trim::None)` in the body of `read_objs` does indeed bring a significant performance increase... albeit, of course, at the cost of different behavior.
Still, since it makes that much of a difference in the runtime, I checked how the `csv` crate implements trimming, and it turns out there is a real opportunity for improvement there: currently, trimming a `ByteRecord` always allocates a new `ByteRecord`, even if there is nothing to trim (and trimming probably wouldn't require reallocating either). A PR to amend this problem is stuck on what appears to be a spurious CI failure in the `csv` crate's repo.
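To make the idea concrete, here is a rough sketch of the check-before-allocating approach (written against the public `StringRecord` API for readability; this is not the `csv` crate's code nor the PR's):

```rust
use csv::StringRecord;

/// Sketch only: rebuild the record only when some field actually needs trimming.
fn trim_if_needed(rec: StringRecord) -> StringRecord {
    // Fast path: nothing to trim, keep the existing allocation untouched.
    if rec.iter().all(|f| f.trim() == f) {
        return rec;
    }
    let mut trimmed = StringRecord::with_capacity(rec.as_slice().len(), rec.len());
    for field in rec.iter() {
        trimmed.push_field(field.trim());
    }
    trimmed
}
```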
Nice digging!
I did try it with `csv = { git = "https://github.com/do4gr/rust-csv.git", branch = "non_allocating_trim" }` and I gained about 20% on the GTFS from the Netherlands (from 18.8 down to 15.5 seconds).
I left a message on the PR, maybe that will get things going.
I tried to make progress on `create_trips`, but I can't find a way to make it quicker. It's quite frustrating :disappointed:
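For context, that step essentially attaches every stop time to its trip, roughly like this (hypothetical, simplified types, not the crate's actual code):

```rust
use std::collections::HashMap;

// Hypothetical minimal shapes, just to show the grouping step.
struct RawStopTime { trip_id: String, stop_sequence: u16 }
struct Trip { id: String, stop_times: Vec<RawStopTime> }

/// One hash map lookup per stop time, then a single sort per trip.
fn create_trips(trip_ids: Vec<String>, stop_times: Vec<RawStopTime>) -> HashMap<String, Trip> {
    let mut trips: HashMap<String, Trip> = trip_ids
        .into_iter()
        .map(|id| (id.clone(), Trip { id, stop_times: Vec::new() }))
        .collect();
    for st in stop_times {
        if let Some(trip) = trips.get_mut(&st.trip_id) {
            trip.stop_times.push(st);
        }
    }
    for trip in trips.values_mut() {
        trip.stop_times.sort_by_key(|st| st.stop_sequence);
    }
    trips
}
```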
As an alternative approach (borderline large enough to be a PR instead of a comment):
If you replace the following lines in `read_objs`:
```rust
let recs = reader
    .records()
    .map(|rec| {
        rec.map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: None,
        })
    })
    .collect::<Result<Vec<_>, _>>()?;
recs.par_iter()
    .map(|rec| {
        rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: Some(crate::error::LineError {
                headers: headers.into_iter().map(|s| s.to_owned()).collect(),
                values: rec.into_iter().map(|s| s.to_owned()).collect(),
            }),
        })
    })
    .collect()
```
by the following:
```rust
// Pre-allocate a StringRecord
let mut rec = csv::StringRecord::new();
let mut objs = Vec::new();
// Read each record into the pre-allocated StringRecord one at a time
while reader
    .read_record(&mut rec)
    .map_err(|e| Error::CSVError {
        file_name: file_name.to_owned(),
        source: e,
        line_in_error: None,
    })?
{
    let obj = rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
        file_name: file_name.to_owned(),
        source: e,
        line_in_error: Some(crate::error::LineError {
            headers: headers.into_iter().map(|s| s.to_owned()).collect(),
            values: rec.into_iter().map(|s| s.to_owned()).collect(),
        }),
    })?;
    objs.push(obj);
}
Ok(objs)
```
then you reuse a single `csv::StringRecord` for the whole file instead of collecting every record first, which avoids one allocation per row. The trade-off is that it makes it slightly harder to add parallelism again at a later point.
I also feel the frustration on the `create_trips` side of things. Lots going on there. And your current design seems solid.
Nice, however in my case I observed that the performance is indeed better than the original, but not as good as with rayon. So I might prefer to stick with rayon, even if pulling in a new dependency for such a small change is probably overkill.
Speaking of new dependencies, I toyed with `smol_str` and `smartstring`, since ids are generally quite small. There is a very small improvement (~1%). Memory usage should also be down, but I haven't measured it yet.
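Roughly the kind of change I have in mind (hypothetical type, not the crate's actual fields):

```rust
use smol_str::SmolStr;

// Hypothetical illustration: storing ids inline instead of heap-allocating a
// String for each one.
#[derive(Clone, Debug)]
struct StopId(SmolStr);

fn main() {
    // Strings up to 23 bytes are stored inline, so cloning a short id never
    // touches the allocator.
    let id = StopId(SmolStr::new("3290742"));
    let copy = id.clone();
    assert_eq!(copy.0.as_str(), "3290742");
}
```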
@antoine-de would you like to have a look at it?
> the performance is indeed better than the original, but not as good as with rayon.
Okay. Do you have the infrastructure set up to compare memory footprints? I would expect that the parallel version (which at some point has all `StringRecord`s in memory at the same time) has a significantly larger memory footprint than the sequential version that only has one `StringRecord` at a time. Of course, that memory use isn't going to be orders of magnitude larger than the memory use of the `RawGtfs` structures that are built from the records, so perhaps it doesn't matter in practice.
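In case it helps, one low-effort way to get a peak-heap figure is a counting global allocator wrapped around the system allocator, along these lines (a sketch, not anything in the crate):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Track live and peak heap usage so the rayon and sequential readers can be compared.
static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
        PEAK.fetch_max(live, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

fn main() {
    // ... read the GTFS here ...
    println!("peak heap: {} bytes", PEAK.load(Ordering::Relaxed));
}
```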
Optimizing for small strings might be a good idea too. Won't that change be visible in the public API, though?
You are right. I got a bit lost in the rabbit hole, chasing every minimal performance improvement.
I started from a clean sheet, using your suggestion and being a bit less aggressive on #114, without a new dependency or extra allocation. It's already better.
I added an option to skip trimming entirely when you are confident in your data.
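Usage would look something like this (the names here are my guess; the real option lives in the PR):

```rust
// Hypothetical usage sketch; the actual reader option comes from the PR.
use gtfs_structures::GtfsReader;

fn main() -> Result<(), gtfs_structures::Error> {
    let gtfs = GtfsReader::default()
        .trim_fields(false) // skip trimming entirely for feeds you trust
        .read("path/to/gtfs.zip")?;
    println!("{} trips", gtfs.trips.len());
    Ok(())
}
```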
Completely different topic: would you like to be added as a maintainer? There are only two of us working on it, and lately without being paid for it, so an extra set of eyes would be helpful.
Closing as it was partially solved in the other PR
Relates to #111. This doesn't really work: the clock time is around the same. Copying the whole data is probably way too expensive.