Hi Tristram,
Thanks for your continued efforts to maintain this crate!
Your approach does appear to work here: the change properly parallelizes `gtfs_reader::read_objs`. It takes the runtime from ±55 s to <40 s on my machine. Profiling confirms that it does what it's supposed to: during the time frame selected in the screenshot, `gtfs_reader::read_objs` is executing and is utilizing all cores.
Consider now this screenshot, where the next part of the runtime is selected:
During this time frame, execution is finishing up `gtfs_reader::read_objs`. (Not visible in the screenshot, but) most of the time here is spent in the `csv` crate's record `trim` methods and in calls to `free`. It might be possible to speed things up here, but I don't see any low-hanging fruit.
Finally, consider this screenshot of the last part of execution:
During this time frame, execution has progressed past the point where the 'raw' structure is constructed (and thus past the point where `gtfs_reader::read_objs` executes at all). Essentially, the latter half of `RawGtfs::new(gtfs).and_then(Gtfs::try_from)` is now executing. Speeding up `gtfs_reader::read_objs` won't help at all. Perhaps you know a way to improve `Gtfs::try_from`?
Out of curiosity, what are you using for your profiling/visualizing?
Thank you for your feedback and your analysis, which is quite right. I made more careful measurements on a server instead of my laptop to avoid interference from other apps, and I get consistent performance improvements, as you noticed.
I still managed to gain another 3-4% by rewriting `parse_time` (it's a bit ugly, and I got sidetracked a lot on that). I will now focus on `try_from`. It seems that the problem comes from building the trips. Maybe using binary trees instead of hash maps could help.
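To give an idea of the kind of hand-rolled parsing involved, here is a minimal sketch (not the crate's actual code, and names are mine) that turns a GTFS `HH:MM:SS` time into seconds without allocating intermediate strings:

```rust
/// Sketch only: parse a GTFS "HH:MM:SS" time (hours may exceed 24) into seconds.
fn parse_time(s: &str) -> Option<u32> {
    let bytes = s.as_bytes();
    // Locate the first ':' so hour fields longer than two digits still work.
    let first_colon = bytes.iter().position(|&b| b == b':')?;
    let (hours, rest) = bytes.split_at(first_colon);
    let rest = rest.get(1..)?; // skip the ':'
    if rest.len() != 5 || rest[2] != b':' {
        return None;
    }
    // Fold ASCII digits into a number; bail out on anything else.
    let to_num = |digits: &[u8]| {
        digits.iter().try_fold(0u32, |acc, &d| {
            d.is_ascii_digit().then(|| acc * 10 + u32::from(d - b'0'))
        })
    };
    Some(to_num(hours)? * 3600 + to_num(&rest[..2])? * 60 + to_num(&rest[3..])?)
}

// e.g. parse_time("26:05:00") == Some(93_900)
```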
Reading duration in ms, 10 runs each:
IDFM 69Mb
* better parsing 9632 9633 9664 9627 9687 9608 9645 9855 9689 9714
* parallel 9958 9922 10001 9930 9985 9998 9993 10023 9999 9977
* original 13166 13262 13209 13083 13136 13079 13166 13170 13134 13124
NL 225Mb
* better parsing 18949 18771 18770 18838 18746 18900 18789 18782 18688 18812
* parallel 19589 19636 19574 19526 19670 19526 19558 19573 19717 19595
* original 24450 24651 24493 24972 24562 24536 24433 24436 24573 24599
PID 33Mb
* better parsing 3293 3268 3278 3284 3308 3287 3258 3267 3293 3274
* parallel 3371 3378 3386 3372 3367 3361 3347 3360 3377 3386
* original 4213 4226 4195 4186 4203 4190 4214 4182 4215 4201
For the record, the following didn't work out: using `group_by` to iterate over the stops, to avoid re-looking up for every `Trip`.

> Out of curiosity, what are you using for your profiling/visualizing?
Out of habit, I'm using the Instruments.app profiler that comes with Xcode on macOS. In essence, a GUI for DTrace.
Looking at it a bit more (with higher-frequency profiling instead of the default time interval), I think trimming in general is a huge cost in the current version. Switching `.trim(csv::Trim::Fields)` to `.trim(csv::Trim::None)` in the body of `read_objs` does indeed bring a significant performance increase... albeit, of course, at the cost of different behavior.
Still, since it makes that much of a difference in the runtime, I checked how the `csv` crate implements trimming, and it turns out there is a real opportunity for improvement there: currently, trimming a `ByteRecord` always allocates a new `ByteRecord`, even if there is nothing to trim (and trimming probably wouldn't require reallocating either). A PR to amend this problem is stuck on what appears to be a spurious CI failure in the `csv` crate's repo.
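To make the idea concrete, here is a rough sketch of the check-before-allocating approach (written against the public `StringRecord` API for readability; this is not the `csv` crate's code nor the PR's):

```rust
use csv::StringRecord;

/// Sketch only: rebuild the record only when some field actually needs trimming.
fn trim_if_needed(rec: StringRecord) -> StringRecord {
    // Fast path: nothing to trim, keep the existing allocation untouched.
    if rec.iter().all(|f| f.trim() == f) {
        return rec;
    }
    let mut trimmed = StringRecord::with_capacity(rec.as_slice().len(), rec.len());
    for field in rec.iter() {
        trimmed.push_field(field.trim());
    }
    trimmed
}
```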
Nice digging!
I did try it with `csv = { git = "https://github.com/do4gr/rust-csv.git", branch = "non_allocating_trim" }` and I gained about 20% on the GTFS from the Netherlands (from 18.8 down to 15.5 seconds).
I left a message on the PR, maybe that will get things going.
I tried to make progress on `create_trips`, but I can't find a way to make it quicker. It's quite frustrating :disappointed:
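For context, that step essentially attaches every stop time to its trip, roughly like this (hypothetical, simplified types, not the crate's actual code):

```rust
use std::collections::HashMap;

// Hypothetical minimal shapes, just to show the grouping step.
struct RawStopTime { trip_id: String, stop_sequence: u16 }
struct Trip { id: String, stop_times: Vec<RawStopTime> }

/// One hash map lookup per stop time, then a single sort per trip.
fn create_trips(trip_ids: Vec<String>, stop_times: Vec<RawStopTime>) -> HashMap<String, Trip> {
    let mut trips: HashMap<String, Trip> = trip_ids
        .into_iter()
        .map(|id| (id.clone(), Trip { id, stop_times: Vec::new() }))
        .collect();
    for st in stop_times {
        if let Some(trip) = trips.get_mut(&st.trip_id) {
            trip.stop_times.push(st);
        }
    }
    for trip in trips.values_mut() {
        trip.stop_times.sort_by_key(|st| st.stop_sequence);
    }
    trips
}
```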
As an alternative approach (borderline large enough to be a PR instead of a comment):
If you replace the following lines in `read_objs`:
```rust
let recs = reader
    .records()
    .map(|rec| {
        rec.map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: None,
        })
    })
    .collect::<Result<Vec<_>, _>>()?;
recs.par_iter()
    .map(|rec| {
        rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: Some(crate::error::LineError {
                headers: headers.into_iter().map(|s| s.to_owned()).collect(),
                values: rec.into_iter().map(|s| s.to_owned()).collect(),
            }),
        })
    })
    .collect()
```
by the following:
```rust
// Pre-allocate a StringRecord
let mut rec = csv::StringRecord::new();
let mut objs = Vec::new();
// Read each record into the pre-allocated StringRecord one at a time
while reader
    .read_record(&mut rec)
    .map_err(|e| Error::CSVError {
        file_name: file_name.to_owned(),
        source: e,
        line_in_error: None,
    })?
{
    let obj = rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
        file_name: file_name.to_owned(),
        source: e,
        line_in_error: Some(crate::error::LineError {
            headers: headers.into_iter().map(|s| s.to_owned()).collect(),
            values: rec.into_iter().map(|s| s.to_owned()).collect(),
        }),
    })?;
    objs.push(obj);
}
Ok(objs)
```
then you reuse a single `csv::StringRecord` for the whole file instead of collecting every record first, which avoids one allocation per row. The trade-off is that it makes it slightly harder to add parallelism again at a later point.
I also feel the frustration on the `create_trips` side of things. Lots going on there. And your current design seems solid.
Nice, however in my case I observed that the performance is indeed better than the original, but not as good as with rayon. So I might prefer to stick with rayon, even if pulling in a new dependency for such a small change is probably overkill.
Speaking of new dependencies, I toyed with `smol_str` and `smartstring`, since ids are generally quite small. There is a very small improvement (~1%). Memory usage should also be down, but I haven't measured it yet.
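Roughly the kind of change I have in mind (hypothetical type, not the crate's actual fields):

```rust
use smol_str::SmolStr;

// Hypothetical illustration: storing ids inline instead of heap-allocating a
// String for each one.
#[derive(Clone, Debug)]
struct StopId(SmolStr);

fn main() {
    // Strings up to 23 bytes are stored inline, so cloning a short id never
    // touches the allocator.
    let id = StopId(SmolStr::new("3290742"));
    let copy = id.clone();
    assert_eq!(copy.0.as_str(), "3290742");
}
```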
@antoine-de would you like to have a look at it?
> the performance is indeed better than the original, but not as good as with rayon.
Okay. Do you have the infrastructure set up to compare memory footprints? I would expect that the parallel version (which at some point has all `StringRecord`s in memory at the same time) has a significantly larger memory footprint than the sequential version that only has one `StringRecord` at a time. Of course, that memory use isn't going to be orders of magnitude larger than the memory use of the `RawGtfs` structures that are built from the records, so perhaps it doesn't matter in practice.
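In case it helps, one low-effort way to get a peak-heap figure is a counting global allocator wrapped around the system allocator, along these lines (a sketch, not anything in the crate):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Track live and peak heap usage so the rayon and sequential readers can be compared.
static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
        PEAK.fetch_max(live, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

fn main() {
    // ... read the GTFS here ...
    println!("peak heap: {} bytes", PEAK.load(Ordering::Relaxed));
}
```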
Optimizing for small strings might be a good idea too. Won't that change be visible in the public API, though?
You are right. I got a bit lost in the rabbit hole, chasing every minimal performance improvement.
I started from a clean sheet, using your suggestion and being a bit less aggressive on #114, without a new dependency or extra allocation. It's already better.
I added an option to skip trimming entirely when you are confident in your data.
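Usage would look something like this (the names here are my guess; the real option lives in the PR):

```rust
// Hypothetical usage sketch; the actual reader option comes from the PR.
use gtfs_structures::GtfsReader;

fn main() -> Result<(), gtfs_structures::Error> {
    let gtfs = GtfsReader::default()
        .trim_fields(false) // skip trimming entirely for feeds you trust
        .read("path/to/gtfs.zip")?;
    println!("{} trips", gtfs.trips.len());
    Ok(())
}
```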
Completely different topic: would you like to be added as a maintainer? There are only two of us working on it, and lately without being paid for it, so an extra set of eyes would be helpful.
Closing as it was partially solved in the other PR
Relates to #111. This doesn't really work: the clock time is around the same. Copying the whole data is probably way too expensive.