jamespfennell closed this issue 1 year ago
Optimization 2 seems really easy: https://github.com/jamespfennell/transiter/commit/e947f724a08729bcf9b3fb5362e7a0b8794b4656!
With this commit the `InsertServiceMapStop` portion of the flamegraph becomes tiny. Going to merge that commit.
I did a first pass at the stop times too in https://github.com/jamespfennell/transiter/commit/119c65170627c3eabca213035c3455a624702f37. The results here look good, though not amazing. The update times seem to be halved compared to master before this ticket, and in the flamegraph `updateStopTimesInDB` is much smaller:
At this point the code is a little simpler, too.
This code, though, operates at the level of a single trip. I wonder whether, to get a really great update improvement, we need to batch stop time updates across all trips. This would make the code potentially more complex, but the savings could be awesome.
By the way, a nice thing about playing around with these optimizations is that the testing coverage is quite good, so there is not much concern about introducing regressions.
The last flamegraph isn't quite right - it turns out there's another goroutine that handles "copy from" operations, and it's reported as taking a lot of time. The really weird thing is that all of that time seems to be spent constructing errors, even though no errors are reported:
Looking through the source code of the pgx driver, it seems the function `encodePreparedStatementArgument` is no longer in mainline. In fact, it seems that the "copy from" part of pgx changed a lot between v4 (what Transiter is using) and v5 (the most recent pgx version). I think all of Transiter's dependencies should be updated before doing more performance improvements.
With the dependencies updated and using the previous commit mentioned above (not yet merged), the results here look pretty nice. I think I'm seeing about a 60% performance improvement overall, both from looking at the time taken for each update (has gone from ~8 seconds to ~3 seconds) and from the total time spent in the update function in pprof (from ~16 seconds to 6 seconds, over a 300 second profile).
The current state of play on this branch:
The main 3 things we can speed up here are:

1. `common.HashValue`. Right now it marshals the trip to JSON and takes a hash of the bytes; presumably this would be much faster if we just hashed the struct directly somehow.
2. `UpdateTrip` (second column under `realtime.(*stagedUpdates).run`).
3. `MarkStopTimesPast` (first column under `realtime.(*stagedUpdates).run`).

The last two batchings are a little trickier because we can't delete the old row and reinsert it. We may need to write some custom `pgx` driving code around an `UPDATE FROM` SQL query.
Other than these 3, I think we're at the limit of what we can do, at least in idiomatic Go. Looking through other parts of the flamegraph, non-trivial time is being spent in e.g. map accesses. That this is the bottleneck is a good sign that we've optimized the code well I think.
So it turns out sqlc/pgx have a really simple way to do batch updates: use `:batchexec` instead of `:exec`. Using this makes the two SQL queries mentioned in the last comment negligible in the flamegraph. The last remaining piece is optimizing the hash calculation - it's taking up 25% of the update time right now!
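As a sketch of what the change looks like: switching a query's sqlc annotation from `:exec` to `:batchexec` makes sqlc generate a method that takes a slice of parameter structs and sends all the updates in a single pgx batch, so there is one round trip to Postgres instead of one per stop time. (The table and column names below are illustrative, not the actual Transiter schema.)

```sql
-- Changing :exec to :batchexec turns the generated Go method's parameter
-- into a slice, and the updates are issued via pgx's batch support.
-- name: UpdateTripStopTime :batchexec
UPDATE trip_stop_time
SET arrival_time = $1,
    departure_time = $2
WHERE pk = $3;
```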
I did a proof-of-concept of a custom trip hasher in https://github.com/jamespfennell/transiter/commit/bfb910c7b8172abdd13657397a734831dcbb0b20 and, as expected, performance is much better, another 25% reduction in the time taken to do updates. The code is not crazy bad, so I'm going to commit it to the GTFS library and use it. Then this bug will be finished.
One other thing I've noticed in all of the profiles is that a long time is being spent in garbage collection. The GTFS library creates a lot of pointers, and maybe this is why the GC is so busy? Now that generics have landed, all of these pointers could be replaced with an optional type, which would be much more GC friendly. This would be a big breaking change, though.
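A minimal sketch of what such an optional type could look like (a hypothetical API, not what the GTFS library actually ships): the value is stored inline in the containing struct rather than behind a pointer, so there is no extra heap allocation and no pointer for the GC to trace.

```go
package main

import "fmt"

// Optional is a GC-friendlier alternative to *T for optional fields.
// Hypothetical sketch, not the GTFS library's real API.
type Optional[T any] struct {
	value T
	valid bool
}

// Some wraps a present value.
func Some[T any](v T) Optional[T] { return Optional[T]{value: v, valid: true} }

// None returns an absent value.
func None[T any]() Optional[T] { return Optional[T]{} }

// Get returns the value and whether it is present.
func (o Optional[T]) Get() (T, bool) { return o.value, o.valid }

func main() {
	delay := Some(int32(30))
	if v, ok := delay.Get(); ok {
		fmt.Println(v) // prints 30
	}
	if _, ok := None[int32]().Get(); !ok {
		fmt.Println("absent") // prints absent
	}
}
```

The trade-off is that a zero-size "absent" marker now costs a `bool` per field instead of a nil pointer, but large structs full of these are contiguous in memory and invisible to the GC's pointer scanning.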
I just took a "heap profile" (first time ever doing that!) and according to `inuse_space`, 70% of the memory is being allocated in the GTFS realtime parser. The profile seems a little janky though and is only showing a few functions:

There is also `alloc_space`, in which the distribution follows the CPU distribution.
Edit: my understanding of the profile is incorrect here. The profile is just showing live memory, and so doesn't tell us about the allocation profile of the program.
I just committed an update to the GTFS library, which incorporates a much faster hash function for trips (https://github.com/jamespfennell/gtfs/commit/02cdcf2f4959fbca1708661502c2caf72c2c9af8). As expected, CPU usage in the update process decreased again.
I rolled out all of the changes in this issue to demo.transiter.dev, and am seeing an awesome reduction in the amount of time taken to do updates for the NYC subway numbered trains:
The graph is using a 5 minute sliding window, which is why the decrease is smooth. Note that the bottom axis is 0.2 seconds. The average update time thus dropped from ~800ms to ~350ms! On my CPU monitoring, this seems to have reduced Transiter's CPU usage by 10% of a vCPU (maybe about a 50% reduction, but I'm not sure because the node is multi-tenanted with other projects I run).
Finally, the last profile I took shows maybe 45% of the CPU time being spent on garbage collection. I think this is potentially expected, because with these optimizations Transiter is now highly IO bound. But we can revisit this later, of course.
For the new NYC buses transit system (https://github.com/jamespfennell/transiter/pull/97), I'm seeing GTFS realtime feed updates taking up to 7 seconds. To be clear, this is probably not a huge deal from a features perspective because the accuracy of transit data is O(30 seconds) so having slightly stale data is fine. And the update process is in the background so it doesn't impact HTTP latency. The main problem is that for those 7 seconds Transiter is probably using a non-trivial amount of CPU/memory/IO, and thus requires a larger machine for deployment. By analogy, when the Python to Go migration was merged the minimum requirements to run Transiter with the NYC subway system dropped from ~2.5 CPUs to ~0.5 CPUs because the update process became 5 times more efficient. It's literally cheaper to run Transiter now.
As a first pass I ran pprof on a Transiter instance with the new system:
which dumped a new profile in my `~/pprof` directory. (Note that you can always take a pprof profile of a running Transiter instance using the admin HTTP endpoint `/debug/pprof/profile`.) This is the flamegraph, limited to the `UpdateFeed` function which drives the feed update:

There are three big consumers of CPU time, in decreasing order:
1. `db.(*Queries).UpdateTripStopTime` - this function sends an update SQL query to Postgres and is invoked once for every single stop time update in the feed. This is a lot of stop time updates! I suspect the reason this is taking so long isn't because the SQL query is slow. It's because of the overhead of calling Postgres from the Go code. It would be awesome to batch these SQL queries somehow. This is non-trivial though because it's an update query and Postgres/sqlc doesn't seem to offer a simple solution here.
2. `db.(*Queries).InsertServiceMapStop` - this inserts a single node in a service map. Again, batching this would probably help dramatically. This is easier than the last case because it's an insert query, so we can probably use sqlc's support for `CopyFrom`.
3. `common.Hash` - this takes a hash of the GTFS data for a single trip, and uses that to decide if the trip needs to be updated. If not, the trip update is skipped. This was submitted in https://github.com/jamespfennell/transiter/commit/b1bc7e19544f9d8138316e1d05252a20168cd946 and overall is a pretty good optimization - it halves the time needed to update the NYC subway. However there may be a way to make the hashing faster.
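For the insert case, sqlc's `:copyfrom` annotation (available with the pgx driver) is the relevant tool: it generates a method that takes a slice of rows and streams them to Postgres using the `COPY` protocol, which is far cheaper than one `INSERT` round trip per row. A sketch, with an illustrative table definition rather than the real Transiter schema:

```sql
-- With :copyfrom, the generated Go method accepts a []InsertServiceMapStopParams
-- and sends all rows in a single COPY operation.
-- name: InsertServiceMapStop :copyfrom
INSERT INTO service_map_stop (service_map_pk, stop_pk, position)
VALUES ($1, $2, $3);
```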