jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
62 stars 7 forks source link

System deletion is really slow #103

Closed jamespfennell closed 1 year ago

jamespfennell commented 1 year ago

I have a Transiter instance with 3 transit systems that's been up for less than 24 hours. I tried to delete the NYC subway system and it took so long that I cancelled the deletion. Based on subsequent debugging, I estimate that it would have taken at least 2 minutes to delete the system - and much, much longer if this was a real long-running Transiter instance. For context, my expectation is that the delete call should be O(100ms).

I have root caused the issue. For system deletion, Transiter relies on Postgres ON CASCADE DELETE triggers. This is awesome because it makes the code really simple - we just have to delete the system row, and then Postgres ensures all other system data is deleted. However, it turns out this query can be slow - 2 minutes in this case.

As part of the system deletion, Postgres iterates over all feed updates and deletes them one by one. There are about 10 database tables (stop, route, trip) that have source_pk columns that foreign key into the feed updates table. To delete a feed update, Postgres needs to identify rows in these tables to cascade the deletion to. There are then two problems:

  1. The source_pk foreign keys are not indexed! In fact, many of Transiter's foriegn keys aren't. This is because I thought Postgres automatically generates indexes for foreign keys, but this is wrong. Thus to check for, say, stops that foriegn key into a particular feed update, Postgres has to do a full table scan of the stops table. With multiple transit system installed this is really slow.

  2. Postgres repeats this scan for every single feed update! Even if we add indexes, the deletion is still sort of slow (5 seconds) because Postgres does an index lookup for each update.

Based on this analysis, I think this slowness problem also affects the feed update GC because it will do a lookup in all the resource tables for each update it elects to delete.

That's the analysis; next, solutions.

Problem (1) is easy to solve: we need to ensure all foreign keys have indexes.

Problem (2) is trickier because this is just how Postgres deletes rows. However for feed update GC, we can do the following. We will change the GC query to be of the form:

DELETE FROM feed_update WHERE
NOT IN (
  SELECT source_pk FROM stop
  UNION
  SELECT source_pk FROM route
  UNION
  ...
)
AND <other conditions>;

With this, all of the scans disappear. This is because Postgres first performs the (fast) inner queries SELECT source_pk FROM stop. Then when deleting a specific feed update row, it knows it doesn't need to check the stops table. Essentially, we're instructing Postgres to scan the stops table once, and then use the results of the scan for all of the deletions.

For system deletions, I think we can just change the deletion code to first GC all of the feed updates for all of the feeds. Then the actual deletion should be very fast.