Look into speeding up feed imports

araichev commented 10 years ago

I think it would be worthwhile for someone, possibly me, to spend some time figuring out how to speed up feed imports. As it stands, a 165MB feed (21MB zipped) such as Southeast Queensland's (http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications/open-data/gtfs/SEQ.zip) takes about 9--12 hours to load into a PostGIS database. (I'll time it more accurately next time i do the import.) Is that slow? Seems like it, but i don't know much about databases.

Permafacture commented 10 years ago

thanks for the heads up. I decided to load my city's bus information just to test this app and now I know it's going to take a loooong time. A comment about this in the README would be nice.

Permafacture commented 10 years ago

Without profiling it, my guess would be inserting into the database after every row of every file (base.py: cls.objects.create(**fields))

this project (https://github.com/cbick/gtfs_SQL_importer) just builds a giant insert statement and combines it with sql. I don't think this would work out of the box for django-multi-gtfs, but maybe it could be reworked? Or rework multi-gtfs to fit gtfs_SQL_importer? I guess comparing the performance of the two would be a good first step...

jwhitlock commented 10 years ago

I suspected this would be required. Thanks for the gtfs_SQL_importer tip, I'll see how they do it, and see if there is some common ground.

Some other ideas:

Turn off indexes before import, re-enable after import
Stop using DecimalField
Remove some belts-and-suspenders indexes, rely on validators for consistency checking

Permafacture commented 10 years ago

Disclaimer: I know squat about databases. I'm just going off stack overflow here.

I think before doing anything fancy, that batching the INSERTs will have the biggest effect. Here is an interesting decorator that might help: https://docs.djangoproject.com/en/dev/topics/db/transactions/#django.db.transaction.commit_on_success

The next step if that is still very slow seem to be to drop all foreign key constraints before import and recreate them after. So long as there is still a column with an integer that just happens to be the primary key of the row in another table you want a relation to, I think this should be fine. This would be done with South: http://south.readthedocs.org/en/latest/databaseapi.html#db-delete-foreign-key and http://south.readthedocs.org/en/latest/databaseapi.html#db-alter-column

I don't know if django will have trouble (like, with building m2m tables) if south is playing with the schema behind django's back.

I get the impression that turning off indexes (indices?) will make inserting slower by making the foreign key constraint much slower to check. Or maybe turning off indices and disabling foreign key constraints amount to the same thing. I have no idea.

But, "optimization before profiling is always premature".

codelahoma commented 10 years ago

Considering only I/O, since I really doubt this is a computation bound problem, my understanding is as follows (generally, for databases).

Since indices involve both reads and writes, turning them off should give a bigger boost than disabling foreign key constraints, which are reads.

That said, if the tables containing the foreign keys are large enough, turning off the indices could cause the constraints to slow down, too.

Batching should help a lot, since it takes the performance of overall request overhead from O(n) to O(n/batch_size).

Apologies if I'm abusing big O notation there, if what I say is obvious, and/or if what I say is obviously wrong.

Edit: Made it clearer I'm referring to aggregate overhead, not per call.

jwhitlock commented 10 years ago

Blocked by #10 - Southeast Queensland's feed contains additional columns that have to be hand-removed to get the feed into multigtfs.

Permafacture commented 10 years ago

Sorry, what is blocked? I don't see how extra columns affect batching or turning off constraints/indicies.

Elliot On Jun 11, 2014 10:40 AM, "John Whitlock" notifications@github.com wrote:

Blocked by #10 https://github.com/tulsawebdevs/django-multi-gtfs/issues/10 - Southeast Queensland's feed contains additional columns that have to be hand-removed to get the feed into multigtfs.

— Reply to this email directly or view it on GitHub https://github.com/tulsawebdevs/django-multi-gtfs/issues/28#issuecomment-45758746 .

jwhitlock commented 10 years ago

Here's my desired process:

Time importing Southeast Queensland's Feed
Make some changes
Time importing Southeast Queensland's Feed. Did the changes make a big enough difference? Save the changes. Did they make little difference or have a negative effect? Try something else.

Because Southeast Queensland's feed includes extra columns (#10), I can't do step 1.

So, blocked is probably the wrong word. I could use a different feed, optimize, and just tell araichev 'works for me'. But, I'm going to work on #10 first, so that I can compare apples to apples (or South Queensland to South Queensland).

Permafacture commented 10 years ago

That makes sense, though it might make more sense to use a smaller benchmark rather than wait 9 hours to see if there was an improvement. Also, araichev didn't complain about not being able to import at all, so fixing #10 isn't exactly related to this.

araichev commented 10 years ago

Yeah, after fixing issue #10, i would test a small feed first, such as Cairns (739 KB zipped; http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications/open-data/gtfs/cairns.zip), then move on to a slightly bigger one (<= 10 MB zipped; any suggestions?), then finally a big one, such as Southeast Queensland (21.1 MB zipped; http://translink.com.au/sites/default/files/assets/resources/about-translink/reporting-and-publications./open-data/gtfs/SEQ.zip). A feed of intermediate size (any suggestions

jwhitlock commented 10 years ago

Issue #10 is fixed, so I'm able to import some feeds. I re-discovered that DEBUG=True can hurt you (see my blog post). My times are:

SEQ.zip (21 MB) - 12 hours
Cairns.zip (740 KB) - 31 minutes
Tulsa Transit (1.3 MB) - 28 minutes

As you can see from the Tulsa Transit vs Cairns number, the content of the feed matters more than the absolute size. Here's the timing breakdown for SEQ:

12776 Stops in 63 seconds, or 202 / sec
525 Routes in 0.7 seconds, or 750 / sec
101 Services in 0.1 seconds
154 Service Dates in 0.3 seconds
715557 Shape Points in 31909 seconds (8.5 hours), or 22 / sec
85724 Trips in 385 seconds, or 300 / sec
2512036 StopTimes in 9416 seconds (2.5 hours), or 266 / sec
30 - 60 minutes to add geometries to trips, routes, etc.

So, if a feed has a lot of detailed shapes, it will import super slowly.

I was worried that geometry columns would be the biggest problem. There is certainly a difference between rows with geometry, such as Stops, and rows without geometry, such as Routes. However, I think the biggest difference is rows that refer to other types, such as trips belonging to a route. So, with a much smaller test feed, I'm going to try:

Caching IDs of inserted rows, to avoid database lookups for related rows
Bulk importing of rows

I'm hopeful that small steps can get SEQ down below 2 hours. Not great, but back in long lunch territory.

araichev commented 10 years ago

Nice work profiling, John. Let's see how the optimizations go.

jwhitlock commented 10 years ago

I've pushed code that speeds up imports and exports. Here's the numbers for my laptop and a local PostGIS database:

SEQ - 41 minute import (17x speed-up), 11 minutes export
Cairns - 88 second import (21x speed-up), 30 second export
Tulsa - 77 second import (21x speed-up), 25 second export

SEQ was updated a few days ago, so the total item numbers have changed:

12771 Stops in 8.7 seconds, or 1468 / sec
537 Routes in 0.3 seconds
735419 Shape Points in 331.5 seconds, or 2218 / sec
76540 Trips in 37.6 seconds, or 2035 / sec
2248646 Stop Times in 575.3 seconds, or 3908 / sec
99 seconds to create cached Shape geometries
298 seconds to create cached Trip geometries
1130 seconds to create cached Route geometries

That's 16 minutes to import the data, and then 25 minutes to update the cached geometries. I can't see a way to speed this up. Maybe someone with more PostGIS experience will see a solution.

It's still memory intensive to import or export a feed as big as SEQ.zip. I wouldn't attempt it with less than 4 GB of RAM.

jwhitlock commented 10 years ago

By the way, thanks for the suggestions. About 1/3 of the import speedup was from bulk inserts, and 2/3 from pre-caching database reads during import. I didn't try other options, like temporarily removing indexes or changing field types. There may be additional improvements, but this feels good enough for now.

Permafacture commented 10 years ago

41 minutes is a heck of a lot more useable than 12 hours. Good job.

araichev commented 10 years ago

Yeah, that's great!

tulsawebdevs / django-multi-gtfs

Look into speeding up feed imports #28