itinero / GTFS

.NET implementation of a General Transit Feed Specification (GTFS) feed parser.
http://www.itinero.tech
MIT License
69 stars 45 forks source link

Optimization idea for faster lookups #25

Open rhwilburn opened 8 years ago

rhwilburn commented 8 years ago

I had an idea today when trying to work in-memory with GTFS that it can make sense to do look ups, as I have seen in the code. The look ups however are likely (and I am yet to test an approach here) inefficient as they use LINQ First() method. I think putting things in a dictionary might make alot of sense here, at least for finding items by Id. Currently my app takes 17 minutes to run, so I am seeing if I can bring that down a bit. I will let you know if I find any key information here.

rhwilburn commented 8 years ago

I have confirmed doing this has optimized my program down from 17 minutes to 2.5 minutes.

My work around was to do the following:

var stopTimesByTrips = feed.StopTimes.GroupBy(stopTime => stopTime.TripId).ToDictionary(stopTimeGrouping => stopTimeGrouping.Key, stopTimeGrouping => stopTimeGrouping.OrderBy(shape => shape.StopSequence).ToList()); var shapesDictionary = feed.Shapes.GroupBy(shapes => shapes.Id).ToDictionary(shapeGrouping => shapeGrouping.Key, shapeGrouping => shapeGrouping.OrderBy(shape => shape.Sequence).ToList()); var tripsDictionary = feed.Trips.ToDictionary(trip => trip.Id); var routesDictionary = feed.Routes.ToDictionary(route => route.Id);

Maybe this sort of idea could be provide through extension methods?

xivk commented 8 years ago

I meant the in-memory db just for reference not so much for actual applications but I also built some code using it by now. Maybe it's better to use a database for your app or find some .NET in-memory db implementation we can use? I'm a bit reluctant to implement some of these features because these things can be fixed by using a database.

rhwilburn commented 8 years ago

I can see the arguments towards putting it a formal database (especially with regards to allowing DBs to cater for efficient indexing; I mean who wants to implement something that competes with the 100s of databases out there unless explicitly aiming to do so) however I can equally see the arguments for efficient code. I think my bias in suggesting this is that lists are being used where dictionaries should be because you are continuously structuring the code in a way that it is doing look ups.

With that said, you could argue that you might need dictionaries for many things other than just ID. You could also argue that people might not be doing lookups very often either. My were lookups generating a 2 gig GraphML database import file from 30mb zip of GTFS data (about a 1.5 million person city). I really want to convert in my use case, not to store in an intermediary persistent format (ie the database) for the one off conversion.

I have a work around so am not bothered as such if you are sure this is not in scope of the project. Feel free to close etc.