It is somewhat unlikely that this issue is the cause of the error we're seeing, but the problem I mentioned above (propagation of elapsed times out to samples) does deserve some consideration. For a target sample that lies between two main roads served by buses, where the buses pass in alternation (first one, then the other, every few minutes), the minimum travel time will still be accurate, but the average and maximum will be quite wrong; the effect can pull them up significantly.
Here are some measurements for propagating the travel times all the way out to the street vertices and storing them all, since neither histograms nor summary statistics can be properly propagated over the "last mile". For a graph built from the full New York Mapzen OSM extract and a 60-minute time window (60 searches), the in-memory size is about 128MB.
Writing out the array using Protobuf CodedOutputStream variable-byte ints, the data takes 35MB on disk, which shrinks to 10MB when piped through gzip. The array is partly empty because not all vertices are reached, but it is filled out to the typical 2-hour search extent. Since we are storing elapsed times rather than clock times, it might be a good compromise to keep the values as shorts rather than ints in memory; the output on disk would of course still be the same size, due to the varints.
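For concreteness, here is a minimal sketch of that serialization, assuming a two-dimensional array of elapsed times (one row per departure minute, one entry per street vertex) and Integer.MAX_VALUE as the unreached sentinel. The class and method names are hypothetical; CodedOutputStream and GZIPOutputStream are the real APIs involved.

```java
import com.google.protobuf.CodedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class TravelTimeWriter {
    /**
     * Write one elapsed-time value per (departure minute, street vertex)
     * as unsigned varints, gzipping the whole stream.
     */
    public static void writeTimes(int[][] timesPerMinute, String path) throws IOException {
        try (OutputStream gz = new GZIPOutputStream(new FileOutputStream(path))) {
            CodedOutputStream out = CodedOutputStream.newInstance(gz);
            for (int[] times : timesPerMinute) {
                for (int t : times) {
                    // Unreached vertices hold a sentinel; writing 0 instead
                    // keeps each such varint to a single byte.
                    out.writeUInt32NoTag(t == Integer.MAX_VALUE ? 0 : t);
                }
            }
            out.flush(); // push buffered bytes into the gzip stream before close
        }
    }
}
```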
Of course, saving 10MB per origin is problematic in New York, which has 40k census blocks; that makes 400GB of output data for a single run, which is just not reasonable. We need to collapse to histograms or summary statistics before saving these to disk, or never save them at all and combine them immediately with pointsets before discarding them (the current approach), keeping only one histogram of access at the origin. Indeed, there is perhaps no reason to keep the files at all, since we can produce the data they contain on demand in a few seconds or less per origin.
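As a sketch of what "one histogram of access at the origin" could look like, assuming per-sample travel times and opportunity counts as plain arrays (the names are illustrative, not the actual OTP types):

```java
public class AccessHistogram {
    /**
     * Collapse per-sample travel times into a single histogram at the
     * origin: opportunities reachable, bucketed by travel-time minute.
     */
    public static int[] compute(int[] travelTimeSeconds, int[] opportunities, int maxMinutes) {
        int[] histogram = new int[maxMinutes];
        for (int i = 0; i < travelTimeSeconds.length; i++) {
            int t = travelTimeSeconds[i];
            if (t == Integer.MAX_VALUE) continue; // sample was never reached
            int minute = t / 60;
            if (minute < maxMinutes) histogram[minute] += opportunities[i];
        }
        return histogram;
    }
}
```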
If the large arrays are always discarded immediately after being propagated to the pointset samples, then it doesn't hurt much for them to be big.
We might want to consider linking pointsets directly to the transit stops (a list of stops and distances per point in the pointset). This would save the creation of the time surface, but I'm not sure how advantageous that would be. Currently we link the pointsets to the street vertices, link the transit stops to the street vertices, and then splice everything together (a vestige of graph searches that naturally propagated themselves out into the street graph). The advantage of the current approach is that time surfaces are independent of any one pointset.
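A rough sketch of what such a direct point-to-stop linkage might look like, with entirely hypothetical names:

```java
/**
 * Hypothetical sketch of linking a pointset directly to transit stops,
 * skipping the street vertices entirely. None of these names are the
 * actual OTP API; they just illustrate the shape of the data.
 */
public class PointToStopLinkage {
    /** For each point in the pointset, the indexes of transit stops within walking range. */
    int[][] stopsForPoint;
    /** Parallel to stopsForPoint: walk time in seconds from the point to each stop. */
    int[][] walkSecondsForPoint;

    /** Earliest arrival at one point, given elapsed times at every stop for one departure minute. */
    int evaluate(int point, int[] elapsedSecondsAtStop) {
        int best = Integer.MAX_VALUE;
        int[] stops = stopsForPoint[point];
        int[] walks = walkSecondsForPoint[point];
        for (int i = 0; i < stops.length; i++) {
            int t = elapsedSecondsAtStop[stops[i]];
            if (t != Integer.MAX_VALUE && t + walks[i] < best) {
                best = t + walks[i];
            }
        }
        return best;
    }
}
```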
The primary use case for keeping the full 128+MB tables around, at least temporarily, is displaying arrival time histograms on mouse-over for single-origin searches.
In practice we create a new time surface with every request anyhow.
@mattwigway Just musing about what form time surfaces should take when the form that comes naturally from pure graph searches (i.e. the street-vertex leaves of a tree) is no longer imposed. TimeSurfaces could be generated rapidly at the current pointset of interest rather than at the street vertices. It's not clear yet whether this has much of an advantage.
Note that when making vector isochrones, among other things, we convert the time surface at the street vertices into a time surface on a regular grid. There's no reason now that we couldn't just make the regular grid a sort of implicit pointset and generate the time surface directly on that grid. It might make for smoother visualizations when no particular pointset is selected.
Analysis features have been removed from OTP2. Analysis development continues at https://github.com/conveyal/r5. Closing.
Consider a street segment whose two endpoint vertices have the following elapsed travel times for each departure minute:
A) 2, 2, 2, 2, 8, 8, 8, 8
B) 8, 8, 8, 8, 2, 2, 2, 2
Now we want to evaluate at a point halfway down the street segment, one minute from each endpoint. Currently we aggregate the results up to min/avg/max at each street vertex, then propagate those statistics out to the TimeSurfaceRangeSet, which gives min=3, avg=6, max=9. In reality, at every departure minute there is a way to reach the destination in three minutes, so min=avg=max=3.
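Here is a self-contained sketch of the two propagation orders applied to exactly this example (the helper names are illustrative):

```java
public class PropagationOrder {
    public static void main(String[] args) {
        // Elapsed minutes at each endpoint vertex, per departure minute.
        int[] a = {2, 2, 2, 2, 8, 8, 8, 8};
        int[] b = {8, 8, 8, 8, 2, 2, 2, 2};
        int walk = 1; // one minute of walking from either endpoint to the sample

        // Current approach: aggregate to min/avg/max at each vertex, then
        // propagate the statistics. Yields min=3, avg=6, max=9 at the sample.
        System.out.printf("aggregate first: min=%d avg=%d max=%d%n",
                Math.min(min(a), min(b)) + walk,
                Math.min(avg(a), avg(b)) + walk,
                Math.min(max(a), max(b)) + walk);

        // Correct order: take the better endpoint at each departure minute,
        // then aggregate. Yields min=avg=max=3.
        int[] atSample = new int[a.length];
        for (int t = 0; t < a.length; t++) atSample[t] = Math.min(a[t], b[t]) + walk;
        System.out.printf("propagate first: min=%d avg=%d max=%d%n",
                min(atSample), avg(atSample), max(atSample));
    }

    static int min(int[] xs) { int m = Integer.MAX_VALUE; for (int x : xs) m = Math.min(m, x); return m; }
    static int max(int[] xs) { int m = Integer.MIN_VALUE; for (int x : xs) m = Math.max(m, x); return m; }
    static int avg(int[] xs) { int s = 0; for (int x : xs) s += x; return s / xs.length; }
}
```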
We could skip the street vertices entirely and keep something like the StopTreeCache, but for each point in the destination PointSet. This might, however, be more voluminous and no faster than one tree per transit stop.
However, as @mattwigway has pointed out, this will only have an effect when the two ends of the street segment are reached by different means. The magnitude of the error is limited to the walking time along that street segment.