kuanb / peartree

peartree: A library for converting transit data into a directed graph for sketch network analysis.
MIT License
201 stars 23 forks source link

Why handle `direction_id` being NaN and no `direction_id` differently? #89

Closed yiyange closed 6 years ago

yiyange commented 6 years ago

When working with a GTFS feed from Beaumont, CA, pt.paths.generate_summary_graph_elements erred and logged as below:

InsufficientSummaryResults: The target time frame returned no valid edge costs from feed object.

It turns out that it is because trips_and_stop_times generated in route_analyzer.generate_route_costs() has a direction_id but it is all NaN. Consequently, all average wait based on headway for each stop is NaN.

Has direction_id but it is all Nan

screen shot 2018-07-12 at 9 37 43 am

As a result, no valid wait time is generated.

screen shot 2018-07-12 at 9 36 18 am

However, the logic in generate_wait_times() makes a different decision about trips_and_stop_times having no direction_id

It just uses the all trips, no matter what direction, to get wait time at a given stop.

def generate_wait_times(trips_and_stop_times: pd.DataFrame
                        ) -> Dict[int, List[float]]:
    wait_times = {0: {}, 1: {}}
    for stop_id in trips_and_stop_times.stop_id.unique():
        # Handle both inbound and outbound directions
        for direction in [0, 1]:
            # Check if direction_id exists in source data
            if 'direction_id' in trips_and_stop_times:
                constraint_1 = (trips_and_stop_times.direction_id == direction)
                constraint_2 = (trips_and_stop_times.stop_id == stop_id)
                both_constraints = (constraint_1 & constraint_2)
                direction_subset = trips_and_stop_times[both_constraints]
            else:
                direction_subset = trips_and_stop_times.copy()

            # Only run if each direction is contained
            # in the same trip id
            if direction_subset.empty:
                average_wait = np.nan
            else:
                average_wait = calculate_average_wait(direction_subset)

            # Add according to which direction we are working with
            wait_times[direction][stop_id] = average_wait

    return wait_times

It seems that direction_id being NaN and having no direction_id are essentially the same. Can we also copy trips_and_stop_times for the former case?

yiyange commented 6 years ago

Similar question re: handling direction_id column in: https://github.com/kuanb/peartree/blob/8e4b6a71e22b69ac5f2b7f4a2ff5c7b5b92a2b88/peartree/parallel.py#L142-L149

kuanb commented 6 years ago

It seems like if there are multiple points at which this decision needs to be made, then the decision should happen upstream. That is, at the very beginning of this processing, if there is not 100% coverage of direction ID, then we should remove/drop the column entirely.

yiyange commented 6 years ago

@kuanb How about this ^^^

kuanb commented 6 years ago

Fixed by https://github.com/kuanb/peartree/pull/90