integrity tests and fixes for sequences and circular lines

Hi there, for PT networks of hundreds of thousands to millions of links, quetzal's integrity check functions integrity_test_sequences() and integrity_test_circular_lines() take an indefinitely long time (I had to interrupt the last test with 2 million links after one day). This is why I suggest some faster logic:

The sequence test only compares the number of links in a trip with its highest link_sequence value. This can overlook sequences like 1-->2-->2-->4, but such cases are unlikely (they do not occur in my GTFS feeds):

def test_sequences(trip):
    assert len(trip)==trip['link_sequence'].max(), \
        'broken sequence in trip {}'.format(trip['trip_id'].unique()[0])
self.links.groupby('trip_id').apply(test_sequences)
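
If cases like 1-->2-->2-->4 should be caught as well, a stricter check can stay vectorized. The following is only a sketch, assuming a links table with `trip_id` and `link_sequence` columns; the function name `find_broken_sequences` and the toy data are made up for illustration and are not part of quetzal:

```python
import pandas as pd

def find_broken_sequences(links: pd.DataFrame) -> pd.Index:
    """Return the trip_ids whose link_sequence values are not exactly 1..n."""
    # Order the links of every trip by their sequence number.
    ordered = links.sort_values(['trip_id', 'link_sequence'])
    # Position of each link within its trip: 1, 2, 3, ...
    expected = ordered.groupby('trip_id').cumcount() + 1
    # Any mismatch marks a gap or a duplicate in the sequence.
    broken = ordered.loc[ordered['link_sequence'] != expected, 'trip_id']
    return pd.Index(broken.unique(), name='trip_id')

# Toy data (made up): trip 'x' is fine, trip 'y' has the sequence 1, 2, 2, 4.
links = pd.DataFrame({
    'trip_id': ['x', 'x', 'x', 'y', 'y', 'y', 'y'],
    'link_sequence': [1, 2, 3, 1, 2, 2, 4],
})
broken_trips = find_broken_sequences(links)
```

The length-versus-maximum test would accept trip 'y' (4 links, maximum 4), whereas this variant reports it.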

The circular lines test should account for any case where duplicate stops occur within one trip:

def test_circular(trip):
    if len(set(list(trip['a'])+list(trip['b']))) != len(trip)+1:
        return trip
self.circular_lines = self.links.groupby('trip_id').apply(test_circular).reset_index(level='trip_id', drop=True)
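
The criterion behind this test is that a trip with n links and no repeated stop touches exactly n+1 distinct stops. The same criterion can be applied without a Python function call per trip. This is a sketch under the assumption that the links table has `trip_id`, `a` and `b` columns; `find_circular_trips` and the toy data are hypothetical:

```python
import pandas as pd

def find_circular_trips(links: pd.DataFrame) -> list:
    """Return trip_ids whose number of distinct stops differs from n_links + 1."""
    n_links = links.groupby('trip_id').size()
    # Stack the 'a' and 'b' columns to count the distinct stops of each trip.
    stops = pd.concat([
        links[['trip_id', 'a']].rename(columns={'a': 'stop'}),
        links[['trip_id', 'b']].rename(columns={'b': 'stop'}),
    ])
    n_stops = stops.groupby('trip_id')['stop'].nunique()
    # A simple path with n links has n + 1 distinct stops.
    mask = n_stops != (n_links + 1)
    return list(mask[mask].index)

# Toy data (made up): 'line' runs A-B-C, 'loop' runs A-B-C-A.
links = pd.DataFrame({
    'trip_id': ['line', 'line', 'loop', 'loop', 'loop'],
    'a': ['A', 'B', 'A', 'B', 'C'],
    'b': ['B', 'C', 'B', 'C', 'A'],
})
circular_trips = find_circular_trips(links)
```

Like the test above, this also flags trips with broken link succession (too many distinct stops), so it is best run after the sequences have been repaired.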

On the other hand, the fix methods are a bit too quick, dropping all affected trips. I would suggest a more thorough fix that splits up trip_ids, knowing that this creates additional interchanges. That does not represent reality, but it is better than dropping trips when their number is considerable.

A suggestion for trip sequences:

def fix_sequences(trip):
    if len(trip) > 1:
        trip = trip.sort_values('link_sequence')
        # Check link succession
        ind = list(trip.index)
        for i in range(len(trip.index) - 1):
            try:
                assert trip.loc[ind[i], 'b'] == trip.loc[ind[i+1], 'a'], \
                    'broken trip {}: stop {} has no successor link'.format(
                        trip['trip_id'].unique()[0], trip.loc[ind[i], 'b'])
            except AssertionError:
                trip.loc[ind[i+1]:ind[-1], 'trip_id'] = \
                    trip.loc[ind[i+1]:ind[-1], 'trip_id'] + '_' + str(i)
        # Repair sequences
        if len(trip) != trip['link_sequence'].max():
            # Number the links 1..n within each (possibly split) trip_id, in sorted order
            trip['link_sequence'] = trip.groupby('trip_id').cumcount() + 1
    return trip
self.links = self.links.groupby('trip_id').apply(fix_sequences).reset_index(level=0, drop=True)
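
The splitting idea (start a new trip_id wherever a link does not begin at the stop where the previous link ended) can also be written without a loop over rows. This is a different, vectorized formulation and only a sketch: it assumes `trip_id` is a string column, it names the split parts `<trip_id>_1`, `<trip_id>_2`, ... (a naming scheme chosen here for illustration), and it does not check whether those ids already exist:

```python
import pandas as pd

def split_broken_trips(links: pd.DataFrame) -> pd.DataFrame:
    """Split trips at broken link successions and renumber link_sequence."""
    links = links.sort_values(['trip_id', 'link_sequence']).copy()
    # Stop where the previous link of the same trip ended (NaN for the first link).
    previous_b = links.groupby('trip_id')['b'].shift()
    # A break occurs when a link does not start where the previous one ended.
    is_break = previous_b.notna() & (links['a'] != previous_b)
    # Number the segments of each trip: 0 before the first break, then 1, 2, ...
    segment = is_break.astype(int).groupby(links['trip_id']).cumsum()
    # Keep the original id for the first segment and suffix the later ones.
    suffix = ('_' + segment.astype(str)).where(segment > 0, '')
    links['trip_id'] = links['trip_id'] + suffix
    # Renumber the links from 1 within each resulting trip.
    links['link_sequence'] = links.groupby('trip_id').cumcount() + 1
    return links

# Toy data (made up): the trip jumps from stop C to stop D without a link.
links = pd.DataFrame({
    'trip_id': ['t', 't', 't', 't'],
    'a': ['A', 'B', 'D', 'E'],
    'b': ['B', 'C', 'E', 'F'],
    'link_sequence': [1, 2, 3, 4],
})
fixed = split_broken_trips(links)
```

In the toy data the links A-B and B-C keep the id 't', while D-E and E-F become 't_1' with their sequence restarted at 1.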

My suggestion for circular lines fixes 97% of the circularity issues:

def fix_circular_split(trip):
    def split_trip(trip, split_by):
        split = [trip.index.get_loc(i) for i in trip.loc[trip[split_by].duplicated(keep=False)].index]
        if len(split) >= 1:
            trips = []
            # First stops
            trips.append(trip.iloc[: split[0]+1])
            # Middle stops
            for i in range(1, len(split)):
                t = trip.iloc[split[i-1]+1 : split[i]].copy()  # copy to avoid writing into a slice of trip
                t['trip_id'] = t['trip_id'] + '_' + str(i) + str(split_by)
                t['link_sequence'] = list(range(1, len(t)+1))
                trips.append(t)
            # Last stops
            t = trip.iloc[split[-1] :].copy()  # copy to avoid writing into a slice of trip
            t['trip_id'] = t['trip_id'] + '_n' + str(split_by)
            t['link_sequence'] = list(range(1, len(t)+1))
            trips.append(t)
            return pd.concat(trips)
        else:
            return trip
    # Split duplicated b stops
    trip = split_trip(trip, 'b')
    # Split duplicated a stops
    trip = trip.groupby('trip_id').apply(split_trip, 'a')
    return trip
fixed = self.circular_lines.groupby('trip_id').apply(fix_circular_split).reset_index(level='trip_id', drop=True)
initial_circular = self.circular_lines.copy()
# Re-run the test on the split trips to find those that are still circular
self.circular_lines = fixed.groupby('trip_id').apply(test_circular).reset_index(level='trip_id', drop=True)
fixed.drop(self.circular_lines.index, inplace=True)
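# Replace the originally circular trips with their split versions
self.links = self.links.loc[~self.links['trip_id'].isin(initial_circular['trip_id'].unique())]
self.links = pd.concat([self.links, fixed])  # DataFrame.append was removed in pandas 2.0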
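
To illustrate what splitting a circular trip means, here is a simplified sketch on a plain list of stops. It is not the logic of fix_circular_split above but a greedy variant: a new segment starts as soon as a stop would be visited twice within the current segment. The function name is hypothetical, and the sketch assumes that no stop appears twice in a row (no link from a stop to itself):

```python
def split_at_repeated_stops(stops):
    """Split an ordered list of stops into segments that visit no stop twice."""
    segments = []
    current = [stops[0]]
    for stop in stops[1:]:
        if stop in current:
            # Close the segment and start the next one at its last stop,
            # so that consecutive segments share an interchange stop.
            segments.append(current)
            current = [current[-1]]
        current.append(stop)
    segments.append(current)
    return segments

# Toy trip (made up) that returns to stop A before continuing to D.
segments = split_at_repeated_stops(['A', 'B', 'C', 'A', 'D'])
```

Each segment would then get its own trip_id and a link_sequence starting at 1, at the cost of an extra interchange at the shared stop (C in the toy trip).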

It's all tested with the PT network of the whole of Germany. I hope I made no mistakes translating the logic into quetzal function suggestions.

I would suggest keeping the current methods, but including an option for "quick-checks" and "thorough-fixes". Cheers

systragroup / quetzal

integrity tests and fixes for sequences and circular lines #84