Hi there,
For PT networks with hundreds of thousands to millions of links, quetzal's integrity check functions integrity_test_sequences() and integrity_test_circular_lines() take an indefinitely long time (I had to interrupt the last test, with 2 million links, after one day). This is why I suggest some faster logic:
The sequence test only accounts for the length of the trip, so it might overlook situations like 1-->2-->2-->4, but those are unlikely (they do not occur in my GTFS feeds).
The circular lines test should account for any case where duplicate stops occur within one trip.
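A minimal sketch of what such quick checks could look like, assuming links is a pandas DataFrame with the columns trip_id, a, b and link_sequence used in the rest of this post. The names quick_test_sequences and quick_test_circular are hypothetical and not part of quetzal; both checks are vectorized, so they avoid a Python-level loop over trips:

```python
import pandas as pd

def quick_test_sequences(links):
    """Return trip_ids whose number of links differs from their highest link_sequence."""
    stats = links.groupby('trip_id')['link_sequence'].agg(['size', 'max'])
    return stats.index[stats['size'] != stats['max']].tolist()

def quick_test_circular(links):
    """Return trip_ids in which a stop occurs more than once as 'a' or as 'b'."""
    dup_a = links.duplicated(subset=['trip_id', 'a'], keep=False)
    dup_b = links.duplicated(subset=['trip_id', 'b'], keep=False)
    return sorted(links.loc[dup_a | dup_b, 'trip_id'].unique())

# Toy data (made up for illustration): t2 has a sequence gap, t3 revisits stop 2
links = pd.DataFrame({
    'trip_id': ['t1', 't1', 't1', 't2', 't2', 't3', 't3', 't3'],
    'a': [1, 2, 3, 1, 2, 1, 2, 3],
    'b': [2, 3, 4, 2, 3, 2, 3, 2],
    'link_sequence': [1, 2, 3, 1, 3, 1, 2, 3],
})
broken = quick_test_sequences(links)
circular = quick_test_circular(links)
```

As noted above, the length-based check would not flag a trip whose links are numbered correctly but do not connect.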
On the other hand, the fix methods are a bit too hasty: they drop all affected trips. I would suggest a more thorough fix that splits up trip_ids, knowing that this results in additional interchanges. That does not represent reality, but it is better than dropping trips when their number is considerable.
A suggestion for trip sequences:
    import pandas as pd

    def fix_sequences(trip):
        if len(trip) > 1:
            trip = trip.sort_values('link_sequence')
            # Check link succession: each link must start where the previous one ends
            ind = list(trip.index)
            for i in range(len(ind) - 1):
                if trip.loc[ind[i], 'b'] != trip.loc[ind[i+1], 'a']:
                    # Broken trip: stop b has no successor link, so all
                    # following links get a new trip_id
                    trip.loc[ind[i+1]:ind[-1], 'trip_id'] = \
                        trip.loc[ind[i+1]:ind[-1], 'trip_id'] + '_' + str(i)
            # Repair sequences: renumber from 1 within each (possibly new) trip_id
            trip['link_sequence'] = trip.groupby('trip_id').cumcount() + 1
        return trip

    # Iterate over the groups explicitly so each sub-frame keeps its trip_id column
    self.links = pd.concat([fix_sequences(t) for _, t in self.links.groupby('trip_id')])
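For very large networks, the same idea can probably be expressed without a Python loop per trip. The following is only a sketch of a vectorized alternative, not the function above: split_broken_trips is a hypothetical name, and the suffix scheme ('_1', '_2', ... per segment) differs from the '_' + str(i) scheme used above. It assumes the new trip_ids do not collide with existing ones.

```python
import pandas as pd

def split_broken_trips(links):
    """Vectorized sketch: split trips at every gap between consecutive links."""
    links = links.sort_values(['trip_id', 'link_sequence']).copy()
    # End stop of the previous link within the same trip (NaN for the first link)
    prev_b = links.groupby('trip_id')['b'].shift()
    # A gap is a link whose start stop differs from the previous link's end stop
    gap = prev_b.notna() & (prev_b != links['a'])
    # Segment number within each trip: 0 before the first gap, 1 after it, ...
    segment = gap.astype(int).groupby(links['trip_id']).cumsum()
    suffix = ('_' + segment.astype(str)).where(segment > 0, '')
    links['trip_id'] = links['trip_id'] + suffix
    # Renumber link_sequence from 1 within each resulting trip_id
    links['link_sequence'] = links.groupby('trip_id').cumcount() + 1
    return links

# Toy data (made up for illustration): t1 has a gap between stops 3 and 5
links = pd.DataFrame({
    'trip_id': ['t1', 't1', 't1', 't1', 't2', 't2'],
    'a': [1, 2, 5, 6, 1, 2],
    'b': [2, 3, 6, 7, 2, 3],
    'link_sequence': [1, 2, 3, 4, 1, 2],
})
fixed_links = split_broken_trips(links)
```

In this toy example the last two links of t1 end up in a new trip t1_1 with link_sequence 1 and 2, while t2 is left unchanged.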
My suggestion for circular lines fixes 97% of the circularity issues:
    import numpy as np
    import pandas as pd

    def fix_circular_split(trip):
        def split_trip(trip, split_by):
            # Positions of the links whose split_by stop occurs more than once
            split = np.flatnonzero(trip[split_by].duplicated(keep=False).to_numpy())
            if len(split) == 0:
                return trip
            # First stops (keep the original trip_id)
            trips = [trip.iloc[:split[0]+1]]
            # Middle stops
            for i in range(1, len(split)):
                t = trip.iloc[split[i-1]+1 : split[i]].copy()
                t['trip_id'] = t['trip_id'] + '_' + str(i) + str(split_by)
                t['link_sequence'] = list(range(1, len(t)+1))
                trips.append(t)
            # Last stops
            t = trip.iloc[split[-1]:].copy()
            t['trip_id'] = t['trip_id'] + '_n' + str(split_by)
            t['link_sequence'] = list(range(1, len(t)+1))
            trips.append(t)
            return pd.concat(trips)

        # Split duplicated b stops
        trip = split_trip(trip, 'b')
        # Split duplicated a stops, separately for each trip_id created above
        return pd.concat([split_trip(t, 'a') for _, t in trip.groupby('trip_id')])

    # Keep the originally detected circular links before they are overwritten
    initial_circular = self.circular_lines.copy()
    fixed = pd.concat([fix_circular_split(t) for _, t in self.circular_lines.groupby('trip_id')])
    # Re-run the circular lines test on the split trips. test_circular is not shown
    # here; it presumably refreshes self.circular_lines with the links that are
    # still circular.
    fixed.groupby('trip_id').apply(test_circular)
    # Drop the links that remain circular after splitting
    fixed.drop(self.circular_lines.index, inplace=True)
    # Replace the original circular trips with their fixed versions
    self.links = self.links.loc[~self.links['trip_id'].isin(initial_circular['trip_id'].unique())]
    self.links = pd.concat([self.links, fixed])
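For comparison, here is a sketch of a different, greedy way to split a single trip at repeated stops. It is not the method above and has not been run on a real network: split_at_repeated_stops is a hypothetical name, the suffix scheme is an assumption, and it presumes that the links of the trip are contiguous (each a equals the previous b) and that no link has a == b.

```python
import pandas as pd

def split_at_repeated_stops(trip):
    """Greedy sketch: open a new segment whenever a link would revisit a stop."""
    trip = trip.sort_values('link_sequence').copy()
    segment, visited, segments = 0, set(), []
    for a, b in zip(trip['a'], trip['b']):
        if visited and b in visited:
            # Stop b was already served in this segment: start a new one here
            segment += 1
            visited = set()
        if not visited:
            visited.add(a)
        visited.add(b)
        segments.append(segment)
    segment_ids = pd.Series(segments, index=trip.index)
    # The first segment keeps the original trip_id, later ones get '_1', '_2', ...
    suffix = ('_' + segment_ids.astype(str)).where(segment_ids > 0, '')
    trip['trip_id'] = trip['trip_id'] + suffix
    trip['link_sequence'] = trip.groupby('trip_id').cumcount() + 1
    return trip

# Toy trip (made up for illustration) serving stops 1 -> 2 -> 3 -> 2 -> 4
trip = pd.DataFrame({
    'trip_id': ['T'] * 4,
    'a': [1, 2, 3, 2],
    'b': [2, 3, 2, 4],
    'link_sequence': [1, 2, 3, 4],
})
split = split_at_repeated_stops(trip)
```

Under the stated assumptions every link is kept and no resulting segment serves a stop twice, at the cost of one additional interchange per split.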
It's all tested with the PT network of the whole of Germany. I hope I made no mistakes translating the logic into quetzal function suggestions.
I would suggest keeping the current methods, but including an option for "quick-checks" and "thorough-fixes".
Cheers