Ignore invalid feed columns

elad661 commented 8 years ago

Some transit operators add non-standard columns to their feeds (columns that exist neither in the Google Extensions nor in the official specs). For example, the MBTA feed (http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip) adds "route_sort_order" to routes.txt.

This causes pygtfs to fail while loading the feed, because the unknown column is passed to the model as a keyword argument it does not accept:

Failure while writing Routes(route_id='Blue', agency_id='1', route_short_name='', route_long_name='Blue Line', route_desc='Rapid Transit', route_type='1', route_url='', route_color='2F5DA6', route_text_color='FFFFFF', route_sort_order='1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/pygtfs-0.1.2-py3.4.egg/pygtfs/loader.py", line 76, in append_feed
  File "<string>", line 4, in __init__
  File "/usr/lib64/python3.4/site-packages/sqlalchemy/orm/state.py", line 306, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File "/usr/lib64/python3.4/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python3.4/site-packages/sqlalchemy/util/compat.py", line 184, in reraise
    raise value
  File "/usr/lib64/python3.4/site-packages/sqlalchemy/orm/state.py", line 303, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File "/usr/lib64/python3.4/site-packages/sqlalchemy/ext/declarative/base.py", line 648, in _declarative_constructor
    (k, cls_.__name__))
TypeError: 'route_sort_order' is an invalid keyword argument for Route

It would be better if these kinds of errors were ignored, so that pygtfs stays usable even with not-exactly-standard feeds.

elad661 commented 8 years ago

It seems that one of the forks of this library has a fix for this:

nathanhilbert/pygtfs_atx@0c52b07d98aaf9a677dc63e09544d5014e8dc549

@nathanhilbert I think it would be useful if you could create a pull request with this fix (and your sqlalchemy warning fix) so that other people can make use of them as well.

nathanhilbert commented 8 years ago

I'm always happy to contribute a PR when I can. I thought this solution would add too much time to loading: https://github.com/nathanhilbert/pygtfs_atx/commit/0c52b07d98aaf9a677dc63e09544d5014e8dc549#diff-37d894406fb18d5a282b4ac0a09aee96R77. Are there any ideas for making it a little less brute force?

elad661 commented 8 years ago

I thought about it for a bit and found a better solution:

It's possible to look at the header (the first line, with the column names) before reading everything, and only filter if the header contains an unknown column. And since the names of the unknown columns are then already known, there's no need to loop over the entire dict for every row read from the CSV.

I implemented it and it does seem a bit faster. I'll send a pull request.
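
For readers following along, the header-based idea described above could be sketched roughly as follows. This is not the code from the pull request or from pygtfs itself: `read_known_columns` and `KNOWN_ROUTE_COLUMNS` are hypothetical names made up for illustration, and the set of known columns is hard-coded here, whereas a real loader would presumably derive it from the model. It uses only the standard library `csv` module.

```python
import csv
import io

# Hypothetical set of columns the Route model accepts. Hard-coded for this
# sketch; it is an assumption, not taken from the pygtfs source.
KNOWN_ROUTE_COLUMNS = {
    "route_id", "agency_id", "route_short_name", "route_long_name",
    "route_desc", "route_type", "route_url", "route_color",
    "route_text_color",
}


def read_known_columns(csv_file, known_columns):
    """Yield one dict per data row, keeping only columns in known_columns."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    # Inspect the header once, before reading any data rows.
    unknown = [name for name in header if name not in known_columns]
    if not unknown:
        # Fast path: a fully standard file needs no per-row filtering.
        for row in reader:
            yield row
        return
    for row in reader:
        # Remove only the columns already identified as unknown, instead of
        # checking every key of every row against the known set.
        for name in unknown:
            row.pop(name, None)
        yield row


# A tiny routes.txt-like sample with the extra column from the MBTA feed.
sample = io.StringIO(
    "route_id,route_long_name,route_type,route_sort_order\n"
    "Blue,Blue Line,1,1\n"
)
rows = list(read_known_columns(sample, KNOWN_ROUTE_COLUMNS))
```

With this shape, a standard feed pays only for one pass over the header, and a non-standard feed pays per row only for the handful of extra columns rather than for every key in the row.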

jarondl / pygtfs

Ignore invalid feed columns #18