remix / partridge

A fast, forgiving GTFS reader built on pandas DataFrames
https://partridge.readthedocs.io
MIT License
150 stars 22 forks source link

Non-numeric IDs #68

Closed timhowgego closed 1 year ago

timhowgego commented 3 years ago

Description

As is, it appears that any non-numeric ID field will crash Partridge because Pandas is expecting IDs to be integers. Unfortunately the GTFS spec states that a valid ID "is a sequence of any UTF-8 characters", and in practice the use of non-numeric IDs is widespread.

What I Did

import partridge
partridge.read_busiest_date("gtfs.zip")
# For example - issue is not limited to this function

Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1050, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    partridge.read_busiest_date("gtfs.zip")
  File "...\partridge\readers.py", line 60, in read_busiest_date
    return _busiest_date(feed)
  File "...\partridge\readers.py", line 118, in _busiest_date
    service_ids_by_date = _service_ids_by_date(feed)
  File "...\partridge\readers.py", line 156, in _service_ids_by_date
    service_ids = set(feed.trips.service_id)
  File "...\partridge\gtfs.py", line 16, in getter
    return self.get(filename)
  File "...\partridge\gtfs.py", line 48, in get
    df = self._read(filename)
  File "...\partridge\gtfs.py", line 48, in get
    df = self._read(filename)
  File "...\partridge\gtfs.py", line 102, in _read_csv
    df = pd.read_csv(path, dtype=np.unicode, encoding=encoding, index_col=False)
  File "...\pandas\io\parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "...\pandas\io\parsers.py", line 463, in _read
    return parser.read(nrows)
  File "...\pandas\io\parsers.py", line 1052, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "...\pandas\io\parsers.py", line 2056, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 850, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 982, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1056, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'offending_string_id'
invisiblefunnel commented 3 years ago

Thanks for posting this issue @timhowgego. Can you help narrow down the problem by figuring out which file, column, and values in your particular GTFS file are failing to parse? Partridge doesn't convert GTFS ID columns to integers. You can see the default "converters" for each file in config.py.

BlackSpade741 commented 3 years ago

I've been encountering the same problem as well since ~2ish days ago (worked flawlessly before this), the problem seems to be with the route IDs in my GTFS feed. Maybe it is one of your dependencies?

timhowgego commented 3 years ago

I built a bare bones empty GTFS file - basically common headers only, no data - and then proceeded to add dummy data.

With one data line in trips.txt (all IDs as strings and other files header-only), the error triggers at the first column, which was route_id. Upon setting the route_id data to an integer, the error shifts to the next column, service_id. Set service_id to an integer and the error shifts to the third column, trip_id.

However, with all IDs as integers, the error moves to trip_short_name. So it looks like any variable that isn't an integer anywhere in the data will trigger this error. I'd mistakenly concluded this was ID issue because ID fields tend to occur first in GTFS rows. Apologies.

Dependencies may be at issue. Today I specifically uninstalled and reinstalled pandas and numpy (via pip) to fix an unrelated issue. My version reports 1.2.1 for pandas (11 days old) and 1.20.0 for numpy (released yesterday).

invisiblefunnel commented 3 years ago

Maybe it is one of your dependencies?

I think you are right.

the error triggers at the first column

~Possibly related to the index_col kwarg to pandas.read_csv?~ nope, see below

invisiblefunnel commented 3 years ago

It appears a bug was introduced in numpy when deprecating type aliases. The bug caused np.unicode to be aliased to np.compat.long which is an alias for int. This was quick to track down because of a typo in the deprecation message (unciode). Details: https://github.com/numpy/numpy/pull/18230/files#diff-09a2c2e86cca91dac0c336e4dad47197ba3cd837ff6c85abf8525cc3b879d3e6R213-R216

Here's a reproduction of the issue in pandas: https://gist.github.com/invisiblefunnel/a896921083f8cf03843e72da8dba783f.

I will file a bug with numpy if someone else can confirm my assessment. Meanwhile, is anyone interested in updating partridge to use str instead of np.unicode?

BlackSpade741 commented 3 years ago

I will file a bug with numpy if someone else can confirm my assessment.

Looks like it! long was incorrectly substituted in this change, before it was using str as the substitution. There's also a bug with aliasing np.long introduced in the same section of code, if you want to mention it.

Meanwhile, is anyone interested in updating partridge to use str instead of np.unicode?

I can take a crack at updating it to str to unblock myself :) Seems like it is just a few simple changes.

invisiblefunnel commented 3 years ago

There's also a bug with aliasing np.long introduced in the same section of code, if you want to mention it.

What's the bug with np.long?

BlackSpade741 commented 3 years ago

Actually on second look it should be fine. Cheers.

invisiblefunnel commented 3 years ago

Numpy bug report https://github.com/numpy/numpy/issues/18287

timhowgego commented 3 years ago

@BlackSpade741 Obviously not my baby, but on casual reading your changes look fine to me. My only suggestion would be to also switch other depreciated types to a native Python type, simply to eliminate redundant (yet evidently breakable) dependencies:

BlackSpade741 commented 3 years ago

@timhowgego Good point! I'll take a look later :)

BlackSpade741 commented 3 years ago

Seems like number types with specific precisions (specifically np.float64 and np.int64 that we are using in test_readers.py) are not being deprecated. Curious if you folks think it is better to keep these or move to the more general int.

invisiblefunnel commented 3 years ago

Better to scope your changes to deprecated type aliases 👍 .

BlackSpade741 commented 3 years ago

Updated the PR to include np.object as well!