lmmx / tubeulator

TfL open data interface library
https://tubeulator.vercel.app
MIT License

Reproducible download of stationdata #31

Open lmmx opened 3 months ago

lmmx commented 3 months ago

I downloaded the zips from the endpoint manually (which tubeulator should likewise be able to do… but currently doesn't, presumably because the schema names for StopPoint are hardcoded and stationdata is discarded?)
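
For reference, a minimal sketch of the direct download (not tubeulator's actual code: it assumes the requests library and an APP_KEY environment variable; the base URL and the app_key query parameter come from the schema dumped below):

import os
from pathlib import Path

import requests

BASE_URL = "https://api.tfl.gov.uk/stationdata"  # from the schema's servers entry
ZIP_NAMES = ["tfl-stationdata-detailed.zip", "tfl-stationdata-gtfs.zip"]

def download_stationdata(dest: Path = Path("data")) -> list[Path]:
    """Download each stationdata zip, authenticating via the app_key query param."""
    dest.mkdir(exist_ok=True)
    params = {"app_key": os.environ["APP_KEY"]}  # env var name is an assumption
    downloaded = []
    for name in ZIP_NAMES:
        response = requests.get(f"{BASE_URL}/{name}", params=params, timeout=60)
        response.raise_for_status()
        zip_path = dest / name
        zip_path.write_bytes(response.content)
        downloaded.append(zip_path)
    return downloaded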

This could then be extended to do things like building a graph of the network.
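
One hypothetical way to do that from the GTFS zip (assuming it contains a standard stop_times.txt; the use of networkx and the consecutive-stop edge semantics are my own choices here, not anything tubeulator provides):

import csv
import io
import zipfile
from itertools import pairwise

import networkx as nx

def network_graph(gtfs_zip_path: str) -> nx.Graph:
    """Connect consecutive stops on each trip in the GTFS feed."""
    graph = nx.Graph()
    with zipfile.ZipFile(gtfs_zip_path) as zf:
        with zf.open("stop_times.txt") as f:
            reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig"))
            rows = sorted(reader, key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
    for a, b in pairwise(rows):
        if a["trip_id"] == b["trip_id"]:  # only link stops within the same trip
            graph.add_edge(a["stop_id"], b["stop_id"])
    return graph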

lmmx commented 3 months ago

So previously we only cared about the components/schemas path in the schema, whereas here the schema doesn't have any; instead it has paths, which indicate where to go and pull the stationdata datasets from:

(tubeulator) louis 🚶 ~/dev/tubeulator $ tubeulator populate
> /home/louis/dev/tubeulator/src/tubeulator/utils/paths.py(60)load_endpoint_component_schemas()
-> endpoint_schema = load_endpoint_schema(schema_name)
(Pdb) n
> /home/louis/dev/tubeulator/src/tubeulator/utils/paths.py(61)load_endpoint_component_schemas()
-> component_schemas = endpoint_schema["components"].get("schemas", {})
(Pdb) n
> /home/louis/dev/tubeulator/src/tubeulator/utils/paths.py(62)load_endpoint_component_schemas()
-> return component_schemas
(Pdb) pp endpoint_schema
{'components': {'securitySchemes': {'apiKeyHeader': {'in': 'header',
                                                     'name': 'app_key',
                                                     'type': 'apiKey'},
                                    'apiKeyQuery': {'in': 'query',
                                                    'name': 'app_key',
                                                    'type': 'apiKey'}}},
 'info': {'description': '', 'title': 'Station Data', 'version': '1.0'},
 'openapi': '3.0.1',
 'paths': {'/tfl-stationdata-detailed.zip': {'get': {'description': 'TfL '
                                                                    'station '
                                                                    'data '
                                                                    'detailed',
                                                     'operationId': 'detailed',
                                                     'responses': {'200': {'description': ''}},
                                                     'summary': 'TfL station '
                                                                'data '
                                                                'detailed'}},
           '/tfl-stationdata-gtfs.zip': {'get': {'description': 'TfL station '
                                                                'data gtfs '
                                                                'files',
                                                 'operationId': 'gfts',
                                                 'responses': {'200': {'description': ''}},
                                                 'summary': 'TfL station data '
                                                            'gtfs'}}},
 'security': [{'apiKeyHeader': []}, {'apiKeyQuery': []}],
 'servers': [{'url': 'https://api.tfl.gov.uk/stationdata'}]}
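
One way the loader could handle both shapes (illustrative names only, not a committed design): branch on whether the schema declares component schemas, and otherwise treat each path as a downloadable dataset.

def classify_endpoint_schema(endpoint_schema: dict) -> dict:
    """Branch on schema shape: component schemas -> models, paths only -> datasets."""
    component_schemas = endpoint_schema.get("components", {}).get("schemas", {})
    if component_schemas:
        return {"kind": "models", "schemas": component_schemas}
    # No component schemas: treat each path as a dataset to download
    base_url = endpoint_schema["servers"][0]["url"]
    return {"kind": "datasets", "urls": [base_url + path for path in endpoint_schema["paths"]]}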

It would make sense to parse these schemas into Pydantic models now and work less chaotically.
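
As a starting point, a minimal Pydantic sketch (v2 syntax, hypothetical model names) covering just the subset of the OpenAPI document dumped above:

from pydantic import BaseModel, Field

class Info(BaseModel):
    title: str
    version: str
    description: str = ""

class Server(BaseModel):
    url: str

class Operation(BaseModel):
    operation_id: str = Field(alias="operationId")
    summary: str = ""
    description: str = ""

class PathItem(BaseModel):
    get: Operation  # only GET operations appear in these schemas

class EndpointSchema(BaseModel):
    openapi: str
    info: Info
    servers: list[Server] = []
    paths: dict[str, PathItem] = {}

# EndpointSchema.model_validate(endpoint_schema).servers[0].url
# -> 'https://api.tfl.gov.uk/stationdata'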

The code is full of crap like this:

import json
from functools import cache
from pathlib import Path

# find_schema_by_name is assumed to be in scope (defined elsewhere in tubeulator.utils.paths)

@cache
def load_endpoint_schema(schema_name: str) -> dict:
    """Load an entire JSON schema for an API endpoint by its name, e.g. "Line" or "Mode"."""
    endpoint_schema = json.loads(Path(find_schema_by_name(schema_name)).read_text())
    return endpoint_schema

@cache
def load_endpoint_component_schemas(schema_name: str) -> dict[str, dict]:
    """Load all component schemas of a JSON schema for an API endpoint by endpoint name."""
    if schema_name == "stationdata":
        breakpoint()  # temporary debug hook: stationdata has no component schemas
    endpoint_schema = load_endpoint_schema(schema_name)
    component_schemas = endpoint_schema["components"].get("schemas", {})
    return component_schemas

Note that the @cache here is standing in for the "parse once" principle we'd want when working with Pydantic.
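
Concretely (reusing the hypothetical EndpointSchema model sketched above, plus the imports from the snippet), the cached loader could return the validated model itself, so each schema is parsed and validated exactly once:

@cache
def load_parsed_schema(schema_name: str) -> EndpointSchema:
    """Parse and validate the raw JSON exactly once; every caller shares the model."""
    raw = json.loads(Path(find_schema_by_name(schema_name)).read_text())
    return EndpointSchema.model_validate(raw)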