opentripplanner / OpenTripPlanner

An open source multi-modal trip planner
http://www.opentripplanner.org

Pluggable strategy for reading and writing to data sources #2891

Closed t2gran closed 3 years ago

t2gran commented 4 years ago

Today OTP reads all its input from the local file system and writes the graph back to the same local disk. This is fine in a static deployment, but in a cloud deployment it creates overhead: files must be moved from permanent storage onto a cluster node and back again when the OTP process is done. When the OTP process fails or freezes, it is difficult to detect, which makes the system less robust.

In a continuously automated devops ecosystem we would like OTP to integrate directly with the rest of the system, without having to wrap OTP and copy files around. We want OTP to read its input files from, and write its output files to, cloud storage directly, and to track the progress. This is most relevant when building a graph, but a solution should, if possible, not be limited to that.

At Entur we need to change how we currently do this, so we will implement support for pluggable file access, allowing the default (current) behavior to be switched to accessing Google Cloud Storage.

We post this issue here to let people know we are doing this in our private fork; if there is interest in it, we can make a PR to integrate it into OTP2.

We are not going to introduce dependencies on any Google Cloud specific libraries, just provide a pluggable extension point in OTP to swap in an alternative implementation. We will provide links to our GCS implementation in the Entur GitHub repo, in case someone wants to copy/use it.

t2gran commented 4 years ago

We are done with this in the Entur 1.x fork, and I will prepare a PR for OTP2 for it. I think this is really useful in a lot of situations. If not configured, the implemented solution works the same way as today (with a few exceptions, listed below). But by using the build-config.json it is possible to specify URI(s) for each file type (OSM, DEM, GTFS, NETEX, HTML BUILD REPORT (annotations), BASE GRAPH, GRAPH, OTP-STATUS).

The otp-status file is new and allows other components to check on the status of the build process, using a synchronization file. We have added support for Google Cloud Storage and for plain file URLs. It would be easy to add AWS and HTTP support as well: any API which supports a catalog of files that can be streamed is easy to support.
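For illustration, the per-file-type URIs could be configured roughly as in the sketch below. The key names and structure are assumptions made for this example (the gs: scheme points at Google Cloud Storage); see the PR and the OTP2 configuration documentation for the actual names.

```json
{
  "storage": {
    "osm": ["gs://my-bucket/osm/norway-latest.osm.pbf"],
    "dem": ["gs://my-bucket/dem/norway.dem.tif"],
    "gtfs": ["gs://my-bucket/transit/extra-feed-gtfs.zip"],
    "netex": ["gs://my-bucket/transit/rb_norway-aggregated-netex.zip"],
    "buildReportDir": "gs://my-bucket/reports/build-report",
    "baseGraph": "gs://my-bucket/graphs/baseGraph.obj",
    "graph": "gs://my-bucket/graphs/graph.obj",
    "otpStatusDir": "gs://my-bucket/status"
  }
}
```

With no such storage configuration, OTP keeps reading and writing local files as it does today.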

When zip files are streamed to OTP, OTP keeps the entire file in memory for the duration of processing it, because OTP accesses GTFS and NeTEx data in random-access order. When using the local file system, OTP still uses a random-access file (ZipFile) to access it, not copying everything into memory. This should not be a problem, but if it is, there are simple ways (such as a local file cache) to fix it.
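To make the trade-off concrete, here is a minimal sketch of the two access patterns, using Apache Commons Compress for the in-memory case. This is an illustration only, not OTP's actual code.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipFile;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

public class ZipAccessSketch {

    /** Local file: random access via the zip central directory, nothing buffered in memory. */
    static InputStream entryFromLocalFile(File zip, String entryName) throws IOException {
        ZipFile zipFile = new ZipFile(zip);
        return zipFile.getInputStream(zipFile.getEntry(entryName));
    }

    /**
     * Streamed file (e.g. from cloud storage): the whole archive is buffered in memory
     * first, because random access needs the central directory at the end of the file.
     */
    static InputStream entryFromStream(InputStream in, String entryName) throws IOException {
        byte[] bytes = in.readAllBytes();
        var zipFile = new org.apache.commons.compress.archivers.zip.ZipFile(
                new SeekableInMemoryByteChannel(bytes));
        ZipArchiveEntry entry = zipFile.getEntry(entryName);
        return zipFile.getInputStream(entry);
    }
}
```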

I will prepare a PR containing the refactoring of OTP that creates the necessary extension points for "store plugins". I will also prepare a PR with the Google Cloud plugin as a Sandbox module.
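As a rough sketch of what such an extension point can look like (the interface and method names below are illustrative, not OTP's actual API), a store plugin resolves a URI to a data source that the graph builder reads from or writes to, independent of where the bytes live:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

/**
 * Illustrative sketch of a pluggable data source, not OTP's real interface.
 * One data source represents a single input or output file (OSM, GTFS, NeTEx,
 * DEM, graph, build report, status file).
 */
public interface DataSource {
    String name();                 // e.g. "norway-latest.osm.pbf"
    URI uri();                     // file:, gs:, s3:, http: ...
    boolean exists();
    InputStream asInputStream();   // read input data
    OutputStream asOutputStream(); // write output data
}

/** A store plugin registers a repository that claims the URI schemes it can handle. */
interface DataSourceRepository {
    boolean canHandle(URI uri);
    DataSource get(URI uri);
}
```

The default repository would handle file: URIs against the local file system, while the Sandbox module would register a repository for gs: URIs.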

Breaking changes

There are a few minor breaking changes:

New features

t2gran commented 3 years ago

Put on hold until 1 June 2021. If there is no demand for the two remaining features before that date, we will close the issue.

The two remaining features:

  1. Support for reading and writing to AWS S3 file storage. This is almost implemented, but we do not have a deployment which uses AWS S3 for graph building, and we do not want to add something that is not used.
  2. Support for a build status file, which can be used to trigger the next step when OTP graph building is part of a chain of processes (see the sketch below).
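As an illustration of how such a status file could drive the next step in a chain of processes, the sketch below polls for a status marker. The file names used here (otp-status.ok, otp-status.failed) and the polling approach are assumptions for the example, not what OTP or the PR necessarily implements.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Sketch of a downstream component waiting for the OTP build status file. */
public class BuildStatusPoller {
    public static void main(String[] args) throws Exception {
        Path statusDir = Path.of(args[0]); // directory OTP writes its status file to
        while (true) {
            if (Files.exists(statusDir.resolve("otp-status.ok"))) {
                System.out.println("Graph build finished - trigger the next step");
                return;
            }
            if (Files.exists(statusDir.resolve("otp-status.failed"))) {
                System.err.println("Graph build failed - abort the pipeline");
                System.exit(1);
            }
            Thread.sleep(Duration.ofSeconds(30).toMillis());
        }
    }
}
```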

For both of these, if someone needs the feature and can provide the resources to test it, I can help with providing the implementation. The status file support PR: #2911.