derhuerst opened this issue 5 years ago
Hi @derhuerst, could you please elaborate (maybe through an example) on what kind of queries you would like to support?
In order to increase scalability, the Linked Connections (LC) server interface has been designed as a simplistic API that only provides documents containing a set of connections ordered by `departureTime`. For this, the only query parameter supported by the server is a `departureTime`, which the server uses to respond with the document that covers the provided date-time: https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:25:00.000Z
On top of that, the server also adds to each LC document some metadata for clients to discover more documents, namely the `previous` and `next` documents. For this, it uses hydra (a hypermedia vocabulary):

```json
...
"hydra:next": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:43:00.000Z",
"hydra:previous": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:06:00.000Z",
...
```
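As a minimal sketch of how a client might walk these pages via `hydra:next`: the page shape assumed here (connections under `@graph`, each with a `departureTime`) is an assumption on my side, not quoted from the spec.

```javascript
// Sketch: follow `hydra:next` links, collecting connections until one departs
// at or after `untilTime`. `fetchPage` is injected so this works against any
// transport (plain HTTP, mocks for testing, etc.).
const scanConnections = async (startUrl, untilTime, fetchPage) => {
  const connections = [];
  let url = startUrl;
  while (url) {
    const page = await fetchPage(url);
    for (const c of page['@graph'] || []) {
      // pages are ordered by departure time, so we can stop early
      if (new Date(c.departureTime) >= untilTime) return connections;
      connections.push(c);
    }
    url = page['hydra:next'] || null;
  }
  return connections;
};
```

Against a live server, `fetchPage` would be e.g. `(url) => fetch(url).then((res) => res.json())`.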
The LC server was designed mainly for route planning purposes, with the Connection Scan Algorithm in mind. We follow the idea behind Linked Data Fragments, where a compromise between the workload supported by servers and clients may lead to more scalable servers and more flexible data access. However, our main interest is to investigate the trade-offs of different Web APIs, so I am certainly interested in the use case you want to support.
@derhuerst What do you mean by sparse data exactly? A concrete example would help. This repository is mainly intended to host a Linked Connections server built from GTFS and GTFS-RT files. You can, however, also host a Linked Connections-compliant API built in a totally different way.
Thanks for your explanations.
I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.
I am aware that this is terribly inefficient (as the API response time will usually be an order of magnitude higher than file/DB access) and wasteful (as one would often need to query a whole lot more information than just the connections and throw them away after), but as an experiment, I'm interested nonetheless.
What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:
> I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.
I’ve been wondering for a long time whether that would be possible with e.g., HAFAS API responses, but each time I would bump into too many HTTP requests behind a Linked Connections page, as you need to do the matching between the departure and the arrival at the next station. As an experiment it might indeed be interesting nevertheless.
> What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:
The HTTP view code is quite small though, as @julianrojas87 pointed out above. While we agree that in the future we might support other data sources (most promising: real back-ends from PTOs), we currently have not been able to identify another data source that could be workable.
Would mirroring our HTTP output work for you at this moment? The spec is pretty small: https://linkedconnections.org/specification/1-0
I've written a HAFAS-based prototype at https://github.com/derhuerst/hafas-linked-connections-server. Could one of you have a look at whether the initial direction makes sense? I tried to run the `lc-client` CLI against it, but it seems to be stuck in a loop. If you have any requests or comments, please create an issue over there.
@derhuerst That’s really cool, and it does not run too slowly either! Really enthusiastic about this.
lc-client hasn’t been further developed for a while (it was the initial prototype). I’ve updated the repo to reflect this. We are however heavily developing Planner.js.
@julianrojas87 @hdelva can we set up on a test server a browser build where you can type in your LC-server (defaults to localhost:3000/connections), and where it automatically calculates a route from stop A to stop B? I’d say: no prefetching and only transfers based on same stop ID (no downloading of routable tiles).
@derhuerst Something lacking is the list of stops and their geo coordinates (indeed not part of the spec, but necessary if we want to visualize it). I’ll open some issues with ideas on your repo!
> can we set up [...] a browser build [...] where it automatically calculates a route from stop A to stop B?
Also keep in mind that I need to be able to pick arbitrary locations by myself in order to test this out with my HAFAS-based implementation.
Of course! That’s the reason I opened https://github.com/derhuerst/hafas-linked-connections-server/issues/1
Sorry for the inactivity on this issue. Lots of work travelling combined with some holidays now but will come back in a couple of weeks to complete the implementations.
Since the posts above, I have built `gtfs-via-postgres`, yet another tool to import GTFS data into a database. It also adds a `connections` view, which AFAIK is semantically very close to a list of `lc:Connection`s; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.

I now want to build a LC server that uses `gtfs-via-postgres` underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area) from the data storage logic and expose it, or at least make it re-usable.

In my case, I don't need much of the complexity and dependencies in `linked-connections-server`, because I have already downloaded, unzipped and parsed the GTFS, and don't consume a GTFS-RT feed (yet!).
What do you think?
I think what you propose totally makes sense. The only reason it is all bundled together is the convenience of having one command that does everything, and because we were not too aware of Docker back then.
I guess we would need to define a common interface to read the `lc:Connection` pages in the same way from `gtfs-via-postgres` and from disk.
> I guess we would need to define a common interface to read the `lc:Connection` pages in the same way from `gtfs-via-postgres` and from disk.
Yeah, something like `abstract-blob-store` (a bit less sophisticated maybe) for Linked Connections!
I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.
> Yeah, something like `abstract-blob-store` (a bit less sophisticated maybe) for Linked Connections!
Yes indeed, I was thinking the same. I had in mind something like `abstract-leveldown`.
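For illustration, such an abstract interface could look roughly like this in Node.js. All names here (`ConnectionsStore`, `InMemoryConnectionsStore`, `getConnections`) are invented for the sketch and are not part of any existing package:

```javascript
// Hypothetical abstract-leveldown-style base class: the HTTP layer would only
// ever talk to this interface, never to a concrete data source directly.
class ConnectionsStore {
  // Resolve up to `limit` connections departing at or after `departureTime`.
  // Implementations (file-based, PostgreSQL-based, HAFAS-based, …) override this.
  async getConnections(departureTime, limit) {
    throw new Error('not implemented');
  }
}

// A trivial in-memory implementation, useful for tests and as a reference:
class InMemoryConnectionsStore extends ConnectionsStore {
  constructor(connections) {
    super();
    // keep connections sorted by departure time, as LC pages require
    this.connections = [...connections].sort(
      (a, b) => new Date(a.departureTime) - new Date(b.departureTime),
    );
  }
  async getConnections(departureTime, limit) {
    const t = new Date(departureTime);
    return this.connections
      .filter((c) => new Date(c.departureTime) >= t)
      .slice(0, limit);
  }
}
```

The point of the base class is that the page-serialisation and hydra/metadata logic can be written once against `getConnections`, with each backend only implementing that one method.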
> I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.
That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface to wrap the data storage half in it.
> I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

> That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface to wrap the data storage half in it.
Yeah, most of the work on my proof-of-concept implementation will be transforming the HTTP/server logic to be data-source-agnostic, so there would be a lot of duplicated work. So if you're fine with that, I'll propose both an API and an `express`-based implementation.
Sounds good to me. Please go ahead and I'll jump in once we have your proposal to avoid duplicated work.
Looks like I never gave an update, so I'll do that now, even though I didn't work on the Linked Connections side of things.
> Since the posts above, I have built `gtfs-via-postgres`, yet another tool to import GTFS data into a database. It also adds a `connections` view, which AFAIK is semantically very close to a list of `lc:Connection`s; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.
I have tweaked `gtfs-via-postgres` and use it for several performance-sensitive use cases where I access the GTFS data in a similar fashion (focusing on arrivals/departures instead of connections, but they're very similar storage-wise). It allows me to keep the GTFS in a relatively compact shape (roughly 4x the CSV size, e.g. 12 GB with the 2.8 GB Germany-wide GTFS feed) while allowing fast data access & analysis (see `gtfs-via-postgres`' benchmarks).
`gtfs-via-postgres`'s `connections` view is quite fast if you filter by stop, station or route, but it is currently not optimised for returning connections by date+time across all stops/routes (~7s per access).
> I now want to build a LC server that uses `gtfs-via-postgres` underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area) from the data storage logic and expose it, or at least make it re-usable.
About a year ago, I built this as `gtfs-linked-connections-server`. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but created https://github.com/derhuerst/gtfs-linked-connections-server/issues/1 as a tracking issue.
> `gtfs-via-postgres`'s `connections` view is quite fast if you filter by stop, station or route, but it is currently not optimised for returning connections by date+time across all stops/routes (~7s per access).
The `connections` view is still not optimised: upon querying `/connections?lc:departureTime`, PostgreSQL will compute all connections in the dataset after the specified departure time, order them, and then return the specified number of connections (and likewise with `lc:arrivalTime` in the other direction). I'm not sure how to optimise this while retaining the correct DST behaviour.
> About a year ago, I built this as `gtfs-linked-connections-server`. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but created derhuerst/gtfs-linked-connections-server#1 as a tracking issue.
`gtfs-linked-connections-server` now supports `/connections?{lc:departureTime,lc:arrivalTime}`, `/connections/:id`, `/stops?{before,after}` & `/stops/:id`.
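As a small usage sketch against these endpoints: the base URL and the helper names (`buildConnectionsUrl`, `buildStopsUrl`) are assumptions of mine; only the paths and query parameters come from the list above.

```javascript
// Hypothetical helpers building request URLs for the endpoints listed above.
const buildConnectionsUrl = (baseUrl, departureTime) => {
  const url = new URL('/connections', baseUrl);
  // `URLSearchParams` percent-encodes the `:` in the parameter name
  url.searchParams.set('lc:departureTime', departureTime.toISOString());
  return url.href;
};

const buildStopsUrl = (baseUrl, {before, after} = {}) => {
  const url = new URL('/stops', baseUrl);
  if (before) url.searchParams.set('before', before);
  if (after) url.searchParams.set('after', after);
  return url.href;
};
```

These could then be consumed with e.g. `fetch(buildConnectionsUrl('http://localhost:3000', new Date()))`, assuming the server runs locally on port 3000.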
I'm not sure if I got the TREE stuff right, and I haven't tried consuming it with a linked-data-aware client yet. I still think this should be handled by a generic TREE server lib, where you would pass in metadata as well as data retrieval functions.
A random, only somewhat related thought: I don't know Rust very well, but it seems like this generic TREE server lib would fit Rust's trait model very well, given that any other code from any unrelated domain could still easily adopt the TREE HTTP semantics.
@derhuerst Do you want us to validate it somehow and test it with an RDF library?
> Do you want us to validate it somehow and test it with an RDF library?
That would be a great contribution, yes!
We could also conceive the aforementioned TREE HTTP server; I think it would make both `linked-connections-server` and `gtfs-linked-connections-server` more focused.
Can you link me up with either an HTTP server that's publicly reachable, or set-up instructions for running such an HTTP server locally?
```sh
mkdir gtfs-lc-test
cd gtfs-lc-test
# download GTFS
wget --compression auto -r --no-parent --no-directories -R .csv.gz -P vbb-gtfs -N 'https://vbb-gtfs.jannisr.de/2022-09-09/'
rm vbb-gtfs/shapes.csv
# import GTFS
env PGDATABASE=postgres psql -c 'create database vbb_2022_09_09'
export PGDATABASE=vbb_2022_09_09
npx --package=gtfs-via-postgres@4 -- gtfs-to-sql --require-dependencies --trips-without-shape-id --stops-location-index -- vbb-gtfs/*.csv | sponge | psql -b
# serve LC server
npx derhuerst/gtfs-linked-connections-server#1.2.1
```
I want to build Linked Connections wrapping sparse data sources, which I need to query for connections on demand. This means that:
How would this work with `linked-connections-server`?