Difficult to determine when to stop loading RPDE pages

nathansalter commented 4 years ago

On a reasonably high-traffic website, an RPDE client may never reach the end of an RPDE feed if updates turn over fast enough. If it takes your client 5 seconds to download and process an RPDE page, more updates might have happened by the time you get the next page. Due to the nature of the modified query param, it's impossible to tell whether you're only getting updates made since the last page or whether you're basically fully up to date.

This could cause issues in high-traffic websites where clients continue to pull pages from the feed and are never able to determine when they have reached the end of the feed. If these clients are running on a cron, subsequent cron runs could also have this issue, leading to data integrity or even data loss issues.

I suggest that we add a parameter to the feed response, to indicate to a client when it should stop retrieving pages. This could be done in one of three ways:

A maxModified parameter, signifying to clients that once they pass that value they should stop crawling pages. This might cause an issue with caching.
A limitPages parameter, signifying to a client that it should crawl only this number of pages in this request. A better approach, but it could cause clients to slowly drift out of sync.
An isLast parameter, which signifies when a page is considered to have got the client up to date.

nickevansuk commented 4 years ago

@nathansalter interesting feedback as always! Can I just check you've seen the "last page definition": https://openactive.io/realtime-paged-data-exchange/#last-page-definition

This indicates when you've got all data in the feed, though you should continue polling indefinitely to get further updates (an RPDE feed has effectively infinite pages).
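As a rough illustration of how a cron-style client might use that definition, here is a minimal sketch. It assumes the last page is the one whose "items" array is empty and whose "next" URL equals the URL just requested (my reading of the linked section, so please check it against the spec). The names is_last_page, crawl_until_caught_up, fetch and handle_item are hypothetical and not part of RPDE.

```python
from typing import Any, Callable, Dict

Page = Dict[str, Any]  # parsed JSON body of one RPDE page


def is_last_page(page_url: str, page: Page) -> bool:
    """Assumed last-page check: no items, and "next" points back at this page.

    Plain string comparison is used here; a real client may need to
    normalise URLs before comparing them.
    """
    return not page.get("items") and page.get("next") == page_url


def crawl_until_caught_up(
    start_url: str,
    fetch: Callable[[str], Page],          # hypothetical: GETs a URL, returns parsed JSON
    handle_item: Callable[[Any], None],    # hypothetical: stores or processes one item
) -> str:
    """Follow "next" links until the last page, then return the URL to resume from."""
    url = start_url
    while True:
        page = fetch(url)
        for item in page.get("items", []):
            handle_item(item)
        if is_last_page(url, page):
            # Caught up for now: persist this URL and poll it again later.
            return url
        url = page["next"]
```

A client run from cron could save the returned URL and pass it as start_url on the next run, while a long-running client would sleep briefly and request the same URL again.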

Could I also check you've seen the scaling considerations for RPDE (https://developer.openactive.io/publishing-data/data-feeds/scaling-feeds#worked-example), which explain how a publisher can control the polling load on their server in high-traffic scenarios by leveraging cache headers and a CDN?
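To make the cache-header idea concrete, here is a hedged sketch of how a publisher might choose a Cache-Control value per page: a short lifetime for the last page (so clients polling it see new data quickly) and a long lifetime for pages that already contain items. The function name and the default values of 8 and 3600 seconds are illustrative assumptions only; the linked guidance should be treated as the source for real values.

```python
from typing import Any, Dict


def cache_control_for(
    page_url: str,
    page: Dict[str, Any],
    last_page_max_age: int = 8,      # assumed value, for illustration only
    full_page_max_age: int = 3600,   # assumed value, for illustration only
) -> str:
    """Return a Cache-Control header value for one RPDE page response."""
    # Assumed last-page check: empty "items" and "next" equal to this page's URL.
    on_last_page = not page.get("items") and page.get("next") == page_url
    max_age = last_page_max_age if on_last_page else full_page_max_age
    return f"public, max-age={max_age}"
```

With a CDN in front of the feed, a header like this would let most polling requests be answered from cache instead of reaching the origin server.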

Does this solve the issue?

nathansalter commented 4 years ago

@nickevansuk yes, this does clear things up a lot. The intention is that clients will continue to pull the latest page, even if it's small and has very few items in it. The caching should then help stabilise where the pages sit, and servers are allowed to implement their own backoff strategies.

Cheers for the clarification!

openactive / realtime-paged-data-exchange

Difficult to determine when to stop loading RPDE pages #97