openreferral / api-specification

This is the working repository for Open Referral's Human Services Data API protocols.
https://openreferral.readthedocs.io/en/latest/hsda/

bulk load project #58

Open kinlane opened 6 years ago

kinlane commented 6 years ago

Proposal to introduce a custom HSDA media type for bulk loading, which would provide a single point for GET and POST of data.

Open311 has a similar initiative - http://wiki.open311.org/

While still offering an "everything/" path on the main API, this would allow the core API to reflect the flat nature of the schema, keeping it usable by spreadsheet users. A rough sketch of what a bulk POST might look like is below.
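To make the idea concrete, here is a minimal sketch of a bulk POST against such a path. The endpoint URL and the `application/vnd.hsda.bulk+json` media type are illustrative assumptions, not anything defined in HSDA:

```python
import requests

# Hypothetical bulk endpoint and media type, for illustration only --
# neither this "/everything/" URL nor "application/vnd.hsda.bulk+json"
# is defined anywhere in the HSDA specification.
BULK_URL = "https://api.example.org/everything/"
MEDIA_TYPE = "application/vnd.hsda.bulk+json"

# Flat, table-shaped payload mirroring the HSDS schema.
payload = {
    "organizations": [{"id": "1", "name": "Example Org"}],
    "services": [{"id": "1", "organization_id": "1", "name": "Example Service"}],
}

# An explicit Content-Type header overrides the default set by json=.
resp = requests.post(BULK_URL, json=payload, headers={"Content-Type": MEDIA_TYPE})
resp.raise_for_status()
```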

This is the first proposal to use media types to break the project into smaller chunks so they do not affect other types of users -- meeting the needs of systems users without impacting, say, web, spreadsheet, SPA, or conversational interface users.

While I am putting this under v1.2 -- the bulk API would branch and take on its own versioning, as a separate branch.

timgdavies commented 6 years ago

I am cautious about this proposal.

The alternative, non-API approach to this is simply to encourage publication of HSDS files to a web-accessible URI, and then to pass the consuming system the URL of that file.

In general, I think of bulk loads as a 'pull' process, rather than API 'push'.
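For illustration, a minimal sketch of that pull pattern in Python, assuming a hypothetical published file URL:

```python
import requests

# Hypothetical URL where a publisher has posted an HSDS file.
HSDS_URL = "https://data.example.org/hsds/services.csv"

# The consuming system fetches the file on its own schedule.
resp = requests.get(HSDS_URL, timeout=30)
resp.raise_for_status()
with open("services.csv", "wb") as f:
    f.write(resp.content)
```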

And if you need to push a bulk HSDS file to an API, that might be better achieved by a script that reads the HSDS and then pushes item-by-item, rather than asking API implementing systems to accept bulk files?
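A rough sketch of the kind of script described above, assuming a hypothetical HSDA `/services/` endpoint that accepts one record per POST:

```python
import csv
import requests

# Hypothetical HSDA endpoint accepting one service record per POST.
API_URL = "https://api.example.org/services/"

with open("services.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Push one record at a time, so the receiving system can
        # validate and throttle each item individually.
        resp = requests.post(API_URL, json=row)
        resp.raise_for_status()
```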

kinlane commented 6 years ago

Let's add discussion about this to the meeting. I'm a little confused, as you seem to be articulating what I'm proposing? I feel like there might be some disconnect about what an API is. You are proposing moving text/csv via HTTP (that is an API). Not sure why push or pull or stream via application/json would be different.

I'm suggesting moving files via HTTP requests. Not making any recommendations yet about file size, file type, whether it is push / pull / webhook / stream, or what the trigger is. Would like to open this up for more conversation.

I'm suggesting the need to push a bulk HSDS file to an API, based upon people asking for this functionality during the v1.1 feedback phase. This issue is to give a place to continue that feedback out of scope of the basic API.

NeilMcKLogic commented 6 years ago

Maybe the difference @timgdavies is pointing out is that if I have a file containing HSDS data (or actually a set of files, per the spec), rather than synchronously pushing them to an API for ingestion, I might prefer to transfer them asynchronously via (S)FTP to the recipient, who can then manage the ingestion in a throttled way. This would be a great way to start a data-sharing relationship in cases involving a large amount of data.
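A minimal sketch of that kind of asynchronous transfer, here using the paramiko library as one possible tool; the host, credentials, and remote path are all placeholders:

```python
import paramiko

# Placeholder host, credentials, and remote path.
transport = paramiko.Transport(("sftp.example.org", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

# Drop the HSDS file somewhere the recipient can ingest at their own pace.
sftp.put("services.csv", "/incoming/hsds/services.csv")

sftp.close()
transport.close()
```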

kinlane commented 6 years ago

I would argue that batch HTTP API approaches have exceeded what is possible via FTP. Everything in the space is leading towards doing things in smaller chunks: APIs becoming microservices, microservices running in containers, and large volumes running as jobs that act as transactions which can be scheduled at optimal times, queued, and rolled back. With webhooks these become two-way streets, with notifications of events, either micro or macro.

If FTP is the access formula, I would say HSDS in an FTP location is your jam, but I wouldn't burden an HSDA specification with this path.

However, for a living, breathing, evented solution for moving large volumes of data around in real time, HTTP is well suited and provides many modern options. HTTP/2 using gRPC can provide multiplexed streams (which FTP can't), and with an event system everyone knows what's up, and nothing breaks. If performance is needed then HTTP/2 is the jam, and protobuf can be generated from an OpenAPI definition -- see all of Google's modern APIs for reference. All their new APIs are OpenAPI + gRPC by default to handle high-volume scenarios.
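As one illustration of the "smaller chunks over HTTP" point (stopping short of gRPC), a generator passed to requests' `data=` parameter streams a large HSDS file with chunked transfer encoding, so neither side holds the whole payload in memory. The endpoint shown is hypothetical:

```python
import requests

def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield a large HSDS export in 1 MB chunks so it never sits in memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Passing a generator to data= makes requests use chunked transfer encoding.
resp = requests.post(
    "https://api.example.org/everything/",  # hypothetical bulk path
    data=read_in_chunks("hsds_export.csv"),
    headers={"Content-Type": "text/csv"},
)
resp.raise_for_status()
```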

I would encourage seeing my spec before shutting this down entirely.