Bulk data exports - Githubissues

molomby commented 6 years ago

The ability to dump data from a Keystone, via the Admin UI (probably the current view). At a technical level there are two broad approches:

A) Pull the data into Node and produce a file for download

This gives us full control of the format (Excel file, CSV, JSON, etc) so is nicest for the user. We can pick columns, ordering and reference hydrated (app generated/virtual) values.

However, without some significant work (generating from a worker, batching items) this approach does not scale well. The prior art (KS4) uses this approach and i's known to have caused outages on large sites.

B) Leverage DB platform functionality

Alternatively, we may want to leverage functionality of the underlying DB directly. Specifically...

mongodump can be given conditions to filter the documents dumped from a collection. Issues: single collection only, not hydrated, format is difficult to consume, dumps all fields.
the pgsql copy command can write queries out directly to a CSV. It may also be possible to dump JSON in this way. Less issues than mongodump; copy can query across multiple tables/views, control columns, order, format values, etc. Also it's super fast. Locally I can dump 80k+ records (20Mb) to a CSV in under 200 ms.
Etc.. for other DB platforms.

These tool have a much higher capacity to scale but at the cost of some flexibility (esp. Mongo). This is also something we'd need to work into the adapter framework.

molomby commented 6 years ago

Current preference is to focus on "A" but architect it as a background process (#307) that batch reads/formats docs into a file. The user who requested the file could then be notified by email.

The approches outlined in "B" could do well as custom mutations (#309) sitting behind custom elements in the admin UI (#308).

VinayaSathyanarayana commented 6 years ago

Should we think of excluding some sensitive (privacy related) fields during export? The exclusions could be global or based on role

molomby commented 6 years ago

Yeah, it would run through all the ACL stuff, so users couldn't export data they couldn't read.

The "export value" should probably be something specifically supplied by field type though (and so should form part of the field type interface). There's really no reason anyone should be exporting Password hashes, for example.

VinayaSathyanarayana commented 5 years ago

Any update on this?

MadeByMike commented 5 years ago

I feel like this one could be a good candidate for a 3rd-party package

gautamsi commented 5 years ago

for this we would need extension points for adding UI elements in admin-ui`

molomby commented 5 years ago

I feel like this one could be a good candidate for a 3rd-party package

@MadeByMike, Yeah possibly, it really needs DB adapter support though. Keystone's job here is probably to define the API and let DB adapters implement it or not, any third party export tools can pick up from there.

@gautamsi is also right.

MadeByMike commented 5 years ago

Adding hooks to the Admin UI opens a lot of doors and is becoming a higher priority

stale[bot] commented 4 years ago

It looks like you haven't had a response in over 3 months. Sorry about that! We've flagged this issue for special attention. It wil be manually reviewed by maintainers, not automatically closed. If you have any additional information please leave us a comment. It really helps! Thank you for you contributions. :)

MadeByMike commented 4 years ago

@molomby are we clear enough now to say that bulk data exports is something that should be handled at the database layer and handled by a separate tool: https://github.com/keystonejs/keystone/discussions/299

With that and the adminUI extension points as well as custom queries it should be possible for people to make a basic bulk export feature in the AdminUI.

I would suggest that is enough to close this issue as it unlikely this is going to become a core feature for Keystone.

molomby commented 4 years ago

@MadeByMike writes..

are we clear enough now to say that bulk data exports is something that should be handled at the database layer and handled by a separate tool?

This is a bit of a quagmire because there are a few different dimensions we can solve for:

High-level -vs- low-level -- Is this something that should operate regardless of the DB platform used, eg. via GraphQL, getting all the benefits of access control, read hooks, etc. or should this be implemented directly against the DB for efficiency sake.

Scaleable -vs- naive -- The KS4 implementation of this was let you apply some filters in the admin UI then click a button to download all filtered items (as a csv I think). It was quite handy but also very naively built -- it ran as a long running web request and hurled a single (potentially huge) query at the DB. The lack of any hard limits made it quite easy to DOS your own system with a single click. Fundamentally, bulk actions like this should be async and performed by a background thread or worker process; eg. you'd queue an export and the system would email you or something when it was complete. This is obviously a lot more complicated.

Generic -vs- ad hoc -- For a scaleable solve (as above) the solution needs to work closely with the production infrastructure for things like worker processes and mail gateways. But we don't have any documented patterns or guidance on how this stuff should run so building a decent generic solution (either in core or as a 3rd party extension) isn't really possible yet. The only generic solution practically possible right now would be something simple that doesn't scale.

So what do we do?

I think the only viable solution in the short term is a naive, high-level version; like the KS4 one (but maybe with some hard limits enforced). It wouldn't need any of the complex infrastructure support and seems like a very partial solution to me (it's liable to lead devs in the wrong direction). But it would be undeniably useful on small projects.

So to reframe your question @MadeByMike, would we want this ^^ to be part of core? I'm not sure myself. Maybe not?

keystonejs / keystone

Bulk data exports #310