hoodiehq / camp

:circus_tent: Welcome to Hoodie Camp!
https://hoodie.camp
Apache License 2.0
99 stars 55 forks source link

Background tasks (task module) #100

Open gr2m opened 7 years ago

gr2m commented 7 years ago

This is a high-level issue for a complicated task, meant for discussion and to find a "champion" who wants to take ownership of bringing background tasks back to Hoodie. Comment below if you are interested in helping us :)

❓ The Motivation

Background tasks can be used to trigger server-side logic provided by plugins. Tasks are simply JSON documents just like data. And just as with data, we have offline sync, so users can start tasks even without an internet connectivity, which is more robust than relying on REST APIs.

🤔 What you will need to know to work on this issue

Node.js, hapi, PouchDB and you should have a general understanding of Hoodie’s sync architecture.

We had background tasks implemented in the "old" hoodie, tasks were simply stored in the users database and synchronized to the user’s database in the backend. We found out eventually that this was a bad design decision, as we want to use tasks even if the user did not yet sign up, and listening to all users databases has a very bad performance impact. So for the new hoodie, we decided to go with a dedicated "tasks" database instead.

We already have @hoodie/task-client which is very little code, it is using [@hoodie/store-client under the hood. And we have @hoodie/task-server which exposes only the CouchDB APIs relevant for sync. The idea is that the server is filtering /_changes so that a user can only see the tasks belonging to them, but it is not yet implemented. I would recommend to compare to hoodie-store-server’s implementation and start from there. Note that hoodie-store-server also abstracts away whether the app is using a CouchDB as its backend and if not falls back to PouchDB. We need the same for the task server module.

And lastly there is hoodie-standalone-task, a module we created to test the client & server together. I would recommend to use it for local development and testing. Once we got it all working in hoodie-standalone-task we can add the server & client code to hoodie itself (in @hoodie/client and @hoodie/server respectively).

hoodie-standalone-task is only a temporary repository so whoever is championing this, I’m happy to give you full admin rights so you can just merge your own pull requests yourself as you need it. For example, I think it would make sense to extend demo/index.html to have some UI that might help with debugging, like logging events into a table for example. Unfortunately there seems to be a bug in hoodie-standalone-task right now, I couldn’t get it work myself, but it worked in the past. So first task will be to get it working again :)

inator commented 7 years ago

I'm in!

inator commented 7 years ago

After working with Gregor today to gain a better understanding of the vision for tasks (thanks @gr2m!), here's what I propose are the low fruit todos to get started:

More to come, please comment on additions/changes to this list and I'll update accordingly. Thanks!

gr2m commented 7 years ago

👍 sounds good!

inator commented 7 years ago

@gr2m you wrote:

Unfortunately there seems to be a bug in hoodie-standalone-task right now, I couldn’t get it work myself, but it worked in the past. So first task will be to get it working again :)

Can you explain the buggy behavior? I'm running it now and it would be helpful to know what you were experiencing before I dive deeper. So far the PouchDB instance starts, client bundling occurs, and on the surface it seems to be running ok aside from the 404 errors noted as expected. Granted I haven't tried to do anything useful with it yet. :)

gr2m commented 7 years ago

When I run npm start in hoodie-standalone-task after installing dependencies, I see

$ npm start

> hoodie-standalone-task@ start /private/tmp/hoodie-standalone-task
> bin/server.js

161129/210509.099, [log,couchdb-task], data: Starting PouchDB Server...
161129/210510.078, [log,couchdb-task], data: PouchDB Server ready at localhost:1234/_utils
Server running at: http://gregor-2.local:3000
161129/210517.457, [response], http://gregor-2.local:3000: get / {} 200 (21ms) 
161129/210517.480, [response], http://gregor-2.local:3000: get /task.js {} 200 (28ms) 
161129/210517.694, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (6ms) 
161129/210517.833, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (7ms) 
161129/210517.857, [response], http://gregor-2.local:3000: get /favicon.ico {} 404 (2ms) 
161129/210518.570, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (2ms) 
161129/210518.575, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (3ms) 
161129/210519.444, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (2ms) 
161129/210519.450, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (1ms) 

the gets and puts continue, something’s not right there. I tested with a fresh install using node v6.9.1 and npm 3.10.9

gr2m commented 7 years ago

But note that the setup that spawns pouchdb-server in a child process is outdated, we no longer use it. Instead, @hoodie/task-server should either proxy to a CouchDB (maybe we can start with that) or use express-pouchdb in other cases. Here is how @hoodie/store-server does it

I’d spend a moment to try fix the setup with hoodie-standalone-task, but if it becomes a rabbit whole, I’ll try to do the same we do in@hoodie/store-server and compare to how https://github.com/hoodiehq/hoodie-store works, it is to the Hoodie Store module what hoodie-standalone-task is to the Task module

inator commented 7 years ago

An update:

After a thorough review of the hoodie-standalone-tasks repo, and an attempt to repair, Gregor and I determined that it's just going to take too much to bring it up to date and functional. Instead, I've started the process of creating a new standalone "hoodie-task" working example using hoodie-store as the basis.

So far, I've managed to get the REST portion of the tasks-server working with the endpoints as originally conceived - minus any queue id filtering, which will be necessary in the end product. Aside from filtering, next up will be the task server side API, which will surely include support for task events much like the store API does today. And since this is all based on the store-server module, I believe much of that will already be baked in through the Hoodie Store API. I also plan to leverage the other Store APIs that should allow for server-side manipulation of the lifecycle of tasks, including creation, updates, closing, etc.

All of this will likely make more sense when the code is pushed, but I wanted to get the basic thought process out there for comments sooner than later. More to come!

inator commented 7 years ago

@gr2m (and anyone else that's interested) - please have a look at this approach and let me know if I'm on the right track. Thanks much!

gr2m commented 7 years ago

left a comment at https://github.com/appback/task-server/commit/8e42dc4933e7839329bd7c0b5d8373131a4d54f6#commitcomment-20270126

inator commented 7 years ago

@gr2m: First task-server PR now ready for your review.

inator commented 7 years ago

@gr2m - as it relates to the start() method here:

queue.start(attributes/*, options */)

I'm wondering if we should go with something like this instead:

queue.add(attributes/*, options */)

For me personally, I would think of queue.start() as a method for starting the entire queue, rather than initiating a single task item.

your thoughts?

inator commented 7 years ago

With that ^ in mind, we'd likely want something like the following as well:

queue.remove() // was abort()
queue.readd() // was restart()
inator commented 7 years ago

Todos updated

gr2m commented 7 years ago

I think neither are perfect, both add and start look like they just add the task, while the method only resolves after the full life cycle was executed. But I think the functionality we will need is clear and I would suggest to focus on that first, to get the task module itself working. Once we have all in place, we should have an extended discussion about the naming of the methods

inator commented 7 years ago

While working on the task modules, there's been some design concepts - some old and some new - that have been discussed between Gregor and I that we feel should be documented and up for discussion while development is underway. Here's the (work-in-progress) list of those items that collectively describe the new tasks approach. I'll do my best to update it as discussions occur in the various channels and as development progresses. Please be sure to reach out either here or in the Slack #Dev channel if any of you have questions, ideas, or comments. Thanks much!

Additionally:

(more to come - including links to relevant issues/materials - this is a quick list off the top of my head)

janl commented 7 years ago

On the server, task docs will be stored in the tasks db and will be filtered by user

Meaning using a CouchDB replication filter to only sync tasks docs to a client that belong to a specific user?

There are two problems with this:

  1. with large number of users, this becomes a very slow operation. I’d say for now we can get away with 1k-10k users, but beyond that I’d be careful.
  2. if a user has access to one doc in a db on a server, they have access to all other ones. The store-server module only does per-database authentication and authorisation. Per-doc authz inside task-server is not advisable (as you need to make absolutely sure you don’t accidentally leak any information from other users., and I wouldn't trust myself to get that right. The only reliable way to do this is a task db per-user. Would that be an option?

task-server will emit add, update, remove, and change task events task-client will emit events upon start, success, error and (new) progress

Why different APIs on client and server? Could we get a full list of APIs on both sides, including methods and the events they issue?

Maybe something like this: client.task.start(attributes) -> server.on('started', function (attributes) {})

Sorry if that exists elsewhere, pointers appreciated ;)

inator commented 7 years ago

@janl - thanks for the questions!

Meaning using a CouchDB replication filter to only sync tasks docs to a client that belong to a specific user?

I'm admittedly still coming up to speed on how we might accomplish the filtering, so I'll defer to @gr2m to explain the conceptual approach, but I believe the intent is to handle that on the db side of the proxied routes, to ensure there's no leakage. As you can see, that hasn't been determined yet, so your feedback is important here.

Why different APIs on client and server? Could we get a full list of APIs on both sides, including methods and the events they issue?

The (limited) documentation on task-server api can be found here, and the (slightly outdated) task-client api can be found here.

Valid points! We'll definitely have to give some thought to what you're suggesting.

gr2m commented 7 years ago

with large number of users, this becomes a very slow operation. I’d say for now we can get away with 1k-10k users, but beyond that I’d be careful.

I had the discussion with @glynnbird who is working on envoy, a solution that emulates a-database-per-user while still storing all data in a single database.

What we were thinking is to let only initial bootstrapping request through. Once they get "since=now", we would handle them in the Node server. That way we would have a single continuous /_changes request to the database, and then using events in the Node server to respond to the continuous changes requests from the client. This must be well tested and secure of course, but it certainly would scale better than having thousands of separate databases with open changes requests, I think? And the other problem with separate databases is that we won’t be able to query data across them easily, which is more important for tasks than for user data.

if a user has access to one doc in a db on a server, they have access to all other ones

Users won’t have access to the database at all. They can’t do GET /hoodie/store/api/tasks to access the tasks database, it’s accessible to admins only. We will only expose routes required for replications on the task module. And filtering should be fairly simple and secure as we will use _id prefixes.

janl commented 7 years ago

And the other problem with separate databases is that we won’t be able to query data across them easily, which is more important for tasks than for user data.

How is this more important for tasks? What use-cases or implementation details are you considering? What would be the problem with a task db per user?

One concern is listening to database changes, and that’s easier with one vs. one per user. With CouchDB 2.0, we can already build an efficient “listen to all databases’ _changes feeds” e.g. fairly easily. For CouchDB 1.x it’d be semi efficient, but we have had this code in Hoodie before.

I’d say the “query across dbs” is a very different concern from what we are trying to do here, so I’d like to suggest to keep that a separate issue as long as it isn’t required for implementing a feature. E.g. later, we might consider this so important that we put all user’s dbs behind a single envoy database.

FWIW, I haven’t had time to look into envoy thoroughly, but from a glance (and from the README), this is in very early stages and I wouldn’t bet a major Hoodie feature on this just yet. If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.

if a user has access to one doc in a db on a server, they have access to all other ones Users won’t have access to the database at all.

Then I don’t understand how you want to do filtered replication. It means accessing the db.

Or do you mean tasks should work online only?

janl commented 7 years ago

The (limited) documentation on task-server api can be found here, and the (slightly outdated) task-client api can be found here.

@inator Thanks for the pointers. I wanted to make clear that I’m okay with having different APIs on client and server, because the environments are different. But I think the same concepts should have the same name on both sides.

E.g. on the client we do task.start() but to listen for a started task on the server we do task.on('add', handler). I’d make them the same. At the moment it looks like the server concepts are named after hoodie.store (or at lest Couch/Pouch document semantics) and the client concepts are named around tasks (not store/documents). There might be a good reason for this, but it might be worth considering mirroring these concepts on both sides. (I hope this wasn’t too obtuse. Let me know if anything here is unclear).

gr2m commented 7 years ago

@janl don’t bother looking at all of envoy, I don’t think that more than 10 lines will be relevant for our case. In POST /hoodie/task/api/queue/<id>/_bulk_docs we check if all docs are prefixed with <id>. In GET /hoodie/task/api/queue/<id>_changes we only return documents that are prefixed with <id>. I think that’s very trivial, I don’t see the complexity in this?

I would also separate the discussion on tasks and user databases. Tasks will be simpler, we only expose replication APIs and don’t even need to add/remove _id prefixes on-the-fly, we can store documents with prefixes on the client already.

How is this more important for tasks? What use-cases or implementation details are you considering? What would be the problem with a task db per user?

In the admin dashboard, I want to see all tasks that failed across all user accounts with their error messages. And I want to query all tasks that succeeded. I don’t know how to make this possible with a database per user.

If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.

It must work with PouchDB adapters for persistence, too, so we will have to implement it within Hoodie either way

janl commented 7 years ago

In the admin dashboard, I want to see all tasks that failed across all user accounts with their error messages. And I want to query all tasks that succeeded. I don’t know how to make this possible with a database per user.

hoodie-task-server could write this information into a global task database that is only available to admins on('success') and on('error') for example. Or if we don’t want that code complexity, we replicate all user-task dbs into a admin-only global task db and do the queries there. Or even simpler: we could write the errors/successes into a log file and display that in the admin interface

I would also separate the discussion on tasks and user databases.

My question was more meant like this: how important is it to see globally tasks failed/succeeded? As opposed to e.g. only failed per user, or a log file of errors? And then: there are many features come to mind that would require a query across all user dbs. How important are these features? Should we rather make a strategic decision now to use only one large db for all user data and one for tasks, so we can do all these things. — I’m not necessarily advocating for this, but If we do this, we better do it now.

I think that’s very trivial, I don’t see the complexity in this?

Those are not all the APIs you need for replication. The complexity needed to proxy every request needed for replication and make them safe so we can guarantee that no data is accidentally leaked is way too high for us at this point.

We’ve talked about all this when moving the authentication to the Hoodie layer and the only reason I was, in the end, happy to go with it, is because we do the auth business in the HTTP headers, and it doesn’t matter what kind of HTTP requests are made.

There are two options here that are safe:

  1. per-user task dbs.
  2. no replication and using task-server routes from the client directly and not use replication for offline functionality.

Tasks will be simpler, we only expose replication APIs

The statement that makes me nervous is “only the replication API”. There is no such thing. CouchDB replication just uses the APIs that regular end-users can use. There is no “we allow replication, but not regular access to the database”. That means we have to secure every single request to CouchDB and as stated above, I believe the complexity for this is too high for Hoodie at this point in general. And there are are alternatives for solving the required backend features.

and don’t even need to add/remove _id prefixes on-the-fly, we can store documents with prefixes on the client already.

What if user jan creates a doc with prefix gregor? Can I start tasks for you? Do I get updates for your task then? Is the task sending me an email of all your user data? We can’t trust the client for this. The server will have to do the prefixing based on the authentication that comes with the username.

janl commented 7 years ago

If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.

It must work with PouchDB adapters for persistence, too, so we will have to implement it within Hoodie either way

I meant Envoy. Ideally, to PouchDB, this looks like db-per-user, but behind the scenes it isn’t. If anywhere, this functionality should live in CouchDB and PouchDB need not know about it.

janl commented 7 years ago

One more thing, since you mention scale. Either way, this would mean a continuous _changes stream from client to server for the regular per-user db and one for the tasks database (doesn’t matter if one single, or per-user). Are we prepared to do this?

gr2m commented 7 years ago

What if user jan creates a doc with prefix gregor? Can I start tasks for you? Do I get updates for your task then? Is the task sending me an email of all your user data? We can’t trust the client for this. The server will have to do the prefixing based on the authentication that comes with the username.

Yes I thought about this, all requests would be authenticated with a random token that is generated the first time you create a task, the token is stored in a _local doc in your local task database and the central task database on the server (we will probably need an additional route for it, or require it as an argument on the initial PUT /hoodie/task/api/queue/<id> request). Once you signed up for an account, the auth check gets replaced with using the user session instead. Both would use the Authorization header, but would use different prefixes, e.g. tasktoken abc4567 and session def8910. When a user signs up, we delete the _local doc in both the local task database as well as the big central one on the server.

How important are these features? Should we rather make a strategic decision now to use only one large db for all user data and one for tasks, so we can do all these things. — I’m not necessarily advocating for this, but If we do this, we better do it now.

I think this is very important and we should come up with a decision soon, as the migration might be complex. I don’t mean to push for this to hard, I know this is a security concern.

I would suggest the following: We move on with the task module in its unsecure state. We will not merge it into hoodie. But at least we can see how the higher-level APIs work and feel on the server and the client, we can create a few demos and create tests, maybe both for functionality but also load tests of some kind? This way we could decouple the higher level API work that @inator is moving forward right now and the architectural discussion we have right now. And once we have some tests in place, we could experiment with different approaches for the architecture and see how they compare? I am lacking a lot of insights that you have from all your work on CouchDB, I think it would help me a lot to experiment with actual code, even if it’s for learning only.

inator commented 7 years ago

@inator Thanks for the pointers. I wanted to make clear that I’m okay with having different APIs on client and server, because the environments are different. But I think the same concepts should have the same name on both sides.

E.g. on the client we do task.start() but to listen for a started task on the server we do task.on('add', handler). I’d make them the same. At the moment it looks like the server concepts are named after hoodie.store (or at lest Couch/Pouch document semantics) and the client concepts are named around tasks (not store/documents). There might be a good reason for this, but it might be worth considering mirroring these concepts on both sides. (I hope this wasn’t too obtuse. Let me know if anything here is unclear).

@janl - Sorry for the delayed response. Makes perfect sense! I opened this issue to look at making nomenclature and method names consistent where appropriate.

inator commented 7 years ago

I would suggest the following: We move on with the task module in its unsecure state. We will not merge it into hoodie. But at least we can see how the higher-level APIs work and feel on the server and the client, we can create a few demos and create tests, maybe both for functionality but also load tests of some kind? This way we could decouple the higher level API work that @inator is moving forward right now and the architectural discussion we have right now. And once we have some tests in place, we could experiment with different approaches for the architecture and see how they compare? I am lacking a lot of insights that you have from all your work on CouchDB, I think it would help me a lot to experiment with actual code, even if it’s for learning only.

Heads up - there's been some offline discussion on this, and ultimately @gr2m and I have elected to move forward with this plan:

  1. Move forward with the architecture as is with the single task database on the server and make the different parts work, then...
  2. Focus on the one task database per user model, as @janl prefers that due to security, while...
  3. Continuing to try to find a safe way of having a single task db for all tasks, and...
  4. Ensure that whatever architecture we implement (1 db per user or single tasks db) is ready to be included in Hoodie Core for the Hoodie Village Release

Please be sure to let us know if there's any concerns with the approach (and other tasks related concepts outlined above). Thanks much!

inator commented 7 years ago

Feature list updated:

task-client api methods will include .start(), .on(), .one(), and off() For the time being task-client api methods will NOT include .stop() or .restart() task-client start() will return a Promise with the fully updated taskDoc

Additionally:

On the server, task docs will be stored in the tasks DB and will be filtered by user, or... On the server, task docs will be stored in individual task user DBs

gr2m commented 7 years ago

thanks for keeping us in the loop, Dale!

inator commented 7 years ago

Todos updated

gr2m commented 7 years ago

Hey @inator how are things? Do you have any progress on the task module?