I'm in!
After working with Gregor today to gain a better understanding of the vision for tasks (thanks @gr2m!), here's what I propose as the low-hanging-fruit to-dos to get started:
More to come, please comment on additions/changes to this list and I'll update accordingly. Thanks!
👍 sounds good!
@gr2m you wrote:
> Unfortunately there seems to be a bug in hoodie-standalone-task right now, I couldn’t get it to work myself, but it worked in the past. So the first task will be to get it working again :)
Can you explain the buggy behavior? I'm running it now, and it would be helpful to know what you were experiencing before I dive deeper. So far the PouchDB instance starts, client bundling occurs, and on the surface it seems to be running OK, aside from the 404 errors noted as expected. Granted, I haven't tried to do anything useful with it yet. :)
When I run `npm start` in `hoodie-standalone-task` after installing dependencies, I see

```
$ npm start

> hoodie-standalone-task@ start /private/tmp/hoodie-standalone-task
> bin/server.js

161129/210509.099, [log,couchdb-task], data: Starting PouchDB Server...
161129/210510.078, [log,couchdb-task], data: PouchDB Server ready at localhost:1234/_utils
Server running at: http://gregor-2.local:3000
161129/210517.457, [response], http://gregor-2.local:3000: get / {} 200 (21ms)
161129/210517.480, [response], http://gregor-2.local:3000: get /task.js {} 200 (28ms)
161129/210517.694, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (6ms)
161129/210517.833, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (7ms)
161129/210517.857, [response], http://gregor-2.local:3000: get /favicon.ico {} 404 (2ms)
161129/210518.570, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (2ms)
161129/210518.575, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (3ms)
161129/210519.444, [response], http://gregor-2.local:3000: get /api/queue/ {} 404 (2ms)
161129/210519.450, [response], http://gregor-2.local:3000: put /api/queue/ {} 404 (1ms)
```
the `get`s and `put`s continue; something’s not right there. I tested with a fresh install using Node v6.9.1 and npm 3.10.9.
But note that the setup that spawns `pouchdb-server` in a child process is outdated; we no longer use it. Instead, `@hoodie/task-server` should either proxy to a CouchDB (maybe we can start with that) or use `express-pouchdb` in other cases. Here is how `@hoodie/store-server` does it.

I’d spend a moment trying to fix the setup with `hoodie-standalone-task`, but if it becomes a rabbit hole, I’ll try to do the same thing we do in `@hoodie/store-server` and compare to how https://github.com/hoodiehq/hoodie-store works; it is to the Hoodie Store module what `hoodie-standalone-task` is to the Task module.
An update:
After a thorough review of the hoodie-standalone-task repo and an attempt to repair it, Gregor and I determined that it would take too much effort to bring it up to date and functional. Instead, I've started creating a new standalone "hoodie-task" working example using hoodie-store as the basis.
So far, I've managed to get the REST portion of the task-server working with the endpoints as originally conceived, minus any queue-id filtering, which will be necessary in the end product. Aside from filtering, next up will be the server-side task API, which will surely include support for task events, much like the store API does today. And since this is all based on the store-server module, I believe much of that will already be baked in through the Hoodie Store API. I also plan to leverage the other Store APIs, which should allow for server-side manipulation of the task lifecycle: creation, updates, closing, etc.
All of this will likely make more sense when the code is pushed, but I wanted to get the basic thought process out there for comments sooner rather than later. More to come!
@gr2m (and anyone else that's interested) - please have a look at this approach and let me know if I'm on the right track. Thanks much!
@gr2m - as it relates to the `start()` method here:

```js
queue.start(attributes/*, options */)
```

I'm wondering if we should go with something like this instead:

```js
queue.add(attributes/*, options */)
```

For me personally, I would think of `queue.start()` as a method for starting the entire queue, rather than initiating a single task item.

Your thoughts?
With that ^ in mind, we'd likely want something like the following as well:

```js
queue.remove() // was abort()
queue.readd() // was restart()
```
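To make the comparison concrete, here's a quick sketch of how the renamed methods might read in app code. This is purely illustrative; `hoodie.task` and the attributes are placeholder names, nothing final:

```js
// hypothetical usage of the proposed naming
hoodie.task.add({type: 'sendEmail', to: 'pat@example.com'}) // was .start()
  .then(function (taskDoc) {
    console.log('task completed', taskDoc)
  })

hoodie.task.remove('taskId123') // was .abort(): take the item off the queue
hoodie.task.readd('taskId123')  // was .restart(): queue the item again
```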
I think neither is perfect; both `add` and `start` look like they just add the task, while the method only resolves after the full life cycle has executed. But I think the functionality we will need is clear, and I would suggest focusing on that first, to get the task module itself working. Once we have everything in place, we should have an extended discussion about the naming of the methods.
While working on the task modules, there have been some design concepts, some old and some new, discussed between Gregor and me that we feel should be documented and open for discussion while development is underway. Here's the (work-in-progress) list of items that collectively describe the new tasks approach. I'll do my best to update it as discussions occur in the various channels and as development progresses. Please reach out either here or in the Slack #Dev channel if any of you have questions, ideas, or comments. Thanks much!
- Tasks will be implemented as two modules: `task-client` and `task-server`
- `task-server` provides a task management server API and defines routes for sync
- `task-client` provides a client API for the task queue, using PouchDB for sync
- `attributes` are defined by the client
- `task-client` API methods will include `.start()`, `.on()`, `.one()`, and `.off()`
- `task-client` API methods will NOT include `.stop()` or `.restart()`
- `task-client` `start()` will return a Promise with the fully updated `taskDoc` (see the sketch after this list)
- `task-server` will emit `add`, `update`, `remove`, and `change` task events
- `task-client` will emit events upon `start`, `success`, `error`, and (new) `progress`
- the `taskDoc.progress` array property will track the entire lifecycle of the task, through to `taskDoc.success` or `taskDoc.error`
- a `.progress()` `task-server` method will allow for status updates to the client
- the `.success()`, `.error()`, `.on()`, `.one()`, and `.off()` `task-server` methods will remain
- `success`, and now also `error`, will result in `taskDoc` removal, communicated to `task-client` via a `taskStore` change event
- `task-server` is dependent on `store-server-api`, wrapping many of the same API methods
- `task-client` is dependent on `store-client`, providing the sync with the tasks DB

Additionally:

- On the server, task docs will be stored in the tasks DB and will be filtered by user, or...
- On the server, task docs will be stored in individual task user DBs

(more to come - including links to relevant issues/materials - this is a quick list off the top of my head)
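To give a feel for the client side of this list, here's a minimal usage sketch. `hoodie.task` and the task attributes are assumed names for illustration, and the event payloads are guesses rather than settled API:

```js
// start a task; per the list above, the promise resolves only after
// the full lifecycle has completed on the server
hoodie.task.start({type: 'sendEmail', to: 'pat@example.com'})
  .then(function (taskDoc) {
    console.log('succeeded:', taskDoc.success)
  })
  .catch(function (error) {
    console.log('failed:', error)
  })

// subscribe to lifecycle events
function onProgress (taskDoc) {
  // taskDoc.progress tracks the entire lifecycle as an array
  console.log(taskDoc.progress[taskDoc.progress.length - 1])
}
hoodie.task.on('progress', onProgress)
hoodie.task.one('success', function (taskDoc) { /* fires once */ })
hoodie.task.off('progress', onProgress)
```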
> On the server, task docs will be stored in the tasks db and will be filtered by user

Meaning using a CouchDB replication filter to only sync task docs to a client that belong to a specific user?

There are two problems with this:

1. The `store-server` module only does per-database authentication and authorisation.
2. Per-doc authz inside `task-server` is not advisable, as you need to make absolutely sure you don’t accidentally leak any information from other users, and I wouldn't trust myself to get that right.

The only reliable way to do this is a task db per user. Would that be an option?

> task-server will emit add, update, remove, and change task events
> task-client will emit events upon start, success, error and (new) progress

Why different APIs on client and server? Could we get a full list of APIs on both sides, including methods and the events they issue?
Maybe something like this:

```
client.task.start(attributes)
-> server.on('started', function (attributes) {})
```
Sorry if that exists elsewhere, pointers appreciated ;)
@janl - thanks for the questions!
> Meaning using a CouchDB replication filter to only sync tasks docs to a client that belong to a specific user?
I'm admittedly still coming up to speed on how we might accomplish the filtering, so I'll defer to @gr2m to explain the conceptual approach, but I believe the intent is to handle that on the db side of the proxied routes, to ensure there's no leakage. As you can see, that hasn't been determined yet, so your feedback is important here.
> Why different APIs on client and server? Could we get a full list of APIs on both sides, including methods and the events they issue?
The (limited) documentation on the `task-server` API can be found here, and the (slightly outdated) `task-client` API can be found here.
Valid points! We'll definitely have to give some thought to what you're suggesting.
> with a large number of users, this becomes a very slow operation. I’d say for now we can get away with 1k-10k users, but beyond that I’d be careful.
I had this discussion with @glynnbird, who is working on envoy, a solution that emulates a-database-per-user while still storing all data in a single database.

What we were thinking is to let only the initial bootstrapping requests through. Once they get to "since=now", we would handle them in the Node server. That way we would have a single continuous /_changes request to the database, and then use events in the Node server to respond to the continuous changes requests from the clients. This must be well tested and secure, of course, but it certainly should scale better than having thousands of separate databases with open changes requests, I think? And the other problem with separate databases is that we won’t be able to query data across them easily, which is more important for tasks than for user data.
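To illustrate the fan-out, a rough sketch assuming a PouchDB handle on the tasks database; `serveContinuousChanges` and the `write` callback are placeholders for however the HTTP response would actually be streamed:

```js
const EventEmitter = require('events')
const PouchDB = require('pouchdb')

const tasksDb = new PouchDB('tasks')
const changesBus = new EventEmitter()

// one single continuous _changes request to the database ...
tasksDb.changes({since: 'now', live: true, include_docs: true})
  .on('change', function (change) {
    changesBus.emit('change', change)
  })

// ... fanned out to each client's continuous changes request.
// `write` stands in for writing a line to that client's response stream.
function serveContinuousChanges (userId, write) {
  function onChange (change) {
    // '<user id>/' as the doc _id prefix is an assumption
    if (change.id.indexOf(userId + '/') === 0) {
      write(JSON.stringify(change) + '\n')
    }
  }
  changesBus.on('change', onChange)
  return function stop () {
    changesBus.removeListener('change', onChange)
  }
}
```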
> if a user has access to one doc in a db on a server, they have access to all other ones

Users won’t have access to the database at all. They can’t do `GET /hoodie/store/api/tasks` to access the tasks database; it’s accessible to admins only. We will only expose routes required for replication on the task module. And filtering should be fairly simple and secure, as we will use `_id` prefixes.
> And the other problem with separate databases is that we won’t be able to query data across them easily, which is more important for tasks than for user data.
How is this more important for tasks? What use-cases or implementation details are you considering? What would be the problem with a task db per user?
One concern is listening to database changes, and that’s easier with one database vs. one per user. With CouchDB 2.0, we can already build an efficient “listen to all databases’ _changes feeds” fairly easily. For CouchDB 1.x it’d be semi-efficient, but we have had this code in Hoodie before.

I’d say the “query across dbs” is a very different concern from what we are trying to do here, so I’d like to suggest keeping that a separate issue as long as it isn’t required for implementing a feature. E.g. later, we might consider this so important that we put all users’ dbs behind a single envoy database.
FWIW, I haven’t had time to look into envoy thoroughly, but from a glance (and from the README), this is in very early stages and I wouldn’t bet a major Hoodie feature on this just yet. If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.
> if a user has access to one doc in a db on a server, they have access to all other ones
>
> Users won’t have access to the database at all.

Then I don’t understand how you want to do filtered replication; it means accessing the db. Or do you mean tasks should work online only?
> The (limited) documentation on the task-server api can be found here, and the (slightly outdated) task-client api can be found here.
@inator Thanks for the pointers. I wanted to make clear that I’m okay with having different APIs on client and server, because the environments are different. But I think the same concepts should have the same name on both sides.
E.g. on the client we do `task.start()`, but to listen for a started task on the server we do `task.on('add', handler)`. I’d make them the same. At the moment it looks like the server concepts are named after `hoodie.store` (or at least Couch/Pouch document semantics) and the client concepts are named around tasks (not store/documents). There might be a good reason for this, but it might be worth considering mirroring these concepts on both sides. (I hope this wasn’t too obtuse. Let me know if anything here is unclear.)
@janl don’t bother looking at all of envoy, I don’t think that more than 10 lines will be relevant for our case. In `POST /hoodie/task/api/queue/<id>/_bulk_docs` we check if all docs are prefixed with `<id>`. In `GET /hoodie/task/api/queue/<id>/_changes` we only return documents that are prefixed with `<id>`. I think that’s very trivial; I don’t see the complexity in this?
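For what it’s worth, here's a minimal sketch of those two checks, assuming a PouchDB handle on the central tasks database. The function names and the `/` separator in the prefix are assumptions for illustration, not actual Hoodie code:

```js
const PouchDB = require('pouchdb')
const tasksDb = new PouchDB('tasks')

// POST /hoodie/task/api/queue/<id>/_bulk_docs:
// reject the write unless every doc carries the queue's prefix
function bulkDocs (queueId, docs) {
  const allPrefixed = docs.every(function (doc) {
    // '<id>/' as the prefix format is an assumption
    return typeof doc._id === 'string' && doc._id.indexOf(queueId + '/') === 0
  })
  if (!allPrefixed) {
    return Promise.reject(new Error('All docs must be prefixed with ' + queueId))
  }
  return tasksDb.bulkDocs(docs, {new_edits: false})
}

// GET /hoodie/task/api/queue/<id>/_changes:
// only return changes for docs that belong to this queue
function changes (queueId, since) {
  return tasksDb.changes({since: since || 0, include_docs: true})
    .then(function (response) {
      response.results = response.results.filter(function (change) {
        return change.id.indexOf(queueId + '/') === 0
      })
      return response
    })
}
```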
I would also separate the discussion on tasks and user databases. Tasks will be simpler: we only expose replication APIs and don’t even need to add/remove `_id` prefixes on-the-fly, we can store documents with prefixes on the client already.
> How is this more important for tasks? What use-cases or implementation details are you considering? What would be the problem with a task db per user?
In the admin dashboard, I want to see all tasks that failed across all user accounts with their error messages. And I want to query all tasks that succeeded. I don’t know how to make this possible with a database per user.
> If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.
It must work with PouchDB adapters for persistence, too, so we will have to implement it within Hoodie either way
> In the admin dashboard, I want to see all tasks that failed across all user accounts with their error messages. And I want to query all tasks that succeeded. I don’t know how to make this possible with a database per user.
`hoodie-task-server` could write this information into a global task database that is only available to admins, in its `on('success')` and `on('error')` handlers for example. Or if we don’t want that code complexity, we replicate all user-task dbs into an admin-only global task db and do the queries there. Or even simpler: we could write the errors/successes into a log file and display that in the admin interface.
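A sketch of that first option, assuming the server-side task events from the feature list above; `taskServer` and `adminTasksDb` are placeholder names:

```js
// mirror task outcomes into an admin-only database so they can be
// queried across all users (hypothetical wiring, not actual Hoodie code)
function mirrorOutcomes (taskServer, adminTasksDb) {
  ['success', 'error'].forEach(function (eventName) {
    taskServer.on(eventName, function (taskDoc) {
      const copy = Object.assign({}, taskDoc, {outcome: eventName})
      delete copy._rev // let the admin db keep its own revision history
      // naive write: conflicts are only logged in this sketch
      adminTasksDb.put(copy).catch(console.error)
    })
  })
}
```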
> I would also separate the discussion on tasks and user databases.
My question was more meant like this: how important is it to see globally which tasks failed/succeeded, as opposed to e.g. only failures per user, or a log file of errors? And then: there are many features that come to mind that would require a query across all user dbs. How important are those features? Should we rather make a strategic decision now to use only one large db for all user data and one for tasks, so we can do all these things? I’m not necessarily advocating for this, but if we do this, we better do it now.
> I think that’s very trivial, I don’t see the complexity in this?
Those are not all the APIs you need for replication. The complexity of proxying every request required for replication, and of making them safe so we can guarantee that no data is accidentally leaked, is way too high for us at this point.
We’ve talked about all this when moving the authentication to the Hoodie layer, and the only reason I was, in the end, happy to go with it is because we do the auth business in the HTTP headers, and it doesn’t matter what kind of HTTP requests are made.
There are two options here that are safe:
> Tasks will be simpler, we only expose replication APIs
The statement that makes me nervous is “only the replication API”. There is no such thing. CouchDB replication just uses the APIs that regular end-users can use. There is no “we allow replication, but not regular access to the database”. That means we have to secure every single request to CouchDB, and as stated above, I believe the complexity for this is too high for Hoodie at this point in general. And there are alternatives for solving the required backend features.
> and don’t even need to add/remove _id prefixes on-the-fly, we can store documents with prefixes on the client already.
What if user `jan` creates a doc with prefix `gregor`? Can I start tasks for you? Do I get updates for your task then? Is the task sending me an email of all your user data? We can’t trust the client for this. The server will have to do the prefixing based on the authentication that comes with the username.
> If this proves useful, this could relatively easily be added to CouchDB proper, and then we can just use it from there.
>
> It must work with PouchDB adapters for persistence, too, so we will have to implement it within Hoodie either way
I meant Envoy. Ideally, to PouchDB, this looks like db-per-user, but behind the scenes it isn’t. If anywhere, this functionality should live in CouchDB and PouchDB need not know about it.
One more thing, since you mention scale. Either way, this would mean a continuous _changes stream from client to server for the regular per-user db and one for the tasks database (doesn’t matter if one single, or per-user). Are we prepared to do this?
> What if user jan creates a doc with prefix gregor? Can I start tasks for you? Do I get updates for your task then? Is the task sending me an email of all your user data? We can’t trust the client for this. The server will have to do the prefixing based on the authentication that comes with the username.
Yes, I thought about this: all requests would be authenticated with a random token that is generated the first time you create a task. The token is stored in a `_local` doc in your local task database and in the central task database on the server (we will probably need an additional route for it, or require it as an argument on the initial `PUT /hoodie/task/api/queue/<id>` request). Once you’ve signed up for an account, the auth check gets replaced with using the user session instead. Both would use the `Authorization` header, but with different prefixes, e.g. `tasktoken abc4567` and `session def8910`. When a user signs up, we delete the `_local` doc in both the local task database as well as the big central one on the server.
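A sketch of that header check, with hypothetical lookup helpers; the scheme names follow the prefixes above, everything else is illustrative:

```js
// Anonymous clients send "Authorization: tasktoken <token>", signed-up
// users send "Authorization: session <sessionId>".
function authenticate (authorizationHeader) {
  const parts = (authorizationHeader || '').split(' ')
  const scheme = parts[0]
  const credentials = parts[1]

  if (scheme === 'tasktoken') {
    // token generated on first task, stored in a _local doc on both sides
    return findQueueByTaskToken(credentials) // hypothetical lookup
  }
  if (scheme === 'session') {
    // after sign-up, the user session replaces the task token
    return findQueueBySession(credentials) // hypothetical lookup
  }
  return Promise.reject(new Error('Unauthorized'))
}
```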
> How important are these features? Should we rather make a strategic decision now to use only one large db for all user data and one for tasks, so we can do all these things? I’m not necessarily advocating for this, but if we do this, we better do it now.
I think this is very important and we should come up with a decision soon, as the migration might be complex. I don’t mean to push for this too hard; I know this is a security concern.
I would suggest the following: we move on with the task module in its insecure state. We will not merge it into `hoodie`. But at least we can see how the higher-level APIs work and feel on the server and the client; we can create a few demos and create tests, maybe both for functionality but also load tests of some kind. This way we could decouple the higher-level API work that @inator is moving forward right now from the architectural discussion we are having right now. And once we have some tests in place, we could experiment with different approaches for the architecture and see how they compare. I am lacking a lot of the insights that you have from all your work on CouchDB; I think it would help me a lot to experiment with actual code, even if it’s for learning only.
> @inator Thanks for the pointers. I wanted to make clear that I’m okay with having different APIs on client and server, because the environments are different. But I think the same concepts should have the same name on both sides.
>
> E.g. on the client we do `task.start()` but to listen for a started task on the server we do `task.on('add', handler)`. I’d make them the same. At the moment it looks like the server concepts are named after `hoodie.store` (or at least Couch/Pouch document semantics) and the client concepts are named around tasks (not store/documents). There might be a good reason for this, but it might be worth considering mirroring these concepts on both sides. (I hope this wasn’t too obtuse. Let me know if anything here is unclear.)
@janl - Sorry for the delayed response. Makes perfect sense! I opened this issue to look at making nomenclature and method names consistent where appropriate.
> I would suggest the following: we move on with the task module in its insecure state. We will not merge it into `hoodie`. But at least we can see how the higher-level APIs work and feel on the server and the client; we can create a few demos and create tests, maybe both for functionality but also load tests of some kind. This way we could decouple the higher-level API work that @inator is moving forward right now from the architectural discussion we are having right now. And once we have some tests in place, we could experiment with different approaches for the architecture and see how they compare. I am lacking a lot of the insights that you have from all your work on CouchDB; I think it would help me a lot to experiment with actual code, even if it’s for learning only.
Heads up - there's been some offline discussion on this, and ultimately @gr2m and I have elected to move forward with this plan:
Please be sure to let us know if there's any concerns with the approach (and other tasks related concepts outlined above). Thanks much!
Feature list updated:
- `task-client` API methods will include `.start()`, `.on()`, `.one()`, and `.off()`
- For the time being, `task-client` API methods will NOT include `.stop()` or `.restart()`
- `task-client` `start()` will return a Promise with the fully updated `taskDoc`
Additionally:
- On the server, task docs will be stored in the tasks DB and will be filtered by user, or...
- On the server, task docs will be stored in individual task user DBs
thanks for keeping us in the loop, Dale!
Hey @inator, how are things? Have you made any progress on the task module?
This is a high-level issue for a complicated task, meant for discussion and to find a "champion" who wants to take ownership of bringing background tasks back to Hoodie. Comment below if you are interested in helping us :)
❓ The Motivation
Background tasks can be used to trigger server-side logic provided by plugins. Tasks are simply JSON documents, just like data. And just as with data, we have offline sync, so users can start tasks even without internet connectivity, which is more robust than relying on REST APIs.
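For illustration, a single task document might look something like this; every field other than `_id` is hypothetical, just to show that a task is plain data:

```js
// hypothetical task document, field names are illustrative only
const exampleTask = {
  _id: 'queue123/task/5bf0ea82', // assumed '<queue id>/' prefix scheme
  type: 'sendEmail',             // which server-side logic to trigger
  to: 'pat@example.com',         // task-specific attributes set by the client
  progress: []                   // lifecycle updates appended over time
}
```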
🤔 What you will need to know to work on this issue
Node.js, hapi, PouchDB and you should have a general understanding of Hoodie’s sync architecture.
We had background tasks implemented in the "old" Hoodie: tasks were simply stored in the user's database and synchronized to the user’s database in the backend. We eventually found out that this was a bad design decision, as we want to use tasks even if the user did not yet sign up, and listening to all users’ databases has a very bad performance impact. So for the new Hoodie, we decided to go with a dedicated "tasks" database instead.
We already have @hoodie/task-client, which is very little code; it is using @hoodie/store-client under the hood. And we have @hoodie/task-server, which exposes only the CouchDB APIs relevant for sync. The idea is that the server filters `/_changes` so that a user can only see the tasks belonging to them, but that is not yet implemented. I would recommend comparing to hoodie-store-server’s implementation and starting from there. Note that `hoodie-store-server` also abstracts away whether the app is using a CouchDB as its backend, and if not, falls back to PouchDB. We need the same for the task server module.

And lastly there is hoodie-standalone-task, a module we created to test the client & server together. I would recommend using it for local development and testing. Once we’ve got it all working in `hoodie-standalone-task`, we can add the server & client code to `hoodie` itself (in `@hoodie/client` and `@hoodie/server` respectively). `hoodie-standalone-task` is only a temporary repository, so whoever is championing this, I’m happy to give you full admin rights so you can just merge your own pull requests yourself as you need. For example, I think it would make sense to extend `demo/index.html` to have some UI that might help with debugging, like logging events into a table. Unfortunately there seems to be a bug in `hoodie-standalone-task` right now; I couldn’t get it to work myself, but it worked in the past. So the first task will be to get it working again :)