osmlab / to-fix-backend

The to-fix server
BSD 3-Clause "New" or "Revised" License

Rewrite for postgis #113

Closed mcwhittemore closed 7 years ago

mcwhittemore commented 7 years ago

Per a bunch of conversations that we've been having at Mapbox, this PR represents the first pass at a rewrite to move this project onto PostGIS.

To keep the scope for this rewrite at a minimum, this PR is targeting these endpoints.

Along with moving to PostGIS, this PR moves the backend to express with documentation done by docbox. This is so it matches other APIs worked on by members of this org and so we can publish clean looking docs that support prose.

Things left to ship

Next steps

These should be considered after the API is up and running.

Higher level thinking

(bullet points are broken into longer explanations below)

Empower automated creation and updating of task items

One of the main drivers for rethinking to-fix (at least for me) has been considering how to let robots start creating and updating items.

This requires an api that:

  1. has a clear contract (mostly via easy to read and find api docs)
  2. is always up and available (robots tend to throw things away rather than come back later)
  3. provides status feedback on item updating

Of these things, I think the first two have been decently handled so far by this refactor. That said, the third one still needs some work. I mentioned concepts related to this in Next Steps above. But the main question to answer is: how does a robot that creates a lot of false positives learn that it is doing so? I don’t really have a solution for this yet. Currently, I think we need this version of the API in production before we can answer that, as the first two points will help to-fix robot writers get to work.

Drops type from the URL and the api

In the current version of the API, the URL for listing items in a task looks like /tasks/:id/:type/items where id is the id of the task and type is the upload number of the task.

From talking with @Rub21, I’m pretty sure this concept was added to to-fix to help manage the presence of duplicate bugs between uploads of items to the task. Because the new version of the api enforces a unique id per item inside of a single task, I think we can get away with dropping this concept for now.

type also seems to have been used to track how much work was done for a particular upload of items to the task. Currently, the new api does not present a solution for this as stats are only tracked for the whole task per day. I’m not sure if this is still something we need to track, but as we move forward and we consider letting an item creator reopen items, we’ll need to think about this more.

Moves locking to the api consumer

The current version of the api auto locks a random item for a task when the user requests one. This puts a lot of responsibility on the API that can be handled by the application using the api, and it limits the different ways of opting into work you want to do.

With the new api, locking is persisted and enforced at the API layer, but applications have to take the locking action. to-fix can achieve the same functionality as the current api by doing a list request that only returns unlocked items, randomly selecting one of them, and locking that item.

Only tracks stats by task by day

Part of dropping the type affects how we track stats. Looking at the dashboards in use in to-fix, I think we can get away with only tracking stats per user, per task, per day. For any rollup stats, the application can request all of the stats and compute the rollup client side.

We might want to track the number of total items in the future, but I think this can wait until we’re running well.
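Since rollups move client side, the computation is just a reduce over the stat rows. A minimal sketch, assuming a row shape of `{ user, task, day, fixed, noterror, skip }` (the real schema may differ):

```javascript
// Roll up per-user, per-task, per-day stat rows into task-wide totals.
// The row shape { user, task, day, fixed, noterror, skip } is assumed
// for illustration and may not match the real schema.
function rollupStats(rows) {
  return rows.reduce(function(totals, row) {
    totals.fixed += row.fixed || 0;
    totals.noterror += row.noterror || 0;
    totals.skip += row.skip || 0;
    return totals;
  }, { fixed: 0, noterror: 0, skip: 0 });
}
```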

Only add one item to a task at a time

The past version of the API handled uploading large GeoJSON files and it looks like this was likely the major way users added items to to-fix. This version of the API drops that kind of support and I recommend that we stay away from it. Sending large files to the server can create unexpected load problems and can result in a confusing database state if the upload fails partway through processing.

This same behavior can be done by parsing the GeoJSON file in the to-fix front end and sending each item to to-fix-backend one at a time. That said, if lots of items are going to be added to the task at the same time, it's possible that our stats tracking could get screwed up so we need to write good tests around this. As of right now, new items won’t affect stats as we don’t track task wide stats, but items that are being updated might result in a weird state and we need to look into this.
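A sketch of what that client-side loop could look like, where `putItem` is a hypothetical Promise-returning helper that PUTs a single item (this is not frontend code that exists yet):

```javascript
// Client-side "bulk" upload sketch: send items from a parsed GeoJSON file
// one at a time, in order, so server load stays flat and a failure leaves
// an obvious stopping point. putItem is a hypothetical Promise-returning
// helper that PUTs a single item.
function uploadFeatures(featureCollection, putItem) {
  return featureCollection.features.reduce(function(chain, feature) {
    return chain.then(function(count) {
      return putItem(feature).then(function() { return count + 1; });
    });
  }, Promise.resolve(0)); // resolves with the number of items uploaded
}
```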

cc @Rub21 @kepta @batpad

kepta commented 7 years ago

@mcwhittemore this looks really great and can't wait to see this in production.

I had a bunch of questions regarding the PR which, if answered, would give me a better perspective (some of them are low priority / trivial and can be ignored)

  1. Architecture The current implementation is focused on reducing the to-fix magic and giving more power to the client. Do we want this tradeoff?

  2. Clarity A step by step overview of the APIs to call for a regular usage of to-fix.

    • Create a task
    • Share the url with team
    • People start flagging [skip, error, etc]
    • Task is finished
  3. Scenario: If an item is locked by user X and Y asks for that item, they get an error. How should the client handle this (timer, exponential backoff, throw an error, ...)?

  4. Clarity Correct me if I am wrong (@batpad @Rub21). Each item appears to have a featureCollection instead of a feature. Was this added to make no assumptions regarding the data type and handle future extensibility?

  5. Clarity I am not sold on the idea of using put-item.js for multiple things. I guess @mcwhittemore wants to move them to separate functions in future. How would that work?

  6. Misc node v4 is not supported as per the docs, can we remove that from CI? @batpad took care of this.

  7. Clarity (doubtful/low possibility) race conditions, is it possible while the node server is processing an item update request another request does it faster. In short, how does it handle the interleaving of db writes.

  8. Clarity If the lockedTill is expired (the current timing is 15minutes), how does the server handle it? I guess the server doesn't do anything until someone else requests for that particular item.

  9. Misc I guess it's possible for a user A to change the status of user B's active locked task.

  10. Misc I'm a little uneasy with the multiple if/else funnelling in server.js and put-item.js. Is it possible to consolidate them into one place or make them easier to grasp?

  11. Architecture Can we have a way to get the next available item? Currently, I see the client has to figure out the available item and request the server for a lock. This may or may not succeed. It would be great if the server could find the available item, lock it on behalf of the user and send it to the user. To put it in other words, I wanted to know how the current approach benefits when compared with the old backend.

batpad commented 7 years ago

@kepta

The current implementation is focused on reducing the to-fix magic and giving more power to the client. Do we want this tradeoff?

In general as a philosophical question, I am in favour of this approach as much as possible. There are some specificities here like how do we handle race conditions between clients (time between fetching an item and locking it for eg.), which we should tackle, but I am 👍 on the broad approach being for the backend to do as little magic as possible and let the frontend take these decisions.

Clarity A step by step overview of the APIs to call for a regular usage of to-fix.

This would be really useful - how are you imagining this right now, @mcwhittemore? -- currently the fact that the frontend cannot just request a "random unlocked, unfinished item" seems slightly problematic to me -- pulling down all available items every time the user clicks "Next" seems non-ideal, especially for large tasks.

Clarity I am not sold on the idea of using put-item.js for multiple things. I guess @mcwhittemore wants to move them to separate functions in future. How would that work?

Is there a good reason we don't want to have two separate methods here - POST to create new items and PUT to edit existing items? My guess is because we don't want the client to have to know whether the item already exists, and have the backend handle de-duping? In this case, it seems fine to just think of logically separating this into files / modules as per the OP.

@mcwhittemore

Only add one item to a task at a time

I'm fine with this, but then I think we should remove the option to upload GeoJSON from the frontend for now, and just create a simple CLI tool to handle the parallel requests, etc.

Moves locking to the api consumer

This sounds fine, but I worry about race conditions - where two different clients fetch an item with the assumption that it is unlocked, but only one actually succeeds in locking it.

In general, this looks great @mcwhittemore - thanks so much for putting this together -- we've started working through the Things left to ship issues you've identified, though it would be great to get clarity on some of the issues above as we think about the modifications needed on the frontend as well.

mcwhittemore commented 7 years ago

@kepta, these are great questions! Thanks. Thinking through them has helped me process what I’ve been thinking about this a bit more.

We haven’t had the chance to work together that much so I opted to be verbose in my explanations. Some people might see that as me saying this is TRUTH. Please read it as me being sure about what I’m saying but more than willing to admit to mistakes as they are pointed out. I’m sure there are mistakes and bad assumptions below.

I’ve given each of your questions its own header. @batpad asked one question that didn’t overlap with @kepta’s or come out of answering @kepta’s. I’ve called it out by addressing @batpad in the section's header.

Once again, thanks for the detailed questions. These were great to think about. :)

A step by step overview of the APIs to call for a regular usage of to-fix

Here is how I see this working. I think this is pretty close to how the current flow works with slight changes at steps 2/3 and 6/7.

  1. PUT /tasks/:task to create a new task
  2. PUT /tasks/:task/items/:item to create a new item
  3. Repeat 2 until all items have been created
  4. Share task URL with team
  5. GET /tasks/:task/items?lock=unlocked to list all unlocked items
  6. in client var item = unlockedItems[Math.floor(Math.random()*unlockedItems.length)] to pick a random item
  7. PUT /tasks/:task/items/:item { lock: 'locked' } to lock the item
  8. PUT /tasks/:task/items/:item { status: 'fixed' } to set the status to fixed and unlock the item (I’m not sure it should do the unlocking, but it does currently).
  9. Repeat from 5.
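Steps 5 through 9 could be sketched in client code like this, assuming a hypothetical `api(method, path, body)` helper that returns a Promise for the parsed JSON response:

```javascript
// Steps 5-9 above as a client loop. api(method, path, body) is a
// hypothetical helper returning a Promise for the parsed JSON response.
function workLoop(api, task) {
  return api('GET', '/tasks/' + task + '/items?lock=unlocked').then(function(items) {
    if (items.length === 0) return null; // nothing left to work on
    var item = items[Math.floor(Math.random() * items.length)]; // step 6
    return api('PUT', '/tasks/' + task + '/items/' + item.id, { lock: 'locked' }) // step 7
      .then(function() {
        // ...the user fixes the item here...
        return api('PUT', '/tasks/' + task + '/items/' + item.id, { status: 'fixed' }); // step 8
      })
      .then(function() { return workLoop(api, task); }); // step 9: repeat from 5
  });
}
```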

If an item is locked by user X and when Y asks for that item

What happens here depends on what the user Y is asking about the item.

If Y is doing GET /tasks/:task/items/:item then they will get a 200 and the JSON describing the item including details that explain that user X has a lock on it.

If Y is doing a PUT /tasks/:task/items/:item where the body looks like { lock: 'unlocked' } or { status: 'any valid value' } they will get a 423 error.

If Y is doing a PUT /tasks/:task/items/:item where the body modifies some other part of the task the update will work.

In the locking case, the client should do one of two things.

  1. If Y explicitly asked for a lock on the said item, the client should tell Y that the item is locked. In most situations, the client should be able to prevent this from happening as the item tells you it's locked.
  2. If Y asked for the next item and a race condition resulted in this error, then the next-item code should see this and try locking another one. In this case, listing the unlocked items won’t include the item X locked anymore.

In the status case, the client should inform Y that they don’t have a lock. This is most likely to happen if Y had the lock but the lock expired, not if X has the lock as the client should see that Y doesn’t have a lock and not even make the status update request.
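The decision the server makes here can be sketched as a pure function (names are illustrative, not the actual handler code):

```javascript
// Which HTTP status should a PUT on an item produce? lockedBy is the user
// holding the lock (or null), user is the requester, body is the request
// body. Illustrative only; the real logic lives in put-item.js.
function lockResponse(lockedBy, user, body) {
  var touchesLockOrStatus = 'lock' in body || 'status' in body;
  if (lockedBy && lockedBy !== user && touchesLockOrStatus) return 423; // Locked
  return 200; // other attribute updates, or the lock holder acting, go through
}
```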

Was this added to make no assumptions regarding the data type and handle future extensibility?

In the ElasticSearch data model, the item is a GeoJSON Feature with rules around what different feature properties do. One of these properties was relations and was an array of GeoJSON Features. In talking with @batpad and @Rub21 two weeks ago, we decided that this same idea could be represented via a FeatureCollection where the properties of the features followed a defined schema. This still needs to be defined, but from the ElasticSearch model we know that there are at least two types of Features in to-fix: an item and its relations. I’d recommend having a different name than item for the type in the feature collection though, to avoid confusion with the item itself.

I am not sold on the idea of using put-item.js

Yea. This is an artifact of writing code to get tests to pass. I learn largely by doing. As you mentioned, I think this should be broken up into 3 files: one for lock management, one for status management, one for other kinds of updates. I’d then have the endpoint definition in server.js infer from the request which of the three actions is being taken.

Also, before yesterday, update-status.js was in this file too. :man_facepalming:. This is the bulk of this application and thinking about how to manage it well is key to make sure this code is easy to read and easy to debug. I don’t think we’re there yet.

Is it possible while the node server is processing an item update request another request does it faster

Yep :(.

It's possible that user X and Y can both make an API request to lock an item. The server would get the item for both of them, both of them would see there is no lock, both would update the item with a lock. Both would update their stats to say they are working on the item.

Assuming Y’s request gets to the database last, Y would have the lock. Now when X goes to update the status, their request would fail because they don’t have the lock. This is generally OK, a bit bad for the user's experience, but the data is in a good state. What is NOT in a good state is X’s stats. They say X has an item in flight that he does not.

I suspect that this is OK for the first pass.

To fix this we need to write updates that check the updatedAt field is the same as what was returned in the READ. I’m not 100% sure how to do this with Sequelize, but I’m sure it's possible.
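The shape of that check, sketched in memory rather than in Sequelize (the exact Sequelize API for a conditional update is left out, since I'm not sure of it either):

```javascript
// Compare-and-swap sketch: the write only applies if updatedAt still matches
// the value the caller read. In SQL this would be a WHERE clause on the
// UPDATE; this in-memory version just shows the guard.
function guardedUpdate(item, seenUpdatedAt, changes) {
  if (item.updatedAt !== seenUpdatedAt) return false; // someone wrote first
  Object.keys(changes).forEach(function(key) { item[key] = changes[key]; });
  item.updatedAt = Date.now();
  return true;
}
```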

We can also look into using transactions so that Postgres will undo writes if a later write fails.

If the lockedTill is expired, how does the server handle it?

I’m not 100% sure what this is asking. Locks do affect user stats so, until the lock is released, the user who locked the item will still appear to have a task under way.

If this is a problem, we can look into having a scheduled job that flushes these locks.

Does that answer your question @kepta?

I guess it's possible for a user A to change the status of user B's active locked task.

This is not possible. You must have a lock to change the status. This is a bit unclear in the code right now because put-item is doing everything, but there are multiple tests asserting that this is not possible. If there is a case you are unsure about, please add the test. The more tests we have for these kinds of edge cases the better.

Is it possible to consolidate the if/else statements in server.js and put-item.js into one place or make them easier to grasp?

Having good rules for what goes in the endpoint handler and what goes into the data model helper is hard but very important. We want to make sure that the HTTP contract is as removed from the data model layer as possible and that the data model layer is as removed from the HTTP contract as possible.

For most of the endpoints, I haven’t broken out the data model from the HTTP contract mostly because they are all very very simple right now. That said, this will change. List items, for example, needs logic for filtering the items.

Generally speaking, I try to validate the request in the endpoint handler. Right now this validation doesn’t cover optional values that affect the 4 different actions this handler covers.

If we broke the create vs update check out of put-item and into a get-item function, you’d be able to know which of the four actions was being taken at the endpoint handler level and do all the validation there.

Here is how I see that working.

  1. Make sure there are no unexpected attributes
  2. Make sure there is at least one valid attribute
  3. Get the item
  4. If the item doesn’t exist, check that lock and status aren’t present, check the required create attributes are present, validate all attributes and CREATE.
  5. If the item does exist, and there is a lock attribute, validate that only the lock attribute is present and UPDATE.
  6. If the item does exist, and there is a status attribute, validate that only the status attribute is present and UPDATE.
  7. If the item does exist, and lock and status are not present, validate the other attributes and UPDATE.
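Those checks boil down to a small dispatch function; something like this sketch (names are illustrative):

```javascript
// Classify which of the four PUT /tasks/:task/items/:item actions a request
// is, given whether the item already exists and the request body. The
// endpoint handler could then validate per action. Names are illustrative.
function classifyPutItem(existingItem, body) {
  if (!existingItem) {
    // a brand new item can't arrive locked or with a status
    if ('lock' in body || 'status' in body) return 'invalid';
    return 'create';
  }
  if ('lock' in body) return 'update-lock';
  if ('status' in body) return 'update-status';
  return 'update-other';
}
```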

Can we have a way to get the next available item?

Generally speaking, it's bad practice to have an endpoint that doesn’t result in the same outcome when given the same input. React and Flux did a really good job of promoting this in frontend JavaScript. Just like React, most of API design is about state management, and introducing unpredictability into your state management makes it hard to test and, worse, hard for end users to reason about.

That said, there are always exceptions to these kinds of rules. So why don’t I think this is a good case? Here are four reasons. I’m not sure all of these are great, but I’ll walk through them.

  1. One way to lock and unlock an item
  2. Avoid application specific logic
  3. Move complexity to the place it's easiest to fix bugs
  4. Avoid patch-fixing race condition bugs

One way to lock and unlock an item

In the ElasticSearch model the only way to lock an item was via the random item endpoint, but as we start to think about having robots create and update items, it's very possible that we'll want them to lock an item as well. Robots need to explicitly address an item, so either we need two ways to lock an item or we need to match the robots' needs.

Avoid application specific logic

This is really application logic. One goal of this change is to treat the to-fix-backend as a general API for OSM task management rather than an application api for to-fix-frontend.

Move complexity to the place it's easiest to fix bugs

We could solve most of the code overhead problems caused by having two ways to lock/unlock via abstraction, but in doing so we add complexity in one code base to avoid complexity in another. This is a net-zero gain, so we should ask where it is easiest for people to fix bugs. From looking at the frontend vs backend projects of to-fix, it looks like we can iterate on the frontend faster, and thus I think we should keep complexity there.

Avoid patch-fixing race condition bugs

One reason that’s been brought up for doing this is to avoid race condition bugs. Truth be told, the race condition would still be present, but it would be strongly guarded in this one case. While this might solve the highest priority bug, it often means that other, hard to see, race condition bugs are left unchecked. Rather than target this one case, we should make sure all of the API's code is safe against race conditions.

@batpad: We should remove the option to upload GeoJSON from the frontend for now

I’m not sure what is different between a CLI and a web app that makes a CLI safer at this than the web app. The web app has to be updated to handle writing to this API, which is why I recommended doing it there. That said, if most people are using CLIs already, then trimming down the frontend a bit more seems possibly worthwhile. @kepta do you have thoughts on this?

batpad commented 7 years ago

Thanks much for the specificity and clarity in your answers @mcwhittemore .

Just wanted to flesh this bit out a bit to get more specific on how you're seeing the flow for the client to fetch and assign unlocked items:

GET /tasks/:task/items?lock=unlocked to list all unlocked items
in client var item = unlockedItems[Math.floor(Math.random()*unlockedItems.length)] to pick a random item
PUT /tasks/:task/items/:item { lock: 'locked' } to lock the item

So, as a client - when the user opens a task, I fetch a list of unlocked items. Then I randomly assign items to users. I would need to make an additional request before the user starts working on the item to check that it is not locked by someone else in the meanwhile? Or should the client optimistically lock a bunch of items after fetching from the server?

So, I agree with everything you mention, and would 100% agree that the current way of the server locking an item on fetching an item makes things really hard to reason about. And having an end-point that randomly returns different results makes things hard to reason about.

Trying to take a small step back here and try and outline the problems we are trying to solve and see if there's an alternative solution:

Things we are pretty sure about architecture-wise:

A proposal (not hugely super well thought out):

When items are being added to a task, we populate an order field or so in the database that is an integer field that we can index on and ORDER BY. When populating items in a task, we derive the value for this field in a pseudo-random, but replicable way - eg. hash the FeatureCollection of the item - this will ensure the possibility of ordering features by something other than the order in which they were uploaded, and give us a reproducible and easy to reason about way to say something like - "give me a list of unlocked items ordered by the order field with a limit of 1".

Maybe I'm over-thinking this, but I think it's really important for this to be seamless on the client, both ensuring that two people never (or extremely rarely) get the same item, and making sure the transition from one item to the next is as quick as possible, and that there are end-points in the API to support that, and it would be good to achieve at least feature parity in terms of user experience with the current backend.

It's also entirely possible I'm misunderstanding something here, would be happy to chat and try and hash this out a bit.

Most other things seem really clear and well thought out to me. Looking forward to continuing building this out.

mcwhittemore commented 7 years ago

hmmm... @batpad, I'm not sure this is a big problem.

Below I wrote some code for how I see the client handling the possibility of an item being locked while trying to lock it. The code assumes that the backend has solved the race condition problem we outlined above. It also includes raven for logging lock conflicts to Sentry. This way we could look at the data and find out how often lock conflicts are being seen by the client.

var getUnlockedItems = require('…'); // resolves a list of items
var lockItem = require('…'); // resolves the item passed to it, rejects if that item is locked
var raven = require('raven'); // assumed; I forget the exact client setup

// resolves null if there are no items that can be locked
// resolves an item when the user has a lock on it
var getNextItem = module.exports = function() {
  return getUnlockedItems().then(function(items) {
    if (items.length === 0) return null; // no unlocked items
    var nextIndex = Math.floor(Math.random() * items.length);
    var next = items[nextIndex];
    return lockItem(next).catch(function(err) {
      if (err.message === 'item is locked') { // not the real error message...
        raven.logError(err); // I forget how you do this, this is just a guess
        return getNextItem(); // try locking another one
      }
      throw err; // just in case the error isn't about the lock
    });
  });
};

Using this code we know:

  1. a user will never get a locked item
  2. a user will always get a random item
  3. a user might get an item close to another user's locked item
  4. all lock conflicts will be logged

The 3rd point is the problematic one. Random seems to have been working for us so far so I suspect we're ok here. It's possible that the difference between Postgres and ElasticSearch will result in random not being as good. Let's track that and find out.

Hashing the FeatureCollection might improve this a bit, but I think it's putting energy into the wrong problem. Selecting random items does not solve 3 no matter how the random item is selected or where the random selection happens. It might get better, but not perfect. Rather than spend time getting random to work well, we should find out how much of a problem this is and then, if/when it's a problem, consider a spatial lock.

batpad commented 7 years ago

Below I wrote some code for how I see the client handling the possibility an item being locked while trying to lock it.

Ah, I think I slightly misunderstood what you were proposing above - the code example makes things clear.

So here are my questions around this -- in the code outlined above, I see you're envisioning the client making a request to getUnlockedItems every time it needs to get a next item. How do you see this working for very large tasks? It's probably unreasonable for the client to fetch 1000s of items every time it needs to get a "next" item. Do you see a reasonable default pagination here? Like, fetch 10 unlocked items or so? This proposal seems fine - I just want to try and find a balance of the client not having to make heavy network requests for every next item, as well as reduce the chance of conflicts.

My main concern here is just the end user experience and any delay between fetching the next item is a waste of someone's time. A way to work around this might also be for the client to always pre-fetch the next item, lock it, and keep it ready for the user when they click "next" - and then fetch the item after that and so forth, so that the user does not feel the lag time of these network requests.

Rather than spend time getting random to work well, we should find out how much of a problem this is and then, if/when it's a problem, consider a spatial lock.

👍 agree - this all sounds good - can we agree on a default pagination for getUnlockedItems so that this remains sane regardless of how many items actually exist in a task?

Thanks for your patience here @mcwhittemore - talking the specifics through in detail really helps in clarifying things and making sure we're on the same page moving forward.

mcwhittemore commented 7 years ago

fetch 10 unlocked items or so

Having a max page size is great. I'd do something more than 10. If 5 people open to-fix at the same time I suspect that there would be at least one conflict with 10. 50 might be a good number. We can make the listing endpoint only return the needed details (aka, not the featureCollection) so that the size isn't greatly changed by more results. This is a pretty common thing to do with listing endpoints.
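A sketch of what that listing endpoint's core could do, with both the cap and the projection (50 is the number floated above, not a decided value):

```javascript
// List unlocked items with a capped page size, returning only the fields a
// "next item" picker needs and leaving the featureCollection out. MAX_PAGE
// and the projected fields are illustrative, not decided.
var MAX_PAGE = 50;
function listUnlocked(items, limit) {
  var size = Math.min(limit || MAX_PAGE, MAX_PAGE);
  return items
    .filter(function(item) { return item.lock === 'unlocked'; })
    .slice(0, size)
    .map(function(item) { return { id: item.id, lock: item.lock }; });
}
```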

Rub21 commented 7 years ago

I could not understand well. Are we going to fetch 10 items or more per request for everyone who wants to get an issue from to-fix?

We also have the to-fix client https://github.com/JOSM/tofix for JOSM, which requests the items one by one.

Rub21 commented 7 years ago

We have the Connectivity-linting stack, which is doing the detection of OSM issues using osmlint. These detections are stored in S3, and a lambda function updates the to-fix tasks.

All detections made by connectivity-linting: s3://tofix/tasks/geojson/*

I think we should have an endpoint to update the task, with a user whose session does not expire. In the previous to-fix we called it a machine user.

mcwhittemore commented 7 years ago

Thanks for the questions @Rub21.

we should have an endpoint to update the task, with a user that the session does not expire

Yea, having machine or robot users is an important thing for us to nail once to-fix is using this new backend.

We are going to fetch 10 items or more per request

The JOSM extension should do one list items request. That request will return somewhere between 10 and 50 items.

mcwhittemore commented 7 years ago

Looking over the TODO list in the opening comment, I think 2 of the things on there can happen later, so I moved them to Next steps. This leaves us with just the list items endpoint:

So close!