medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
466 stars 218 forks source link

People created via SMS are not replicated to restricted users without reloading the app #4166

Closed sglangevin closed 6 years ago

sglangevin commented 6 years ago

Seen on standard.dev on 2.14.0-beta.9

To reproduce:

  1. Log in as a restricted user on phone/tablet
  2. Log in as a full access user or admin on your laptop.
  3. Use Medic Reporter to send N [Name] to register a new person in an area that your restricted user has access to.
  4. See that the person is created on the instance (check the area on your laptop).
  5. See that the person is not listed in the appropriate area on your phone/tablet.
  6. Go to the About page on your phone/tablet and click Reload.
  7. Go back to the appropriate area on your phone/tablet and see that the person has appeared.

This is a blocker for the 2.14.0 release.

garethbowen commented 6 years ago

I hoped this was just a UI update problem but by inspecting the requests I can see it's definitely not replicating the person doc, just the report. Also it works some times and fails others so my current thinking is it's some sort of race condition caused by too many updates all happening almost at the same time (report added, report transitioned, person added).

I'll come back to this soon...

derickl commented 6 years ago

Some related information: @garethbowen

Scenario: (App Workflows)

  1. Person related tasks where information is embedded in a contact doc.
  2. Set some property X which when true should trigger a task Y
  3. Task Y appears
  4. Modify the contact, set X to false and task Y still appears
  5. Refresh the UI, task Y disappears

Something of note is that, if task Z were to be triggered if X were false (and X was initially true), both tasks Y and Z appeared on the UI. Task Y disappears when the UI is refreshed.

sglangevin commented 6 years ago

@derickl what version were you testing on? 2.14.0? Wondering if this bug exists in older versions and we didn't catch it.

garethbowen commented 6 years ago

@derickl I think that's a different issue. The behaviour you describe is what I would expect if the configuration doesn't emit the task when X is false. Instead what it needs to do is emit the same task with the same ID but set resolved: true so the UI knows to hide it. If that's not the problem, would you mind creating a new issue in medic-projects or catching me on Slack sometime to go over it?

garethbowen commented 6 years ago

I think I've found the issue...

  1. Two changes come in at roughly the same time. One is an updated doc the user is allowed to see. The other is a new doc that they would be able to see but we haven't updated the list of validated ids that they're allowed to see yet.
  2. The user's couchdb changes request responds with the updated doc and a new last_seq of the latest sequence in the database.
  3. The browser stores the updated doc correctly and updates the replication checkpoint to the last_seq.
  4. The browser then requests the changes feed again, this time with a since parameter of the last_seq above.
  5. Because the since parameter is later in the changes sequence than the new doc the new doc is not returned.

If I'm right this affects more than just this use case but we've probably gotten away with it because it only occurs with changes that happen in quick succession.

This is an intrinsic problem with how we're doing replication and I can't think of a small fix that doesn't still have some sort of race between the changes feed which updates the list of uuids the user is allowed to see and the changes feed which is listening to the old list of uuids they can already see.

However I think we can change the algorithm as follows...

  1. The browser makes a longpoll changes request since their last known sequence.
  2. API makes a one shot (ie: not-longpoll) changes request including docs and using the since from step 1.
  3. Once couchdb responds, API filters the changed docs to ones the user is allowed to see. This requires a new function to be implemented which when given a user and a doc returns true if the user can see it.
    • If there are any changes left: the filtered response is forwarded back to the browser. The browser then writes the changes and then starts again at step 1 with a fresh since parameter.
    • If there are no changes left (or no changes at all since the given sequence): no response is made to the browser and the request is added to a changes request array.
  4. On startup API registers a single longpoll changes request to couchdb including docs. Every time a change is received API iterates through every request in the changes request array and filters the changes to those the user is allowed to see and if there are any responds to the browser (as in step 3).

The main downside of this is it's a significant rewrite of the changes handler with some major unknowns including correctness and performance.

The upsides are...

@SCdF @alxndrsn If you have the time I'd love a second (and third) pair of eyes on my proposal before I start coding.

derickl commented 6 years ago

@sglangevin @garethbowen I was testing this on 2.13.X

derickl commented 6 years ago

@garethbowen From this

The behaviour you describe is what I would expect if the configuration doesn't emit the task when X is false. Instead what it needs to do is emit the same task with the same ID but set resolved: true so the UI knows to hide it.

I think this would work but it would require relaxing the conditions to access the relevant task 'blocks' and add an extra condition to the resolved calculation.

I'll play around with this and feedback but I believe reports do not exhibit this behaviour.

SCdF commented 6 years ago

@garethbowen in step 2, is there one request api -> couchdb for every person responding? If so this feels a lot like filtered replication again: the filter code has to run over the same document UserCount number of times. It's unclear to me how well this will scale. We won't have the couchdb <--> js process perf issue that filtered replication has, but we're still running the code many times more than we need to, and it's still one unfiltered replication for every replicating user.

Presumably there is also another step, where if a user does a longpoll for since=currentMaxSeq then we wait and don't do anything until another change comes in?

So I am wondering if there are strategies we can take to make it so we only run the code over a _id/_rev once.

The simplest strategy I can think of is:

Another idea would be an extension to your idea: for the function to return a collection of ids like the view currently does, and have that function memoized behind a LRU cache, don't include_docs and pull them in a separate request if the cache isn't warmed for that _id/_rev. This should mean that while we don't keep everything necessarily, we should hopefully be able to have a large enough cache that the last week or so of results can be held in memory and we don't have to pull the docs / re-run the code again.

I was considering suggesting we look at how the changes feed is constructed and consider writing our own entirely separate node app that manages that, by keeping running _changes JSON files for every user based on filtering, and then just returning the one they want at the right time. BUT, _changes rules are horribly complicated with 2.0, and I'm not sure we're desperate enough to work through that problem.

garethbowen commented 6 years ago

@SCdF

in step 2, is there one request api -> couchdb for every person responding?

Yes but it's not long poll, it's just there to catch them up to the latest sequence. Once they're caught up then they switch over to using a common changes request (ie: one request from api -> couchdb for all users). Either way we will be doing a lot of filtering in api and that's where my performance concerns are.

Presumably there is also another step, where if a user does a longpoll for since=currentMaxSeq then we wait and don't do anything until another change comes in?

Yes we could add that as a performance optimisation to take the request straight from step 1 to step 4. It would work without it, but it would be an unnecessary request to the database, that returns no results and ends up at 4 anyway.

When a user requests a _changes update we ask CouchDB for the changes without include_docs

I don't think you'd want to do this for longpoll requests because if you had 1000 concurrent users you would have 1000 longpoll requests to couchdb that get responses every time a change happens and then have to reestablish the connection. That's why I was treating the catch-up and initial replications separately from the longpoll replications.

I like your idea about using a permanent view to do some data extraction only once and I think that can be incorporated into my algorithm to reduce the load on API.

have that function memoized behind a LRU cache

Yeah that's pretty sweet. LRU isn't quite right because someone doing initial replication would blow it away but some sort of LRU+LFU combination would be good. I think this is an optimisation we can add later if it turns out the new implementation is too slow.

Ok, well, I think I'll push forward with the algorithm with the change of using the existing view(s) instead of a pure js function basically as an easy form of caching. First step is to write some integration tests so I have some confidence that I'm not going to break everything completely!

SCdF commented 6 years ago

I don't think you'd want to do this for longpoll requests

Sorry yeah my example was really just for catchup, not for longpoll. Since longpoll returns after one document then you can just have one longpoll to couchdb and hand it to the people who are listening to care.

So we can talk Mon / Tuesday, but what is the final algo you're going with?

garethbowen commented 6 years ago

Pretty much what I proposed initially but using views more heavily to essentially cache the results of the functions. I'm going to have a stab at it today and we can talk about it when you get on in about 12 hours.

garethbowen commented 6 years ago

Code review please @SCdF

I ended up not implementing the above algorithm and instead came up with a simpler workaround. This will reduce performance, but closes the window where a change with a new ID can be skipped.

If this passes code review feel free to leave the merge and backport to me.

SCdF commented 6 years ago

Back to you @garethbowen , comments on the PR

garethbowen commented 6 years ago

Merged and backported to 2.14.x

sglangevin commented 6 years ago

This is working in 2.14.0-beta.10. Will continue with more release testing to see if I find any other issues with the changes to replication, hopefully not!

@derickl I think what you've described is a different issue. If it is persisting, please open another issue.