Closed: sglangevin closed this issue 6 years ago
I hoped this was just a UI update problem, but by inspecting the requests I can see it's definitely not replicating the person doc, just the report. Also it works sometimes and fails other times, so my current thinking is it's some sort of race condition caused by too many updates all happening at almost the same time (report added, report transitioned, person added).
I'll come back to this soon...
Some related information: @garethbowen
Scenario: (App Workflows)
Something of note: if task Z is configured to trigger when X is false (and X was initially true), both tasks Y and Z appeared in the UI. Task Y only disappears when the UI is refreshed.
@derickl what version were you testing on? 2.14.0? Wondering if this bug exists in older versions and we didn't catch it.
@derickl I think that's a different issue. The behaviour you describe is what I would expect if the configuration doesn't emit the task when X is false. Instead what it needs to do is emit the same task with the same ID but set `resolved: true` so the UI knows to hide it. If that's not the problem, would you mind creating a new issue in medic-projects or catching me on Slack sometime to go over it?
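To make the "same task ID, but resolved" idea concrete, here is a minimal sketch. The `emitTasks` function, the doc shape, and the task fields are all invented for illustration; the real configuration uses the project's own task schema and emitter.

```javascript
// Hypothetical sketch: always emit the task with a stable ID, and flip
// `resolved` rather than omitting the emission when X becomes false.
function emitTasks(doc) {
  const tasks = [];
  // Emit task Y unconditionally, so the UI always receives an update
  // for the same task ID...
  tasks.push({
    _id: doc._id + '-task-y',  // stable ID: the same doc always yields the same task ID
    type: 'task',
    title: 'Task Y',
    resolved: !doc.x           // ...and mark it resolved when X is false
  });
  // Task Z only applies when X is false.
  if (!doc.x) {
    tasks.push({
      _id: doc._id + '-task-z',
      type: 'task',
      title: 'Task Z',
      resolved: false
    });
  }
  return tasks;
}
```

Because task Y is re-emitted with the same ID, the UI can replace the visible task with its resolved version instead of being left with a stale copy until refresh.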
I think I've found the issue...
When the new doc is created we record the `last_seq` of the latest sequence in the database, and the next changes request is made with a `since` parameter of the `last_seq` above. Because that `since` parameter is later in the changes sequence than the new doc, the new doc is not returned.

If I'm right this affects more than just this use case, but we've probably gotten away with it because it only occurs with changes that happen in quick succession.
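A toy simulation of that race, for clarity. The feed, sequence numbers, and doc IDs below are all made up; this is not the real api code, just the skipping behaviour in miniature.

```javascript
// Simulated CouchDB changes feed: each change carries a sequence number.
const changes = [
  { seq: 1, id: 'report-1' },  // report added
  { seq: 2, id: 'person-1' },  // new doc the user should now be allowed to see
  { seq: 3, id: 'report-1' }   // report transitioned
];

// A changes request with `since` only returns changes AFTER that sequence.
function changesSince(since) {
  return changes.filter(c => c.seq > since);
}

// If last_seq is recorded after person-1 was already written (last_seq = 3),
// then asking for changes since that point never returns person-1:
const lastSeq = 3;
const missed = changesSince(lastSeq); // returns [], so person-1 is skipped
```

Requesting from an earlier sequence (e.g. `changesSince(1)`) would have included `person-1`, which is why the bug only bites when the changes land in quick succession.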
This is an intrinsic problem with how we're doing replication and I can't think of a small fix that doesn't still have some sort of race between the changes feed which updates the list of uuids the user is allowed to see and the changes feed which is listening to the old list of uuids they can already see.
However I think we can change the algorithm as follows...

1. The user requests changes with a `since` of their last known sequence.
2. api requests the changes from couchdb with the `since` from step 1 to catch the user up to the latest sequence.
3. api filters the results down to the docs the user is allowed to see and responds.
4. Once caught up, subsequent longpoll requests are serviced from a common changes feed rather than an individual `since` parameter.

The main downside of this is it's a significant rewrite of the changes handler with some major unknowns including correctness and performance.
The upsides are...
@SCdF @alxndrsn If you have the time I'd love a second (and third) pair of eyes on my proposal before I start coding.
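For reference, the proposed catch-up flow might be sketched roughly like this. Every name here (`handleChangesRequest`, `couch`, `sharedFeed`, `filterAllowed`) is hypothetical; it is one reading of the proposal, not the implementation.

```javascript
// Hypothetical catch-up flow: one api -> couchdb request per user until the
// user reaches the current max sequence, then a single shared feed serves
// everyone. `filterAllowed` stands in for api's per-user filtering.
async function handleChangesRequest(user, since, couch, sharedFeed, filterAllowed) {
  const currentMaxSeq = await couch.currentSeq();
  if (since >= currentMaxSeq) {
    // Already caught up: wait on the shared longpoll feed
    // (one api -> couchdb request shared by all users).
    return sharedFeed.waitForChange(user);
  }
  // Catch up from the user's last known sequence...
  const results = await couch.changes({ since });
  // ...filtering in api down to the docs this user may see.
  return filterAllowed(user, results);
}
```

Note the filtering happens in api either way, which is where the performance concerns discussed below come in.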
@sglangevin @garethbowen I was testing this on 2.13.X
@garethbowen From this
The behaviour you describe is what I would expect if the configuration doesn't emit the task when X is false. Instead what it needs to do is emit the same task with the same ID but set `resolved: true` so the UI knows to hide it.
I think this would work, but it would require relaxing the conditions to access the relevant task 'blocks' and adding an extra condition to the resolved calculation.
I'll play around with this and report back, but I believe reports do not exhibit this behaviour.
@garethbowen in step 2, is there one request api -> couchdb for every person responding? If so this feels a lot like filtered replication again: the filter code has to run over the same document UserCount times. It's unclear to me how well this will scale. We won't have the couchdb <-> js process perf issue that filtered replication has, but we're still running the code many more times than we need to, and it's still one unfiltered replication for every replicating user.
Presumably there is also another step, where if a user does a longpoll for since=currentMaxSeq then we wait and don't do anything until another change comes in?
So I am wondering if there are strategies we can take to make it so we only run the code over a given `_id`/`_rev` once.
The simplest strategy I can think of is: when a user requests a `_changes` update we ask CouchDB for the changes without `include_docs`, and use a permanent view to do the data extraction so it only runs once per doc.
Another idea would be an extension to yours: have the function return a collection of ids like the view currently does, memoize that function behind an LRU cache, don't `include_docs`, and pull the docs in a separate request if the cache isn't warmed for that `_id`/`_rev`. This should mean that while we don't necessarily keep everything, we should hopefully be able to have a large enough cache that the last week or so of results can be held in memory and we don't have to pull the docs / re-run the code again.
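A minimal sketch of that caching idea, keyed on `_id`/`_rev` so the extraction code runs at most once per doc revision. `memoizeByRev` and its Map-based LRU are invented for illustration; a real implementation would pick the cache library and sizing more carefully.

```javascript
// Memoize an extraction function behind a small LRU cache keyed on id/rev.
// A doc's content is immutable for a given _rev, so a cache hit means we
// never need to re-fetch the doc or re-run the extraction code.
function memoizeByRev(extract, maxEntries) {
  const cache = new Map(); // Map preserves insertion order: first key is least recent
  return (doc) => {
    const key = doc._id + '/' + doc._rev;
    if (cache.has(key)) {
      const value = cache.get(key);
      cache.delete(key); // refresh recency by re-inserting at the end
      cache.set(key, value);
      return value;
    }
    const value = extract(doc);
    cache.set(key, value);
    if (cache.size > maxEntries) {
      cache.delete(cache.keys().next().value); // evict least recently used
    }
    return value;
  };
}
```

An edited doc gets a new `_rev` and therefore a deliberate cache miss, which is what makes `_id`/`_rev` (rather than `_id` alone) the right key.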
I was considering suggesting we look at how the changes feed is constructed and consider writing our own entirely separate node app that manages that, by keeping running _changes JSON files for every user based on filtering, and then just returning the one they want at the right time. BUT, _changes rules are horribly complicated with 2.0, and I'm not sure we're desperate enough to work through that problem.
@SCdF
in step 2, is there one request api -> couchdb for every person responding?
Yes but it's not long poll, it's just there to catch them up to the latest sequence. Once they're caught up then they switch over to using a common changes request (ie: one request from api -> couchdb for all users). Either way we will be doing a lot of filtering in api and that's where my performance concerns are.
Presumably there is also another step, where if a user does a longpoll for since=currentMaxSeq then we wait and don't do anything until another change comes in?
Yes we could add that as a performance optimisation to take the request straight from step 1 to step 4. It would work without it, but it would be an unnecessary request to the database, that returns no results and ends up at 4 anyway.
When a user requests a `_changes` update we ask CouchDB for the changes without `include_docs`
I don't think you'd want to do this for longpoll requests because if you had 1000 concurrent users you would have 1000 longpoll requests to couchdb that get responses every time a change happens and then have to reestablish the connection. That's why I was treating the catch-up and initial replications separately from the longpoll replications.
I like your idea about using a permanent view to do some data extraction only once and I think that can be incorporated into my algorithm to reduce the load on API.
have that function memoized behind a LRU cache
Yeah that's pretty sweet. LRU isn't quite right because someone doing an initial replication would blow it away, but some sort of LRU+LFU combination would be good. I think this is an optimisation we can add later if it turns out the new implementation is too slow.
Ok, well, I think I'll push forward with the algorithm with the change of using the existing view(s) instead of a pure js function basically as an easy form of caching. First step is to write some integration tests so I have some confidence that I'm not going to break everything completely!
I don't think you'd want to do this for longpoll requests
Sorry, yeah, my example was really just for catch-up, not for longpoll. Since longpoll returns after one document, you can just have one longpoll to couchdb and hand the result to whichever listeners care.
So we can talk Mon / Tuesday, but what is the final algo you're going with?
Pretty much what I proposed initially but using views more heavily to essentially cache the results of the functions. I'm going to have a stab at it today and we can talk about it when you get on in about 12 hours.
Code review please @SCdF
I ended up not implementing the above algorithm and instead came up with a simpler workaround. This will reduce performance, but closes the window where a change with a new ID can be skipped.
If this passes code review feel free to leave the merge and backport to me.
Back to you @garethbowen , comments on the PR
Merged and backported to 2.14.x
This is working in 2.14.0-beta.10. Will continue with more release testing to see if I find any other issues with the changes to replication, hopefully not!
@derickl I think what you've described is a different issue. If it is persisting, please open another issue.
Seen on standard.dev on 2.14.0-beta.9
To reproduce:
This is a blocker for the 2.14.0 release.