Send outbound push without delay

abbyad commented 4 years ago

What feature do you want to improve? Outbound push

Describe the improvement you'd like Sentinel currently schedules outbound push every five minutes. For some workflows, like messaging using RapidPro, it would be ideal if those pushes happened as soon as possible so that the creation of a report or contact could trigger a messaging workflow immediately for CHWs or patients.

Describe alternatives you've considered It is possible to manually reduce the schedule interval from 5 minutes to 2 minutes, but it is unclear what implications that has if a push doesn't complete before the schedule is run again. There could be scenarios when even 2 minutes is a long time to wait for the outbound push to complete, eg phone number verification.

Additional context Setting the interval to 1 minute caused errors and the outbound push didn't work. Also, having the push interval to be less than 2 minutes is possibly more important for testing and demo purposes than in real life. We don't have an actual use case or deployment needing it to be more immediate yet.

SCdF commented 4 years ago

Note on eng feasibility:

The logic for outbound is in sentinel/schedules, whereas the logic for the transition that would need to call that logic is in a shared lib. So some refactoring would need to be done to make that clean
There is some work coming up attached to #6116 that will be the first step to making our scheduler cleaner and more useful, on the way to allowing for cron-like configuration. This would still be at best once per minute, but it will solve the issues had with setting the schedule interval too low

isaacholeman commented 4 years ago

I just wanted to confirm that we have at least 4 funded implementations that would benefit from or require this:

Some options for messaging for covid response would be possible if the delay were shortened (covid response has unrestricted funding already, though next implementation site is not set).
There are now three replications of the initial VMMC workflow, in Zim, South Africa, and South Africa. CDC funds the first two, the third is a well-funded 5-year NIH study. All are led by I-TECH, with Medic's direct technical support.

With this set of projects in place, we have time and resources to get this right, yet at the same time the covid response is time sensitive. If there were a quicker way to bring the turnaround down to a minute or two, then we could buy ourselves more time (months at least) for a more instantaneous chat experience.

MaxDiz commented 4 years ago

^^ @ranjuts - I don't think this was included in the last prioritization round. May want to bump up the priority list

abbyad commented 4 years ago

We have been using the scheduler set to 1 minute in a test instance for the past couple of weeks, and it works well for outbound push. @SCdF what are the secondary effects of doing this? Since this relates to schedules outside of outbound push, what are the potential downsides or tradeoffs to consider?

SCdF commented 4 years ago

@abbyad this means that other schedules will run more frequently. This in theory shouldn't matter much, except that there may be more load on the server.

abbyad commented 4 years ago

Thanks, @SCdF, what are the other schedules and how might we can we get a sense of the additional load? Would you recommend just making outbound push as immediate as possible, and leaving the other schedules as they are?

SCdF commented 4 years ago

I think if we get outbound sending immediately I think setting the scheduler back to 5 minutes is reasonable? The only reason it will fail to send immediately is if:

We have misconfigured something and so it will fail each and every time
Their server is down / overloaded, and so giving it some time is probably a good idea

garethbowen commented 4 years ago

I agree. The ultimate solution is to run this as a schedule and a transition. This way the first attempt is made immediately and if that fails it'll be retried every 5 minutes. I don't think this is particularly challenging, it just means repackaging the code so it can be called by both.

isaacholeman commented 4 years ago

Cool, that makes a lot of sense. Will it be configurable whether or not to use the transition at all? We're generally going to want that for use cases that involve back and forth messaging, but I'd imagine we'd only want the scheduler for use cases that involve sharing of records, referrals etc.

garethbowen commented 4 years ago

I'd imagine we'd only want the scheduler for use cases that involve sharing of records, referrals etc.

What is the benefit of waiting up to 5 minutes to share records and referrals rather than attempting to share immediately?

isaacholeman commented 4 years ago

I had in mind @SCdF's comment from earlier in the thread:

this means that other schedules will run more frequently. This in theory shouldn't matter much, except that there may be more load on the server.

If we're not worried about performance or cost implications of more load on the server, then I wouldn't have any reason to care.

garethbowen commented 4 years ago

I think Stefan's comment was regarding Marc's workaround about setting the schedule period to 1 minute. Doing this causes every scheduled task to run every minute to check for any potential tasks in the database which can have performance impacts.

Using a transition means we don't have to configure the schedule to run as frequently as we will attempt to send the report immediately. This is much better for performance because invoking outbound push on a single report is much cheaper than invoking all schedules for all possible tasks.

abbyad commented 4 years ago

I have a good sense of what runs as transitions, but curious what we have running as schedules. Is there a list, like we have for transitions?

garethbowen commented 4 years ago

This is the source list: https://github.com/medic/cht-core/blob/master/sentinel/src/schedule/index.js#L8

garethbowen commented 4 years ago

Ready for AT in 6306-immediate-outbound - do at the same time as #6223

SCdF commented 4 years ago

AT instructions:

outbound pushes should send (or attempt to send) immediately. This will not create a task but it should update their info doc in the same way that the tasks currently do when they successfully complete
if they fail for any way (bad config, bad conversion, server is down or times out etc) the transition will succeed and move on to process other documents, but a task will be created
this task will be attempted again roughly every N minutes (default 5) and will keep getting reattempted until it succeeds

ngaruko commented 4 years ago

@brad1905 - Additional note on AT (for textit/rapidpro integration):

See outbound docs here.

2 . Check if "mark_for_outbound": true is in transitions , otherwise add that

{
  "transitions": {
    "mark_for_outbound": true
  }
}

Add outbound configuration property eg :

"outbound": {
    "textit-patient": {
        "relevant_to": "doc.type === 'person' && doc.phone",
        "destination": {
            "base_url": "https://textit.in",
            "auth": {
                "type": "header",
                "name": "Authorization",
                "value_key": "textit.in"
            },
            "path": "/api/v2/flow_starts.json"
        },
        "mapping": {
            "flow": {
                "expr": "'[texit-flow-number-for-the-actual-flow]'"
            },
            "urns": {
                "expr": "[ 'tel:' + doc.phone ]",
                "optional": true
            },
            "extra.patient._id": "doc._id",
            "extra.patient._rev": "doc._rev",
            "extra.patient.name": "doc.name",
            "extra.patient.date_of_birth": "doc.date_of_birth",
            "extra.patient.medic_id": "doc.patient_id"
        }
    }

Add medic-credentials to CouchDb -

(value = api token to be provided...)

Create a person with a phone number (this is the trigger as "relevant_to": "doc.type === 'person' && doc.phone",)
Two ways to verify : a) whatever the flow is meant to do in textit b) check in docker logs : docker exec -it medic-os /bin/bash less /srv/storage/medic-sentinel/logs/medic-sentinel.log (inside container) If successful logs will have: [2020-05-05 02:08:46] 2020-05-05 02:08:46 ESC[32mINFOESC[39m: Pushed 9b46c75d-d51e-4e7f-9530-af2718336d07 to textit-patient

Otherwise, there might be logs with some info on what went wrong such as:

[2020-05-05 01:40:00] 2020-05-05 01:40:00 ESC[31mERRORESC[39m: Failed to push 69
763819-beff-516a-bf7a-7a178f298fb4 to textit-patient: { Error: CouchDB config ke
y 'medic-credentials/textit.in' has not been populated. See the Outbound documen
tation.

brad1905 commented 4 years ago

@SCdF The fix for this issue has eliminated the delay but has caused another problem. Pushes are scheduled immediately instead of waiting 5 minutes, but 2 messages are sent instead of 1. See 2 pushes for one identical id 1 second apart in attached log file screenshots.

SCdF commented 4 years ago

So my guess is moving this code to happen in the transition has exposed a flaw / issue / oddity with Sentinel, which is that depending on what transitions you've configured it may end up executing over the same transition multiple times. For example, if you're pushing a patient you'll get one for the first write and then another after the patient_id is written.

So with the following config:

    "transitions": {
      "mark_for_outbound": true,
      "generate_patient_id_on_people": true
    },
    "outbound": {
      "stefan": {
        "relevant_to": "doc.type === 'person'",
        "destination": {
          "base_url": "http://localhost:3333",
          "path": "/api/v1/patient/records"
        },
        "mapping": {
          "id": "doc._id",
          "rev": "doc._rev",
          "name": "doc.name",
          "shortcode": "doc.patient_id"
        }
      }
    },

Will generate the following output to a mock server:

➜  _scratch node sih-mock.js
SIH mock up on 3333
2020-05-05T07:20:04.823Z records {
  host: 'localhost:3333',
  accept: 'application/json',
  'content-type': 'application/json',
  'content-length': '139',
  connection: 'close'
} {
  id: 'ec783bad-2b26-46c4-bea2-eff87bc523ba',
  rev: '1-a251490cc743d23aa0e14f04b9cc8e39',
  name: 'Multiple firings test',
  shortcode: '22366'
}
2020-05-05T07:20:04.929Z records {
  host: 'localhost:3333',
  accept: 'application/json',
  'content-type': 'application/json',
  'content-length': '139',
  connection: 'close'
} {
  id: 'ec783bad-2b26-46c4-bea2-eff87bc523ba',
  rev: '2-8c76be92939bf9b34cd230b103e61db8',
  name: 'Multiple firings test',
  shortcode: '22366'

The flow here is:

rev 1 is written by the user, which sentinel picks up
generate_patient_id_on_people runs, creates a shortcode, attaches it, and tells sentinel that it's written to the doc being worked on
mark_for_outbound runs, and sends an outbound request
Sentinel finishes with transitions, and since at least one transition said that it wrote to the doc it then writes that document back to CouchDB, creating rev 2
rev 2 is then picked up by Sentinel. Sentinel doesn't ignore this write because maybe changes in the first round has written to it in such a way that there are other transitions that can potentially run. Whether we need this is debatable.
generate_patient_id_on_people runs, but it can detect that it ran last time and it knows exactly what it does (adds a patient id) so it knows it doesn't have to run again
mark_for_outbound runs, and runs again.

The problem is that there isn't really any decent way for outbound to know if it should run again that works for all cases. Maybe the second revision is actually because the name changed? You may now start constructing elaborate de-duping solutions in your mind palace, but they can all fall over now or in the future depending on how things are configured, and is a much more complicated topic than 3.9 has time budget for.

Having outbound support sending the same document multiple times was a mostly arbitrary decision I made back in SIH days. The most obvious solution here is to just reverse that: outbound documents only send once, on their first revision, and any subsequent edits are not supported.

@abbyad how do you feel about that last paragraph (you may ignore everything else at your convenience!). It is probably worth doing a runaround users to make sure no one is relying on that (it is doubtful), though I am wondering if you already know enough to make that exercise unnecessary.

abbyad commented 4 years ago

Is there a way to identify from the outbound condition whether outbound has previously executed? If not, then I think running once it's better for now than running outbound blindly with any arbitrary edit of the doc.

This means we can't run outbound if a patient's phone number is edited, to verify it again. Running once it's better than duplicates, and I don't know of any project using this phone verification example, but worth nothing if that becomes a requirement in the future.

SCdF commented 4 years ago

Is there a way to identify from the outbound condition whether outbound has previously executed?

Is "the outbound condition" referring to the relevant_to function? If so, then no, because that function only has access to the doc, and the outbound execution log is against the doc's info doc.

Having said that, even if you did have access to it this wouldn't solve for:

This means we can't run outbound if a patient's phone number is edited, to verify it again.

Because knowing it has or hasn't run before doesn't let you know if you've already sent it for that particular phone number. Solving for this kind of thing requires a more complicated solution that is outside of the scope of 3.9.0 (unless we realise that we need to support both, in which case we might need to do something more complicated!), and there are several options we'd need to consider to best support this.

abbyad commented 4 years ago

Agreed, any more complex evaluation seems beyond the scope of this feature for 3.9.

SCdF commented 4 years ago

@brad1905 back to AT! In addition to other AT notes:

Outbound messages should only send once, per record / config combination, regardless of later edits

SCdF commented 4 years ago

Actually hold off on that: I'm not sure, pending a response from @derickl, that we can move forward with this plan

derickl commented 4 years ago

@SCdF

Just had a chat with @kitsao and we can work around the limitation of sending once.

garethbowen commented 4 years ago

Great - moving back to AT.

SCdF commented 4 years ago

OK I've fixed the merge conflicts and the latest build should now be testable, cc. @brad1905

SCdF commented 4 years ago

OK here are the complete AT instructions:

outbound pushes should send (or attempt to send) immediately. This will not create a task but it should update their -info doc in the same way that the tasks currently do when they successfully complete
if they fail for any way (bad config, bad conversion, server is down or times out etc) the transition will succeed and move on to process other documents, but a task will be created
this task will be attempted again roughly every N minutes (default 5) and will keep getting reattempted until it succeeds
Outbound messages should only send once, per record / config combination, regardless of later edits

brad1905 commented 4 years ago

AT has passed for desktop browser and Android app and this is ready to be merged

medic / cht-core

Send outbound push without delay #6306