ushahidi / platform

Ushahidi Platform API version 3+
http://ushahidi.com

SMS alerts aren't sending even though it's all configured #1655

Closed jshorland closed 7 years ago

jshorland commented 7 years ago

I'm not receiving SMS alerts from usaelectionmonitor.ushahidi.io

I'm assuming this is happening everywhere

rjmackay commented 7 years ago

Hrm. The queue is huge again and seems stalled.

rjmackay commented 7 years ago

I've purged the job queue and slightly increased the number of workers, which might help. However, this seems to be a recurring issue. Brainstorming solutions with @willdoran:

We have a recurring issue with the .io dataproviders queue backing up. I think it's just not processing jobs as fast as they come in. My initial thoughts are:

  1. Remove ansible as the middleman between celery + php
  2. Try to only run jobs for deployments which have providers enabled. Sadly that's easier said than done: we have to query the platform DB to check status.
    • I could expose an API to update dataproviders for a specific deployment
  3. Just add more workers, or make workers work under higher load, but we'd probably max out DB connections eventually

..

  1. Seems like it would reduce an overhead cost
  2. Could we cache the results to a local table deployments_using_datasources, check the platform DBs only every X amount of time, and then update that local table?
  3. This one depends, I think, on whether we are only just short of keeping up or whether the workers are way behind.
    • I think we're only just short. I've made a small tweak to the number of workers, but it builds up to 500k jobs stuck in the queue
    • If we are just short, then tuning up might work to get some stability. I definitely think running the jobs only for the instances that use datasources would be a dramatic load reduction.

Once we have laravel and rewrite the data providers, we could make deployments use a shared queue properly and throw away rabbitmq. It gets much better once we have just 1 db, but even before that we can simplify. So any rabbitmq optimization we do right now will get thrown away, whereas moving things to laravel (in the cloud interface) will last longer. Similarly, tweaking the dataproviders in platform is probably more worthwhile.
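As a rough illustration of the caching idea in reply 2 above (remember which deployments use datasources and only re-check every X amount of time), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class name, the `fetch_enabled` callable standing in for the expensive platform DB check, the default TTL, and the in-memory set used in place of a local `deployments_using_datasources` table.

```python
import time
from typing import Callable, Iterable, Optional, Set


class EnabledDeploymentsCache:
    """Remembers which deployments have data providers enabled.

    The expensive lookup runs at most once per `ttl` seconds.
    """

    def __init__(
        self,
        fetch_enabled: Callable[[], Iterable[str]],  # hypothetical: queries the platform DBs
        ttl: float = 300.0,                          # hypothetical refresh interval, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_enabled = fetch_enabled
        self._ttl = ttl
        self._clock = clock
        self._enabled: Set[str] = set()
        self._loaded_at: Optional[float] = None

    def enabled(self) -> Set[str]:
        """Returns the cached set, refreshing it first if it is older than the TTL."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._enabled = set(self._fetch_enabled())
            self._loaded_at = now
        return self._enabled

    def should_queue(self, deployment: str) -> bool:
        """True if a dataprovider job is worth queueing for this deployment."""
        return deployment in self.enabled()
```

The scheduler would then call `should_queue(deployment)` before queueing a job, so deployments without providers never reach the queue. The trade-off is that a newly enabled provider can wait up to one TTL before it is picked up.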

rjmackay commented 7 years ago

This might be temporarily resolved, but it needs a permanent fix. I was testing some improvements to the celery task, but it's Friday afternoon so I think I'll have to reconfirm on Monday.

rjmackay commented 7 years ago

Ah, so I did see the queue jump from 0 to 10k a few times. I realized this is because the dataprovider.generate task goes into the same queue: if the queue is overloaded, then celerybeat queues up multiple generate tasks. When they finally run, they queue up WAY more tasks, overloading the queue some more.

We can reduce the frequency we run at, or we need some rate limiting mechanism on celerybeat I think
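One way to picture the "rate limiting mechanism" mentioned above is a guard that lets the scheduler enqueue a new generate task only when no earlier one is still outstanding. The sketch below is plain Python and does not use celery's API. `SingleFlightGuard`, `schedule_generate` and the `enqueue` callable are hypothetical names, and the in-memory set only works within one process: a real setup with separate beat and worker processes would need a shared store instead.

```python
import threading
from typing import Callable, Set


class SingleFlightGuard:
    """Allows at most one pending or running instance of a named task."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Set[str] = set()

    def try_acquire(self, name: str) -> bool:
        """Marks the task as outstanding; returns False if it already is."""
        with self._lock:
            if name in self._pending:
                return False
            self._pending.add(name)
            return True

    def release(self, name: str) -> None:
        """Called by the worker once the task has finished (or failed)."""
        with self._lock:
            self._pending.discard(name)


def schedule_generate(guard: SingleFlightGuard, enqueue: Callable[[str], None]) -> bool:
    """Runs on every scheduler tick; enqueues only if no generate task is outstanding."""
    if not guard.try_acquire("dataprovider.generate"):
        return False  # an earlier generate task has not run yet, so skip this tick
    enqueue("dataprovider.generate")
    return True
```

With a guard like this, a stalled queue would hold a single generate task instead of one per tick, so the burst of follow-up jobs happens once when the queue recovers.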

rjmackay commented 7 years ago

@tuxpiper could you reduce the frequency of the dataprovider task for now and purge the dataprovider queue? That should keep this functioning till Monday.

tuxpiper commented 7 years ago

Done that. Let's hope it holds over the weekend.

rjmackay commented 7 years ago

Deploying the change to remove ansible from dataproviders task. Will check how that speeds up the process.

rjmackay commented 7 years ago

Dropping ansible takes us back to a 15min run time. Still need to optimize, I think.

tuxpiper commented 7 years ago

How about adding a command mode to the ushahidi command line tool? That would cut down the repeated work of forking php processes and bootstrapping the platform every time. i.e.:

$ ./bin/ushahidi command
stdin< { "DB_HOST": "...", "DB_NAME": "...", "DB_USER": "...", "DB_PASSWORD": "...", "command": "dataprovider incoming" }
stdout> { "result": "OK", "output": "..." }
stdin< { "DB_HOST": "...", "DB_NAME": "...", "DB_USER": "...", "DB_PASSWORD": "...", "command": "dataprovider outgoing" }
stdout> { "result": "OK", "output": "..." }

Just not sure how hard it is to reconfigure the database connection on the fly within the same process.
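To make the proposed protocol concrete, here is a small sketch of the loop such a long-lived process could run: one JSON request per line on stdin, one JSON response per line on stdout. It is written in Python purely to illustrate the idea and is not the ushahidi CLI. The `serve` function, the `handlers` mapping and the `"ERROR"` result value are assumptions, and the hard part raised above (switching the database connection per request) is left to whatever the handler does with the passed-in settings.

```python
import io
import json
from typing import Callable, Dict, TextIO

# A handler receives the per-request settings (DB_HOST, DB_NAME, ...) and returns its output.
Handler = Callable[[Dict[str, str]], str]


def serve(stdin: TextIO, stdout: TextIO, handlers: Dict[str, Handler]) -> None:
    """Reads one JSON request per line and writes one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
            command = request.pop("command")
            # The remaining keys are handed to the handler as per-request configuration.
            output = handlers[command](request)
            response = {"result": "OK", "output": output}
        except Exception as exc:
            # Report the failure and keep the process alive for the next request.
            response = {"result": "ERROR", "output": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()
```

The caller would start this process once, keep its stdin and stdout open, and send one line per deployment, which avoids paying the fork and bootstrap cost on every job.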

jshorland commented 7 years ago

I'm still not receiving any SMS alerts from usaelectionmonitor.ushahidi.io, even though I'm set up to do so. I believe we can figure out this issue when we rewrite the datasource integrations as laid out in #697 -- a Q2 OKR non-negotiable.

tuxpiper commented 7 years ago

Just noting here: we are not lagging that much behind on job execution now. In the specific case of that deployment, it must be a configuration issue with the SMS gateway integration, or with the gateway itself.