medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
468 stars 217 forks source link

Upgrade testing: Migration failure in `extract-person-contacts` due to missing `name` property #3197

Closed ghost closed 7 years ago

ghost commented 7 years ago

While testing upgrades on the KFN Ilam instance, I encountered a migration failure:

2017-03-04T16:15:21.445Z - error: Fatal error running database migrations
Migration "extract-person-contacts" failed with: {"message":"Failed to update
parent on facility <$UUID> - {\n  \"code\": 400,\n
\"message\": \"Place <$UUID> is missing a
\\\"name\\\" property.\"\n}","stack":"Error: Failed to update parent on
facility <$UUID> - {\n  \"code\": 400,\n
\"message\": \"Place <$UUID> is missing a
\\\"name\\\" property.\"\n}\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/migrations/extract-person-contacts.js:259:29\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/controllers/places.js:220:7\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:3694:9\n
at/srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:359:16\n
at iterateeCallback (/srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:861:17)\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:843:16\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:3691:13\n
at apply (/srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:21:25)\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/node_modules/async/dist/async.js:56:12\n
at /srv/storage/gardener/data/working_dir/aHR0cDovL2xvY2FsaG9zdDo1OTg0L21lZGljL19kZXNpZ24vbWVkaWMvbWVkaWMtYXBp/node_modules/medic-api/controllers/places.js:207:18"}

This failure appears to recur more or less continually (same document UUIDs), and appears to prevent medic-api from ever coming back up post-upgrade.

To do: figure out a way that we can either (i) comprehensively prescreen instances for situations that will cause migration failures; or (ii) log migration exceptions/failures in a way that allows the update process to complete.

garethbowen commented 7 years ago

(i) comprehensively prescreen instances for situations that will cause migration failures

That's why we clone and run the migrations before running on production. I can't think of any way to test the migrations other than actually running them.

(ii) log migration exceptions/failures in a way that allows the update process to complete.

We decided to block api startup on migration failure because the data is in an unknown state. I still believe this is the correct approach.

garethbowen commented 7 years ago

The problem here isn't really with the migration. We got much more aggressive about data validations which means if we make any change to an old doc we run the risk of getting this error. To fix this we could either ease back on the data validations, modify this migration to ensure validations pass, or write a new migration to modify all docs to pass the new validation regime (with an early date so it runs before the other migrations).

I suspect this will be a widespread problem.

mandric commented 7 years ago

I think this is a problem with the migration because it's not conforming to current validation standards that are defined in the medic-api lib. extract-person-contacts.js should be updated to include name properties on places so the places are valid.

ghost commented 7 years ago

I think the problem actually is with the migration.

At https://github.com/medic/medic-api/blob/master/migrations/extract-person-contacts.js#L108, we use restoreContact to try to "put back" an original document after a write failure. In restoreContact, places.updatePlace is used, which in turn applies constraints at https://github.com/medic/medic-api/blob/master/controllers/places.js#L169.

After this double-failure, we lose track of work: subsequent runs of the migration produce errors on different documents. Here's a quick log:

$ node server.js 
CouchDB Version: { major: '1', minor: '6', patch: '1' }
Updating docs: _design/medic-client,appcache
DDoc extraction completed successfully
Updating settings with new defaults
Configuration loaded successfully
Translations merged successfully
Detected translations change - reloading
Running migration "drop-couchmark-db"...
Migration "drop-couchmark-db" completed successfully
Running migration "extract-person-contacts"...
Fatal error initialising medic-api Migration "extract-person-contacts" failed with: {"message":"Failed to update parent on facility 2ac59a1a-431a-3eba-16213d23865fc31d - ... }

$ node server.js
CouchDB Version: { major: '1', minor: '6', patch: '1' }
Updating docs: _design/medic-client
DDoc extraction completed successfully
Configuration loaded successfully
Translations merged successfully
Running migration "extract-person-contacts"...
Failed to restore contact on facility 2ac59a1a-431a-3eba-16213d23865fc31d

$ node server.js
CouchDB Version: { major: '1', minor: '6', patch: '1' }
DDoc extraction completed successfully
Configuration loaded successfully
Translations merged successfully
Running migration "extract-person-contacts"...
Fatal error initialising medic-api Migration "extract-person-contacts" failed with: {"message":"Failed to update parent on facility 2ac59a1a-431a-3eba-16213d2386644ca2 - ... }

Note the different document ID on the final run.

I propose that we (i) improve restoreContact so that it simply puts the document back without applying unnecessary validations (we know it's unchanged); and (ii) substitute in a suitable placeholder string for the place's name property, prior to writing it, if it's not found.

ghost commented 7 years ago

As an alternative to proposal two, we could just fix up names prior to upgrading. I have a one-off script that adds missing names and can adapt it into a migration, but need to do some additional testing to make sure it solves the whole problem. Will report back shortly.

I do think we should address proposal one so that extract-person-contacts can be safely re-run after an error.

ghost commented 7 years ago

Okay, fixing up documents beforehand looks viable. The following war crime of a script gets us all of the way through migrations on a few test instances:

#!/bin/sh

post() {
  curl -H 'Content-Type: application/json' "$@"
}

view="{
  \"map\": \"function(doc) {
    if (doc.type != 'clinic'
        && doc.type != 'health_center'
        && doc.type != 'district_hospital') {
      return;
    }
    if (doc.name === null ||
        doc.name === undefined ||
        (typeof doc.contact == 'object' &&
          (doc.contact.name === null || doc.contact.name === undefined))) {
      emit(doc._id);
    }
  }\"
}"

filter='{
  "docs": .rows
    | map(.doc)
    | map({ "name": "Untitled" } + (
      if .contact then
        . + { "contact": ({ "name": "Unnamed" } + .contact) }
      else
        .
      end
    ))
}'

echo "$view" | tr -d '\r\n' \
  | post -d@/dev/stdin "$COUCH_URL/_temp_view?include_docs=true" \
  | jq "$filter" | post -d@/dev/stdin "$COUCH_URL/_bulk_docs"

I can obviously rewrite this as an actual migration script written in Javascript; this is literally stolen from a console window.

ghost commented 7 years ago

Bumping into code review. @garethbowen: do you think this is an okay way to go, or do you have a strong preference for packaging this up as a real migration?

garethbowen commented 7 years ago

@browndav That's fine as is as we shouldn't need it again. In future I think write scripts in node makes them easier to read, more cross platform, and it's easier to use 3rd party libraries, but I'll give you a pass just this once ;)

mandric commented 7 years ago

FFR here's a manual migration I wrote, I think it makes sense how I combine shell and node here: https://github.com/medic/medic-api/tree/0.4.x/migrations/add-uuid-to-scheduled-tasks