> (i) comprehensively prescreen instances for situations that will cause migration failures
That's why we clone the instance and run the migrations there before running them on production. I can't think of any way to test the migrations other than actually running them.
> (ii) log migration exceptions/failures in a way that allows the update process to complete.
We decided to block API startup on migration failure because the data is in an unknown state. I still believe this is the correct approach.
The problem here isn't really with the migration. We got much more aggressive about data validations, which means that if we make any change to an old doc we run the risk of getting this error. To fix this we could either ease back on the data validations, modify this migration to ensure validations pass, or write a new migration that modifies all docs to pass the new validation regime (with an early created date so it runs before the other migrations; see the sketch below).
I suspect this will be a widespread problem.
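A minimal sketch of that last option, assuming the `{ name, created, run }` shape used by the other files in `medic-api/migrations` (the filename, migration name, and date here are illustrative):

```js
// migrations/ensure-place-names.js (hypothetical)
module.exports = {
  name: 'ensure-place-names',
  // The runner executes migrations in 'created' order, so an early date
  // makes this run before extract-person-contacts.
  created: new Date(2015, 0, 1),
  run: function(callback) {
    // Fetch the docs that fail the new validations, fix them up, write
    // them back, then signal success (or pass an error) via the callback.
    callback();
  }
};
```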
I think this is a problem with the migration because it's not conforming to the current validation standards that are defined in the medic-api lib. `extract-person-contacts.js` should be updated to include `name` properties on places so the places are valid.
I think the problem actually is with the migration.

At https://github.com/medic/medic-api/blob/master/migrations/extract-person-contacts.js#L108, we use `restoreContact` to try to "put back" an original document after a write failure. In `restoreContact`, `places.updatePlace` is used, which in turn applies constraints at https://github.com/medic/medic-api/blob/master/controllers/places.js#L169. After this double failure, we lose track of work: subsequent runs of the migration produce errors on different documents. Here's a quick log:
```
$ node server.js
CouchDB Version: { major: '1', minor: '6', patch: '1' }
Updating docs: _design/medic-client,appcache
DDoc extraction completed successfully
Updating settings with new defaults
Configuration loaded successfully
Translations merged successfully
Detected translations change - reloading
Running migration "drop-couchmark-db"...
Migration "drop-couchmark-db" completed successfully
Running migration "extract-person-contacts"...
Fatal error initialising medic-api Migration "extract-person-contacts" failed with: {"message":"Failed to update parent on facility 2ac59a1a-431a-3eba-16213d23865fc31d - ... }

$ node server.js
CouchDB Version: { major: '1', minor: '6', patch: '1' }
Updating docs: _design/medic-client
DDoc extraction completed successfully
Configuration loaded successfully
Translations merged successfully
Running migration "extract-person-contacts"...
Failed to restore contact on facility 2ac59a1a-431a-3eba-16213d23865fc31d

$ node server.js
CouchDB Version: { major: '1', minor: '6', patch: '1' }
DDoc extraction completed successfully
Configuration loaded successfully
Translations merged successfully
Running migration "extract-person-contacts"...
Fatal error initialising medic-api Migration "extract-person-contacts" failed with: {"message":"Failed to update parent on facility 2ac59a1a-431a-3eba-16213d2386644ca2 - ... }
```
Note the different document ID on the final run.
I propose that we (i) improve `restoreContact` so that it simply puts the document back without applying unnecessary validations (we know it's unchanged); and (ii) substitute in a suitable placeholder string for the place's `name` property, prior to writing it, if it's not found.
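A minimal sketch of proposal (i), assuming the nano-style `db.medic.insert` exposed by medic-api's `db` wrapper (treat the exact call as an assumption; the error message mirrors the log above):

```js
var db = require('../db');

// Write the original document back verbatim instead of routing it through
// places.updatePlace, so no validations run against a doc we know we
// haven't changed.
var restoreContact = function(original, callback) {
  db.medic.insert(original, function(err) {
    if (err) {
      return callback(new Error(
        'Failed to restore contact on facility ' + original._id));
    }
    callback();
  });
};
```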
As an alternative to proposal (ii), we could just fix up names prior to upgrading. I have a one-off script that adds missing names and can adapt it into a migration, but I need to do some additional testing to make sure it solves the whole problem. I'll report back shortly.
I do think we should address proposal (i) so that `extract-person-contacts` can be safely re-run after an error.
Okay, fixing up documents beforehand looks viable. The following war crime of a script gets us all of the way through migrations on a few test instances:
```sh
#!/bin/sh
# One-off fixup: give every place (and place contact) a placeholder name so
# that the new validations, and therefore the migrations, can pass.
# Requires curl and jq; $COUCH_URL must point at the medic database.

post() {
  curl -H 'Content-Type: application/json' "$@"
}

# Temp view (CouchDB 1.x): emit every place whose name, or whose contact's
# name, is missing. The 'doc.contact &&' guard matters because
# typeof null == 'object' in JS.
view="{
  \"map\": \"function(doc) {
    if (doc.type != 'clinic'
        && doc.type != 'health_center'
        && doc.type != 'district_hospital') {
      return;
    }
    if (doc.name === null ||
        doc.name === undefined ||
        (doc.contact && typeof doc.contact == 'object' &&
          (doc.contact.name === null || doc.contact.name === undefined))) {
      emit(doc._id);
    }
  }\"
}"

# jq filter: fill in placeholder names and shape the result for _bulk_docs.
# '//' replaces both missing and null names (a plain object merge would
# leave explicit nulls in place).
filter='{
  "docs": .rows
    | map(.doc)
    | map(.name = (.name // "Untitled"))
    | map(if .contact then .contact.name = (.contact.name // "Unnamed") else . end)
}'

# Strip newlines so the view is a single line of valid JSON, query it with
# the docs included, patch them, and write them back in one bulk update.
echo "$view" | tr -d '\r\n' \
  | post -d@/dev/stdin "$COUCH_URL/_temp_view?include_docs=true" \
  | jq "$filter" | post -d@/dev/stdin "$COUCH_URL/_bulk_docs"
```
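To run it, `COUCH_URL` needs to include credentials and the database name, e.g. `COUCH_URL='http://admin:pass@localhost:5984/medic' sh fix-place-names.sh` (the filename is illustrative). Note that `_temp_view` only exists on CouchDB 1.x, which matches the version in the logs above.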
I can obviously rewrite this as an actual migration script in JavaScript; this is literally stolen from a console window.
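For reference, a sketch of what that might look like, assuming the `{ name, created, run }` migration shape and a nano-style `db` wrapper with `list` and `bulk` (all names are illustrative and untested):

```js
var db = require('../db');

var TYPES = ['clinic', 'health_center', 'district_hospital'];

// Fill in placeholder names; returns true if the doc was modified.
var fixup = function(doc) {
  var changed = false;
  if (doc.name === null || doc.name === undefined) {
    doc.name = 'Untitled';
    changed = true;
  }
  if (doc.contact && typeof doc.contact === 'object' &&
      (doc.contact.name === null || doc.contact.name === undefined)) {
    doc.contact.name = 'Unnamed';
    changed = true;
  }
  return changed;
};

module.exports = {
  name: 'add-place-names',
  created: new Date(2015, 0, 1), // early, so it runs before other migrations
  run: function(callback) {
    db.medic.list({ include_docs: true }, function(err, result) {
      if (err) {
        return callback(err);
      }
      // Keep only places that actually needed fixing, then bulk-write them.
      var docs = result.rows
        .map(function(row) { return row.doc; })
        .filter(function(doc) {
          return doc && TYPES.indexOf(doc.type) !== -1 && fixup(doc);
        });
      if (!docs.length) {
        return callback();
      }
      db.medic.bulk({ docs: docs }, callback);
    });
  }
};
```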
Bumping into code review. @garethbowen: do you think this is an okay way to go, or do you have a strong preference for packaging this up as a real migration?
@browndav That's fine as-is, as we shouldn't need it again. In future I think writing scripts in node makes them easier to read and more cross-platform, and it's easier to use 3rd-party libraries, but I'll give you a pass just this once ;)
FFR, here's a manual migration I wrote; I think the way it combines shell and node makes sense: https://github.com/medic/medic-api/tree/0.4.x/migrations/add-uuid-to-scheduled-tasks
While testing upgrades on the KFN Ilam instance, I encountered a migration failure. The failure recurs more or less continually (same document UUIDs) and appears to prevent `medic-api` from ever coming back up post-upgrade.

To do: figure out a way that we can either (i) comprehensively prescreen instances for situations that will cause migration failures; or (ii) log migration exceptions/failures in a way that allows the update process to complete.