medic / cht-conf

A command-line interface for configuring Community Health Toolkit applications
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
22 stars 25 forks source link

extending form versions with form history #505

Open bamatic opened 2 years ago

bamatic commented 2 years ago

Is your feature request related to a problem? Please describe. the cht-conf version 10 add the xml_version json key to the form doc when the action upload-app-forms is executed, but the new form of a given form._id replaces the old one and the old xml file is overwritten. The cht add the form version of the form doc to its data_records. If a data analyst found a trouble with a current form, he can always learn about collected data using the form, but if the problem is with historical old data, he knows fields sent into the data_record but he is not able to investigate data collection processus and for example, learn about how such a value is possible in a variable;

Describe the solution you'd like We propose that upload-app-forms does not replace old version but :

  1. before to the replacing, the old form is copied to a new doc but with type: "old-form" and the _id of replaced form is stored in the doc with type "old-doc" using the key "doc-id";=:"form:formId", attachements, xml, form_html and model are saved to this type:"old-doc" new doc
  2. the form is replaced as now

Additional context some endpoints could be added to the cht-core forms API to retrieve all versions of a given form:formID; and to get the xml file of a given form version something like GET api/v1/forms/{formid}.{format]/{sha1}

garethbowen commented 2 years ago

@bamatic Thanks for raising this. Can I ask what you want to use the old form for?

bamatic commented 2 years ago

@garethbowen With the version system, we have the version of the form with which the data was generated, but not the form itself.

we know this data_record was generated with a different form than this one, but we don't have the different forms used with each version. All the fields of a given form, won't be present in a data_record, but if we have the form definition we now all the fields, the type and we can be sure that a given table/view/matview has all needed columns for a given version.

in addition to this, even if the normal use case of the xml Odk form file is to build the enketo html form interface, the xml file of the form can be used by the data analyst , data engineers in order to understand and analyze the data.

for example, we have a data_record with form="anc_followup" marked with the form_version.sha256="x" and reported in 2018 year, we want to know about this data_record, there are some values that we do not understand, they are not compatible with the actual form definition, maybe the data is even incompatible with the actual protocol, so we look to the form doc._id="form:anc_followup" but we find that the xml_version.sha256="y", so the form._attachments.xml, that contains fields definitions is not the one that was used to generate and send our data_record

So data analyst will look for old-forms of this form, the docs with type:"old-form" and doc_id:"form:anc_followup" and select the one that has xml_version.sha256=="x" so the xml attachments of this old-form contains all form fields, all question labels, the skip logic and even the select1 choices avaliables at the moment of the generation of the concerned data_record; this is very helpful for our data analyst and even for coding the json extractions in the postgresql database

Sometimes a form question is changed from one question group to another, so we have to do coallesce(doc#>>'{fields,old_group,question}', doc#>>'{fields,new_group,question}') as question to extract correctly both data_records, data_records from first version and data_records from the second version. When the calculate formula changes and when variables are renamed we have similar cases. If we don't have both form definition we don't know if both variables named "question" has the same meaning or not, and so if we should coalesce them or keep as independent variables.

When a form is updated but the json extraction in the Postgresql view/matview is not, and we realize on this error later, we do not have the form definition of the first version, second, third version... but only the last one and the task become difficult to extract all versions data.

garethbowen commented 2 years ago

Thanks for the explanation! Some of these concepts also hold for other aspects of the configuration, not just forms. I expect most configurations are stored in some form of change management repository so that every version of the form and config is available for this sort of analysis. Is that the case with your configuration? If so, why is storing the form in couchdb better than going back to the source?

bamatic commented 2 years ago

Our use case: A data analyst does not undestand a given data in postgresql, she needs to find in the postgres view sql definition from which form variable comes each column in the view, find the form definition in github history, understand xlsform format...

In health care NGO, is very usual to have medical staff with the ability to analyse data with BI software but not always with development skills

We are using CHT forms api to build a dynamic cht forms data dictionnary. The main advantage of having the forms in couchdb is that with docs in couchdb, other applications can very easily consume apis like the cht api v1/forms or the couchdb api, and automatically cross form variable definitions with postgrsql view definitions, with this view definitions obtained also dynamicaly from postgresql server, to generate a kind of automatic data dictionary that presents in a simple web interface, for each form, every variable and its relation with the correspondig variable/column in postgresql, questions labels in every language are also presented in this web interface and select choices in distinct languages, this gives a lot of information and context without the need of going to posgresql and after look at github versions. This solution does not need human maintenance, to add descriptions to variablse or whatever all is dynamically updated from the API. Using the GitHub API or other repo api should be possible, but our cht forms documentation tool would depend on the repo API instead of on the CHT API

We are in the point to finish a first version of such a cht form data catalog system and adding to it old forms information would be realy cool the users of this application are able to know from which form variable comes every postgresql column without the need to manually search form definition in the config repo and without the need to look for json data extraction in views definition sql files

ohter use case: when ingesting data from couchdb change feed to data warehouse, if we have a non nested json doc that is sent to a table that has a column for each json key, ingestion is automatic, only need to send the flattened json to a pub/sub server and google bigquery table ingest it automatically. If the software that is consomming the change feeds finds a data record with a version that do not have a correspondig table, maybe a new form version, the software could get the form associated with this version dynamically from couchdb and use this to create the corresponding table on the fly with te correct schema, a kind of "flattening on the fly in ingestion time"

In muso's largest database we have 51 forms installed, some of them has been modified 50 times, form 2018, the number of docs old-form won't be very high ....

Another possibility, could be, to add another cli options, --save-current-forms and only users that explicitily use this options will save the old forms

In our repo https://github.com/medic/config-muso file forms/app/anc_followup.xml the oldest version i'm able to find is from 03/20/20 02c61d539c264b316200e66b2a009e92df610d25 and this data record reported in 2017 https://muso-mali.app.medicmobile.org/medic/3D9FA773-ACFD-4062-BEE4-DC4C6DB25DC3

What about if a commit is merged in master at 2022-05-05 but by X reason, master is not deployed to production at this date. A data_record from 2022-05-15 would be intrepreted with the form of 2022-05-05 but this form had not been used to send the concerned data record.

less realistic case, a developper deploy with cht-conf from his own machine without previous github repo commit.

Having repository history and couchdb history can be, for me, helpfull and shouldn't have a lot of impact in the database, no one wants to continuously change form version in production and thus have thousand of old-forms for every form.

Yes certainly not only forms need a history, but with data_records and form versions is very very usual to need this kind of analyse.

garethbowen commented 2 years ago

We are using CHT forms api to build a dynamic cht forms data dictionnary.

I'm really interested in this concept because it might be really useful to other projects too. How do you get the form fields out of the report? Something like xpath over the xform attachment? Is this a step that could be useful in cht-conf too, so the form automatically has a JSON property, or an attachment, which lists all defined data fields? Do you have some code I could look at?

bamatic commented 2 years ago

by now, this works with the currently installed forms, I mean if we restrict ourselves to the forms that are installed at this moment in our cht instance, from the xml attachments of the doc type:form, we are parsing the xml file and comparing the xpath of each question with the extractions dones in the corresponding postgresql view, doc#>>{fields, group1, group2,variable} as my_col, this view must be named formview_${form_id},( in addition we have in google bigquey a table with the same name, and in google drive a diagram with the same name.)

this {fields, group1, group2,variable} as my_col, correspond to a xpath /form_id/group1/group2/variable, so we extract information of the variable from the xml file like, relevant, required, calculate etc etc and we related each variable with the corresponding column in Postgresql (and in our organization with BigQuery)

When coallesce is present in postgresql, two or more differents xform variables correspond to the same postgresql column, and some postgresql columns do not correspond to any xform variable, like reported_date, latitud and a few others

we get the xml file dynamically calling to the cht api api/v1/forms/{form_id}.xml and the view definition perfoming a sql query to the postgresql server. (In our use case we call BigQuery API and Google Drive Api to get table information and links to google drive documents that are presented to the user in the same interface)

Let me see if i can get an access for you to our repo, but the code is really a very alpha version and not very clear ...

in very complex postgresql extractions, with CTE like, WITH cte_name (column_list) AS ( CTE_query_definition ) I ve difficulties to relate properly xform variables with sql columns, but since in almost all formviews the extractions are like #>> {q,a,a,a,b } globally works, but I need to improve things...

Now if we finaly go to 3.15 and using the cht-conf versioning system, I've been thinking that for every form version we could do the same thing, extract form fields from the xml xform file and compare old versions to the current postgresql view, since the current postgresql view is extracting current and former data records, but we would need to have in couchdb the old-forms to use the same approach