Injecting mandatory FMTM fields into custom XLSForms

spwoodcock commented 3 months ago

Is your feature request related to a problem? Please describe.

Currently our XLSForm has hidden fields that are mandatory: feature / task selection, digitization_correct questions.
This is prone to breakage if the user uploads a custom XLSForm and accidentally deletes fields (despite being hidden).
We need a more resilient mechanism to allow for custom XLSForm upload flexbility, but also impose our restrictions / requirements that FMTM needs to function well.

Describe the solution you'd like

Option 1: append the fields in the XLSForm, either via pyxlsx or possibly via Pandas, prior to XLS --> XML.
- Load the XLSForm as a dataframe.
- Have pre-prepared separate XLSForms with building questions & digitization_correct questions.
- Read the building selection questions into a dataframe and append to start.
- Read the digitization_correct questions into a dataframe and append to end.
- However, this will introduce a big dependency: Pandas, which is not ideal.
Option 2: use the current XML parsing / manipulation method to inject all required params, after XLS --> XML.
- This method is possible, but may have many fields we need to insert into specific places.
- Needs testing and is more prone to breakage.

Additional context

This first option would probably require we update all form handling to use Pandas, which would be quite an improvement to XLS wrangling.

[!Note] See comment below. We should not use Pandas, but instead python-calamine.

spwoodcock commented 3 months ago

This would make the XLSForms a whole lot more maintainable.

Currently if we have say 10 categories = 10 XLSForms, we need to keep the extra fields we use in FMTM in sync for all of them.

Ideally we have:

1 XLSForm with mandatory fields for FMTM.
10 XLSForms with questions only related to the specific mapping category.
On project creation we merge the selected form with the 'FMTM mandatory fields' XLSForm.

spwoodcock commented 3 months ago

I just did a small assessment of package size between various XLS and XLSX parser libraries:

Pandas is nice to work with, but pulls in so many unnecessary libraries it's not worth it.
openpyxl and xlrd would be required together to parse XLSX and XLS files respectively.
python-calamine seems like the most sustainable choice (it was recently integrated into Pandas, so should have good community support). At the low level, it's actually using a compiled rust binary, making it up to 5x faster than openpyxl and 1.5x faster than xlrd. It's also the smallest, adding only 12MB to the build!

We should use python-calamine for XLS and XLSX parsing.

Sujanadh commented 2 months ago

I found that python-calamine is only for reading xls files but can't use it to manipulate the form fields. We would need another package, for example, openpyxl to modify the form in order to inject fmtm mandatory fields.

spwoodcock commented 2 months ago

Oh no! That's a pain 😅

I considered Pandas but it's a huge library (200MB). Also Polars is similar.

Openpyxl is ancient now and unmaintained - but the xlsx spec probably hasn't changed and perhaps it's just feature complete.

It would mean we need to use it alongside xlrd to read xls though!

Two other options:

Use Pandas or Polars for the simplicity?
Just deal with the XML instead of XLS(X)? (This isn't ideal, but perhaps we can make it easier by parsing the XML to dict)

spwoodcock commented 2 months ago

Acknowledgement I have been a bit silly

Pandas is large, but it's also going to be supported for a long time & is easy to use.
While it takes up a lot of space on a fresh install, I tested adding it to our current FMTM stack & it adds very little...
I assume this is because our other dependencies already install the large subdependencies
We can safely add pandas to the backend dependencies - sorry for the mess around!
While python-calamine is definitely a lot faster, it's overkill for our small xlsx files (I was mainly looking at it as an alternative for pandas)

If we ever need to revisit this

If we ever want to remove pandas for any reason, we can consider calamine again for writing, in parallel with XlsxWriter for writing.

python-calamine for reading. We get a list of lists for the XLSX data.
https://github.com/jmcnamara/XlsxWriter for writing. We can iterate through the list of lists and write a simple XLSX file.

We would need to iterate through the survey and choices sheets to combine our defaults with the custom uploaded XLSX.

from xlsxwriter import Workbook
from python_calamine import CalamineWorkbook

workbook = CalamineWorkbook.from_path("form.xlsx")
xlsx_data = workbook.get_sheet_by_name("survey").to_python()
# [
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ]

workbook = Workbook('combined.xlsx')
worksheet = workbook.add_worksheet("survey")
for row_index, row_data in enumerate(xlsx_data):
    for column_index, column_data in enumerate(row_data):
        # Get the column letter from column_index
        column_letter = xxx
        worksheet.write(f'{column_letter}{row_index}', column_data)

spwoodcock commented 2 months ago

Possible issue with translations

Interesting issue I just encountered!

For the choice sheet we have a placeholder: task_filter 1 1 1 1 1

The 1 values are for list_name name and label But they are also required in our workflow for label::English(en) and the other 3 languages. If they are missing, then the value is displayed at a null value in ODK Collect!

I'm raising this because it may cause issues with merging! If our mandatory fields form has the fields: label::English(en) label::Spanish(es) etc But the user uploaded form doesn't have these fields, and just uses a default label field, then it looks like the value for specific languages will take priority: in this case they would be null / empty.

Possible solution

It's quite hard to automatically merge translations, as the user may use the label field with a language other than English.
We probably should address this complexity later for now, and assume the default label will be English, with other languages being in fields like label::Spanish(es).

Scenario 1:

The user uploads a form with label field only.
We remove our label::English(en) label::Spanish(es) etc fields from the mandatory fields form.
We move label::English(en) contents to the label field for consistency.
There is only one language for the form - all users will see this.

Scenario 2:

The user uploads a form with `label::Spanish(es) etc fields.
We try to match these fields with the mandatory field form fields.
We remove any translation fields from the mandatory field form that are not present in the user uploaded form.
Example:
- User uploads a form with label::Spanish(es) label::French(fr) fields
- We match and keep the label::Spanish(es) label::French(fr) fields form our mandatory fields form
- We remove the additional label::English(en) label::Swahili(sw) fields.

[!NOTE] In future we should translation our mandatory field questions into all available languages in ODK. Then we can handle merging for every possible field.

spwoodcock commented 2 months ago

Current mandatory fields in survey sheet: