hotosm / fmtm

Field Mapping Tasking Manager - coordinated field mapping.
https://fmtm.hotosm.org/
GNU Affero General Public License v3.0

Injecting mandatory FMTM fields into custom XLSForms #1722

Closed spwoodcock closed 1 month ago

spwoodcock commented 3 months ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Additional context

[!Note] See comment below. We should not use Pandas, but instead python-calamine.

spwoodcock commented 3 months ago

This would make the XLSForms a whole lot more maintainable.

Currently, if we have, say, 10 categories (= 10 XLSForms), we need to keep the extra fields we use in FMTM in sync across all of them.

Ideally we have:

spwoodcock commented 3 months ago

I just did a small assessment of package size between various XLS and XLSX parser libraries:

image

We should use python-calamine for XLS and XLSX parsing.

Sujanadh commented 2 months ago

I found that python-calamine can only read XLS/XLSX files; it can't be used to modify the form fields. We would need another package, for example openpyxl, to modify the form and inject the FMTM mandatory fields.

spwoodcock commented 2 months ago

Oh no! That's a pain 😅

I considered Pandas, but it's a huge library (200MB). Polars is similarly large.

Openpyxl is ancient now and unmaintained - but the xlsx spec probably hasn't changed and perhaps it's just feature complete.

It would mean we need to use it alongside xlrd to read xls though!

Two other options:

spwoodcock commented 2 months ago

Acknowledgement I have been a bit silly

If we ever need to revisit this

If we ever want to remove pandas for any reason, we can consider calamine again for reading, in parallel with XlsxWriter for writing.

We would need to iterate through the survey and choices sheets to combine our defaults with the custom uploaded XLSX.

from python_calamine import CalamineWorkbook
from xlsxwriter import Workbook

# Read every row of the "survey" sheet as a list of lists
source_workbook = CalamineWorkbook.from_path("form.xlsx")
xlsx_data = source_workbook.get_sheet_by_name("survey").to_python()
# [
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ]

# Write the rows out to a new workbook
combined_workbook = Workbook("combined.xlsx")
worksheet = combined_workbook.add_worksheet("survey")
for row_index, row_data in enumerate(xlsx_data):
    for column_index, column_data in enumerate(row_data):
        # XlsxWriter accepts zero-indexed (row, col), so no column letter is needed
        worksheet.write(row_index, column_index, column_data)

# XlsxWriter only writes the file to disk when the workbook is closed
combined_workbook.close()
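
The step of combining our defaults with the uploaded form can be sketched without any spreadsheet library, since `to_python()` returns plain lists of rows. This is only a minimal sketch, not what osm-fieldwork does: `merge_sheet_rows` is a hypothetical helper, and it assumes that each sheet has its header as the first row and that the mandatory rows should come before the user's rows.

```python
def merge_sheet_rows(mandatory_rows, custom_rows):
    """Combine two sheets given as lists of rows, where row 0 is the header.

    Hypothetical helper: assumes both inputs contain at least a header row.
    """
    mandatory_header, *mandatory_body = mandatory_rows
    custom_header, *custom_body = custom_rows

    # Union of column names: mandatory columns first, then any new custom ones
    combined_header = list(mandatory_header)
    for name in custom_header:
        if name not in combined_header:
            combined_header.append(name)

    def realign(header, row):
        # Map each cell to its column name, then lay it out in combined order,
        # leaving an empty string where the source sheet lacks that column
        cells = dict(zip(header, row))
        return [cells.get(name, "") for name in combined_header]

    merged = [combined_header]
    merged += [realign(mandatory_header, row) for row in mandatory_body]
    merged += [realign(custom_header, row) for row in custom_body]
    return merged


# Illustrative data only; these are not the real FMTM mandatory fields
mandatory = [["type", "name", "label"], ["start", "start", ""]]
custom = [["type", "name", "hint"], ["text", "building_name", "Enter name"]]
merged = merge_sheet_rows(mandatory, custom)
```

The resulting list of rows can then be written out cell by cell with `worksheet.write(row, col, value)`, once per sheet (survey, choices).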
spwoodcock commented 2 months ago

Possible issue with translations

Interesting issue I just encountered!

For the choice sheet we have a placeholder: task_filter 1 1 1 1 1

The 1 values are for list_name, name and label. But they are also required in our workflow for label::English(en) and the other 3 languages. If they are missing, then the value is displayed as a null value in ODK Collect!

I'm raising this because it may cause issues with merging! Say our mandatory fields form has the fields label::English(en), label::Spanish(es), etc., but the user-uploaded form doesn't have these fields and just uses a default label field. Then it looks like the value for the specific languages will take priority: in this case they would be null / empty.
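
To make the mismatch concrete, here is a rough sketch of how it could be detected from the header rows before merging. `missing_translated_labels` is a hypothetical helper, not an existing FMTM or osm-fieldwork function, and it assumes translated columns always start with `label::`. It only detects the problem and does not decide how to resolve it.

```python
def missing_translated_labels(mandatory_header, custom_header):
    """List the label::<language> columns that the mandatory form defines
    but the user-uploaded form does not (hypothetical helper)."""
    custom_columns = set(custom_header)
    return [
        name
        for name in mandatory_header
        if name.startswith("label::") and name not in custom_columns
    ]


# Header rows as they might appear in the choices sheet (illustrative only)
mandatory_header = [
    "list_name", "name", "label", "label::English(en)", "label::Spanish(es)",
]
custom_header = ["list_name", "name", "label"]

missing = missing_translated_labels(mandatory_header, custom_header)
# A non-empty result combined with a plain "label" column in the custom form
# is the case described above: the merged sheet would gain language-specific
# columns that are empty for every user-supplied row.
at_risk = bool(missing) and "label" in custom_header
```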

Possible solution

Scenario 1:

Scenario 2:

[!NOTE] In future we should translate our mandatory field questions into all available languages in ODK. Then we can handle merging for every possible field.

spwoodcock commented 2 months ago

Current mandatory fields in survey sheet:

image

image

image

Current mandatory fields in choices sheet:

image

Current mandatory fields in entities sheet:

image

spwoodcock commented 2 months ago

For future reference, openpyxl is still maintained; it is just hosted here instead: https://foss.heptapod.net/openpyxl/openpyxl

openpyxl can be used to read AND write XLSX...

My bad for missing that. Although it makes no difference now, as we are using higher level Pandas anyway.

This may be useful for the future though, if one day we remove our reliance on Pandas in osm-fieldwork.