Closed spwoodcock closed 1 month ago
This would make the XLSForms a whole lot more maintainable.
Currently if we have say 10 categories = 10 XLSForms, we need to keep the extra fields we use in FMTM in sync for all of them.
Ideally we have:
I just did a small assessment of package size between various XLS and XLSX parser libraries:
python-calamine
seems like the most sustainable choice (it was recently integrated into Pandas, so should have good community support). At the low level, it's actually using a compiled rust
binary, making it up to 5x faster than openpyxl and 1.5x faster than xlrd. It's also the smallest, adding only 12MB to the build!We should use python-calamine
for XLS and XLSX parsing.
I found that python-calamine
is only for reading xls
files but can't use it to manipulate the form fields. We would need another package, for example, openpyxl
to modify the form in order to inject fmtm mandatory fields.
Oh no! That's a pain 😅
I considered Pandas but it's a huge library (200MB). Also Polars is similar.
Openpyxl is ancient now and unmaintained - but the xlsx
spec probably hasn't changed and perhaps it's just feature complete.
It would mean we need to use it alongside xlrd to read xls
though!
Two other options:
If we ever want to remove pandas for any reason, we can consider calamine
again for writing, in parallel with XlsxWriter
for writing.
python-calamine
for reading. We get a list of lists for the XLSX data.We would need to iterate through the survey
and choices
sheets to combine our defaults with the custom uploaded XLSX.
from xlsxwriter import Workbook
from python_calamine import CalamineWorkbook
workbook = CalamineWorkbook.from_path("form.xlsx")
xlsx_data = workbook.get_sheet_by_name("survey").to_python()
# [
# ["1", "2", "3", "4", "5", "6", "7"],
# ["1", "2", "3", "4", "5", "6", "7"],
# ["1", "2", "3", "4", "5", "6", "7"],
# ]
workbook = Workbook('combined.xlsx')
worksheet = workbook.add_worksheet("survey")
for row_index, row_data in enumerate(xlsx_data):
for column_index, column_data in enumerate(row_data):
# Get the column letter from column_index
column_letter = xxx
worksheet.write(f'{column_letter}{row_index}', column_data)
Interesting issue I just encountered!
For the choice sheet we have a placeholder:
task_filter 1 1 1 1 1
The 1 values are for list_name
name
and label
But they are also required in our workflow for label::English(en)
and the other 3 languages.
If they are missing, then the value is displayed at a null value in ODK Collect!
I'm raising this because it may cause issues with merging!
If our mandatory fields form has the fields: label::English(en)
label::Spanish(es)
etc
But the user uploaded form doesn't have these fields, and just uses a default label
field, then it looks like the value for specific languages will take priority: in this case they would be null / empty.
label
field with a language other than English.label
will be English, with other languages being in fields like label::Spanish(es)
.Scenario 1:
label
field only.label::English(en)
label::Spanish(es)
etc fields from the mandatory fields form.label::English(en)
contents to the label
field for consistency.Scenario 2:
`label::Spanish(es)
etc fields.label::Spanish(es)
label::French(fr)
fieldslabel::Spanish(es)
label::French(fr)
fields form our mandatory fields formlabel::English(en)
label::Swahili(sw)
fields.[!NOTE] In future we should translation our mandatory field questions into all available languages in ODK. Then we can handle merging for every possible field.
Current mandatory fields in survey sheet:
Current mandatory fields in choices sheet:
Current mandatory fields in entities sheet:
For future reference, openpyxl
is still maintained, but is available here instead! https://foss.heptapod.net/openpyxl/openpyxl
openpyxl
can be used to read AND write XLSX...
My bad for missing that. Although it makes no difference now, as we are using higher level Pandas anyway.
This may be useful for the future though, if one day we remove our reliance on Pandas in osm-fieldwork.
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
Option 1: append the fields in the XLSForm, either via pyxlsx or possibly via Pandas, prior to XLS --> XML.
Option 2: use the current XML parsing / manipulation method to inject all required params, after XLS --> XML.
Additional context