open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Metadata input for Word template files #230

Closed brockfanning closed 3 years ago

brockfanning commented 3 years ago

This is intended to allow the input of metadata using this Word template: https://github.com/sdmx-sdgs/metadata

Use-case: We already have an input for metadata from SDMX, and it is possible for countries to use a pilot authoring tool to manually convert these Word templates into SDMX. This PR is a convenience input so that countries can skip that manual step.

However, presumably they still want SDMX output for this metadata. So we may also want to add an output for SDMX metadata (or add that feature to the existing SDMX output).

LucyGwilliamAdmin commented 3 years ago

@brockfanning this seems to work - metadata is being built from the docm file - is there anything in particular you want me to look at?

https://onsdigital.github.io/sdg-sdmx-data/en/meta/8-4-2.json

brockfanning commented 3 years ago

@LucyGwilliamAdmin The tricky things that always cause issues with Word documents are:

  1. Images
  2. Tables
  3. Footnotes
  4. Italics/bold/underline

So maybe confirming that those things are working?
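Before comparing output by eye, it can help to find which documents actually contain these features. The following is only a rough sketch, not part of this PR: the function name `summarize_word_features` is hypothetical, and it assumes the file is a standard Office Open XML package (a zip with `word/document.xml`, and images under `word/media/`). The counts are approximate, since formatting tags can also appear on paragraph marks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def summarize_word_features(docx_file):
    """Count the 'tricky' features in a .docx/.docm (path or file-like object).

    Hypothetical helper for picking test documents; it does not convert anything.
    """
    with zipfile.ZipFile(docx_file) as archive:
        names = archive.namelist()
        root = ET.fromstring(archive.read("word/document.xml"))

    def count(tag):
        # Count every element with this local name in the w: namespace.
        return sum(1 for _ in root.iter("{%s}%s" % (W_NS, tag)))

    return {
        # Embedded images are stored as separate parts under word/media/.
        "images": sum(1 for name in names if name.startswith("word/media/")),
        "tables": count("tbl"),
        # Footnotes are referenced from the body via w:footnoteReference.
        "footnote_refs": count("footnoteReference"),
        # Approximate: counts formatting tags, not distinct runs of text.
        "bold_tags": count("b"),
        "italic_tags": count("i"),
        "underline_tags": count("u"),
    }
```

Running this over each uploaded docm file would give a quick shortlist of which indicators are worth checking for each item in the list above.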

LucyGwilliamAdmin commented 3 years ago

Ah ok thanks - I will change the prose file so that I can view the fields on the platform

LucyGwilliamAdmin commented 3 years ago

@brockfanning I've uploaded all of those docm files to the repo but I'll just start by doing a comparison of 12.4.1 since that has a lot of what you mentioned:

I looked at 6.5.1 for a table and that comes through fine

Checked bold in 12.2.2 and it works fine

Should the font always be black, even if there's coloured font in the Word doc?

brockfanning commented 3 years ago

@LucyGwilliamAdmin Font color doesn't come through, no. That hasn't been implemented in the metadata authoring tool either.

I'm not sure why the bold/italics in "Basel Convention", etc, did not come through. Maybe it has a problem when it is both bold and italics at the same time? It could also have to do with the "Style" that is selected in Word.

LucyGwilliamAdmin commented 3 years ago

@brockfanning - thanks for confirming re. colour

I think out of those two, it might be the style that's selected in Word - I just checked, and that text uses a different style from the text that is showing correctly (bold-italic) on the platform
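To illustrate how a Word style could cause this, here is a minimal sketch. It is not the code in this PR or in the node.js library. It assumes a converter that only reads direct run formatting (`w:b` and `w:i` inside `w:rPr`) and ignores named styles (`w:rStyle`). The function name `paragraph_to_html` and the style name `Emphasis` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def _flag_on(run_props, tag):
    """True if a toggle such as <w:b/> is present and not switched off."""
    if run_props is None:
        return False
    element = run_props.find(W + tag)
    if element is None:
        return False
    # <w:b/> means on; <w:b w:val="false"/> (or "0"/"off") means off.
    return element.get(W + "val", "true") not in ("false", "0", "off")


def paragraph_to_html(paragraph_xml):
    """Convert one <w:p> element to HTML using direct run formatting only."""
    paragraph = ET.fromstring(paragraph_xml)
    parts = []
    for run in paragraph.iter(W + "r"):
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if not text:
            continue
        html = escape(text)
        props = run.find(W + "rPr")
        if _flag_on(props, "i"):
            html = "<em>%s</em>" % html
        if _flag_on(props, "b"):
            html = "<strong>%s</strong>" % html
        parts.append(html)
    return "<p>%s</p>" % "".join(parts)


# First run: bold and italic applied directly. Second run: formatting comes
# only from a (hypothetical) character style, which this sketch never resolves.
EXAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>Direct</w:t></w:r>'
    '<w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr>'
    '<w:t xml:space="preserve"> Basel Convention</w:t></w:r>'
    '</w:p>' % W_NS
)
print(paragraph_to_html(EXAMPLE))
# prints: <p><strong><em>Direct</em></strong> Basel Convention</p>
```

In this sketch the second run loses its formatting because the bold/italic lives in the style definition (in `word/styles.xml`) and not on the run itself, which would be consistent with the text appearing plain on the platform.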

brockfanning commented 3 years ago

@LucyGwilliamAdmin Ok, that sounds likely. Actually that brings up a good topic that should be discussed:

There is a node.js library that is being used by the pilot authoring tool to convert these Word documents into SDMX. That library has its own logic on whether/how it brings in all the stuff from the Word doc.

What I've done in this PR is try to replicate that part of the node.js library. In an ideal world, I wouldn't have replicated it -- I would have just used it. But unfortunately that library is in node.js and this one is Python. So I went with the approach of replicating it.

But this brings up an important question: should we improve on this logic? For example, if we see a way to bring in some bold/italics that are being missed, should we implement that here? I'd argue that we shouldn't -- because it would not be implemented in the authoring tool. And I think we mostly want this input to be as faithful as possible to what that authoring tool does.

Instead maybe we should treat this InputWordMeta like a third-party library, and if we want to improve it, we try to get sdg-metadata-convert improved first.

On the other hand, if sdg-metadata-convert has some behavior that this InputWordMeta does not, then we should definitely implement that here.

What do you think?

LucyGwilliamAdmin commented 3 years ago

@brockfanning I agree with your approach and think that we should only implement what's in the node.js library.

For this issue, I have just checked and the authoring tool also doesn't bring in those specific headings.

LucyGwilliamAdmin commented 3 years ago

@brockfanning is there anything else I should check on this one?

brockfanning commented 3 years ago

@LucyGwilliamAdmin Nothing that I can think of. I just resolved the conflicts.