pmaengineering / ppp

The Pretty PDF Printer--Converting ODK XlsForm Excel files into "paper questionnaires".
http://ppp.pma2020.org
MIT License

Doc/Docx native format file generation #2

Open joeflack4 opened 6 years ago

joeflack4 commented 6 years ago

Description

This would be a new feature, in that PPP could output .docx rather than just .doc. It would also be an improvement to the .doc output (should we choose to keep it), in that the generated .doc file would be in its native format, rather than an .html document with a .doc file extension.

Currently, a .docx file needs to be created manually. Meanwhile, the .doc file produced by PPP is buggy, since it is just plain HTML underneath. This can be seen by opening the file in Google Docs or OpenOffice. If opened in MS Word, it appears to be fine, and opening and re-saving it in Word converts it into a format that is compatible with OpenOffice and Google Docs. Ideally, this process should be streamlined so that the user does not have to do it manually.

Possible solutions

1) Lowriter for HTML to .doc/.docx conversion

Example: lowriter --headless --convert-to docx ~/file.html

2) Custom implementation of .doc/.docx generator using OOXML

Although the older binary formats (.doc, .xls, and .ppt) continue to be supported by Microsoft, OOXML is now the default format of all Microsoft Office documents (.docx, .xlsx, and .pptx). http://officeopenxml.com/ https://en.wikipedia.org/wiki/Office_Open_XML

3) Use workflow automation tools

https://blog.testproject.io/2016/12/22/open-source-test-automation-tools-for-desktop-applications/
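
For option 2, a minimal sketch of what a hand-rolled OOXML generator involves, assuming plain paragraphs only. A .docx file is a ZIP package containing a content-types part, a package relationships part, and the main document part. The function name `build_docx` is hypothetical and not part of PPP; real questionnaires would also need tables, styles, and numbering, which this sketch does not cover.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Declares the MIME types of the parts stored in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package at its main document part.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(paragraphs):
    """Return the bytes of a .docx package with one plain paragraph per string.

    Hypothetical helper for illustration only.
    """
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">{}</w:t></w:r></w:p>'.format(escape(text))
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(W_NS, body)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Writing the result of `build_docx(["Question 1", "Question 2"])` to a file with a .docx extension should give a package with the expected layout, though it has not been checked here against every office suite.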

tulvit commented 6 years ago

How to reproduce the "bug"*

*"Bug" is in quotation marks because it is not really a bug, just a difference in how well different office engines convert a web page into a document.

Why does this happen? It seems that only MS Office's engine is capable of parsing this HTML in the desired way.

So, this test.doc file becomes a valid document file only when MS Office repairs/parses it. In other words, the first time the file is opened, MS Office doesn't "open" it, but "repairs/parses/converts/whatever".

And without this "re-saving" procedure, all we have is just an html file manually renamed to a document file.

It's not a problem if the end user uses MS Office. It will be a problem if anybody uses any document editor/viewer other than MS Office, including Google Docs.
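
One way to confirm that a generated file is HTML with a .doc extension is to look at its first bytes. The sketch below assumes the well-known signatures for the binary .doc container (OLE2 compound file) and for ZIP-based packages such as .docx; the function name `sniff_document_kind` and its return labels are hypothetical.

```python
# Signature of the OLE2 compound file container used by binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header (.docx and .odt are ZIP packages).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_document_kind(data):
    """Classify raw file bytes as 'doc-binary', 'zip-container', 'html', or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "doc-binary"
    if data.startswith(ZIP_MAGIC):
        return "zip-container"
    # Skip an optional UTF-8 byte order mark and leading whitespace,
    # then look for an HTML marker near the top of the file.
    head = data[:1024].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "unknown"
```

For example, `sniff_document_kind(open("test.doc", "rb").read())` would be expected to report `"html"` for a renamed HTML file and `"doc-binary"` after a proper "Save as" in Word.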

joeflack4 commented 6 years ago

Thanks for all of the very specific details. This is useful.

Your MS Word looks the same as mine. On my computer at least, changing file extension to .doc and then opening in MS Word and hitting save, and close, did not end up changing the file size. I did not test in Open Office or Google Docs yet, though.

I'm using OSX High Sierra 10.13.3, MS Word 15.26, from 2016.

tulvit commented 6 years ago

@joeflack4

On my computer at least, changing file extension to .doc and then opening in MS Word and hitting save, and close, did not end up changing the file size.

My assumption (of which I'm 99.99% certain, though) is that you didn't save the file at all. You open the file, do nothing, click "Save", and nothing happens, though it may seem as if it was saved. There were no changes, so there was nothing to save. In LibreOffice, for example, the "Save" button is inactive in this case, so it's kind of strange that the "Save" button is clickable in MS Office.

What I mean... Please take a look at the modification date of the file you are about to open. Say, it'll be "2018 June 12:25pm". Then open and "save" it, and check the date again: it'll be the same "12:25pm", so nothing was saved, and it's the same old file.

How to save it without "saving as"? For example, add and delete some character, a space/full stop/etc. So there will be some "change" in a document. Then press "save".

And voilà - file's date is changed, as well as the size (so the file was actually saved).
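
The same check can be scripted instead of reading the date by eye. This is a minimal sketch, assuming that a real save changes the file's modification time or size; the helper names `file_fingerprint` and `was_rewritten` are hypothetical.

```python
import os


def file_fingerprint(path):
    """Return (modification time in nanoseconds, size in bytes) for a file."""
    info = os.stat(path)
    return (info.st_mtime_ns, info.st_size)


def was_rewritten(before, after):
    """True if the two fingerprints differ, i.e. the file was actually written."""
    return before != after
```

Take `before = file_fingerprint("test.doc")` before opening the file in the office suite, "save" it, and then compare with `was_rewritten(before, file_fingerprint("test.doc"))`.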

*I was struggling with this "saving but not saving" MS Office behaviour yesterday as well, but quickly noticed unchanged date of a "saved" file.

Ah, and one quite important thing to mention. After renaming test.html to test.doc and opening it in MS Office, MS Office will think that it's a "web document". After adding/deleting a space character (or making any other modification, so that the file will actually be saved) and clicking "save", it'll be successfully saved. But with a "web page" type! So, if we then open this saved file in Google Docs, it'll be rendered like this:

image

So, in order to produce a repaired, working doc file, it must not only be opened in MS Office, but also "Saved as" with a document type specified.

tulvit commented 6 years ago

Possible ways to handle it

Disclaimer: these are just ideas/suggestions, as I have little to no experience in this field.

1. Do nothing. If end users always open the provided file in MS Office, then there will be no problem at all. (Probably, some new versions of MS Office may render html in a different way, but let's assume it'll never happen.)

2. Manually open each and every generated file in MS Office, re-save it, and only then send it to the end user. May work only if there are only a few users. For 10-100 users a day it'll take a whole day; with 100+ users it'll be just impossible.

3. Automation of the previous re-saving procedure. Just the same as 2, but with scripts, not hands. I. e. after the html file is generated, it'll be opened in MS Office and then re-saved automatically, via Windows API or some tool/software available.

4. Rebuilding the HTML output in such a way that any office software will render it in the desired way. Making it much simpler, and trying to fix particular bugs (the main problem with OpenOffice is rendering forms, and with Google Docs it is nested table cells). Makes sense, but not a good option either: fixing bug after bug, and still leaving out all other software (Polaris Office, WPS Office, dozens of them).

5. Creating .doc the right way. Not via .html -> .doc, but generating the doc file directly. There are already quite a few such libraries (python-docx, PHPOffice), but the last time I checked, all of them offered only basic operations like creating a header and adding paragraphs, which will not suit our needs.

6. Using an open text format instead of .doc (ODT, I assume). I haven't investigated it so far, but I think it'll give more options to generate files via an API, and there should already be a lot of open-source solutions. And .odt files should work just fine in all office suites. So it's basically the same as 5, but with .odt instead of .doc (and it allows us to leave MS Office out of the picture).

7. .odt to .doc. Basically the same as 6, but extended a little. As soon as we have a valid .odt file, I believe it will be feasible to convert it to the *.doc format the right way. If I'm not wrong, OpenOffice offers a CLI application as well, so it should be pretty straightforward.

8. Something else... I guess, there are still some other options, and probably much better ones.

tulvit commented 6 years ago

UPDATE

Current thoughts after a little bit of investigation.

So, the final goal is to produce a valid doc file automatically.

Right now we have an .html file renamed to a .doc file (so, actually, it's just a broken doc file), and only Word can repair it well. Not a good workflow.

And back to OpenOffice. Let's assume it'll be possible to edit the html code in such a way that Open/Libre Office will be able to parse it. Actually, I've already tested it, and there is a good chance of doing it. There are a lot of problems, though (OpenOffice ignores some HTML rules, like CSS styles for tables, doesn't ignore HTML comments, and so on and so forth).
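
As an illustration of one such fix, the sketch below removes HTML comments before the file is handed to OpenOffice. It is a regular-expression approximation and not a full HTML parser, so it would also strip comment-like text inside script or style blocks, as well as MS Office conditional comments. The function name `strip_html_comments` is hypothetical, and whether this alone is enough for the PPP output is an assumption.

```python
import re

# Non-greedy match so each comment is removed separately;
# DOTALL lets a comment span multiple lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Return the HTML text with all <!-- ... --> comments removed."""
    return COMMENT_RE.sub("", html)
```

The CSS table styles would need a separate step, for example moving the styling into plain HTML attributes, which is not shown here.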

But again, let's assume edited/modified html file will be rendered in OpenOffice well, then what?

Then it'll be possible to produce a valid doc or docx file in just a single command:

lowriter --headless --convert-to docx ~/file.html

So it will not be just a renamed html file anymore, but a valid doc file, which may be opened in any software (OpenOffice, MS Word, GoogleDocs, you name it...) without any problems. And the issue will be solved.
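
A minimal sketch of how the command above could be called from Python after the HTML is generated, assuming `lowriter` is installed and on the PATH. The function names are hypothetical and not part of PPP. The `--outdir` option is added so that the output location is explicit, and the expected output name assumes a plain extension such as `docx` is passed as the target format.

```python
import subprocess
from pathlib import Path


def build_convert_command(html_path, target_format="docx", outdir=None, binary="lowriter"):
    """Build the argument list for a headless LibreOffice conversion."""
    source = Path(html_path)
    destination = Path(outdir) if outdir is not None else source.parent
    return [
        binary, "--headless",
        "--convert-to", target_format,
        "--outdir", str(destination),
        str(source),
    ]


def expected_output_path(html_path, target_format="docx", outdir=None):
    """Path where the converted file is expected: same stem, new extension."""
    source = Path(html_path)
    destination = Path(outdir) if outdir is not None else source.parent
    return destination / (source.stem + "." + target_format)


def convert(html_path, target_format="docx", outdir=None, binary="lowriter"):
    """Run the conversion and return the expected output path.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    command = build_convert_command(html_path, target_format, outdir, binary)
    subprocess.run(command, check=True)
    return expected_output_path(html_path, target_format, outdir)
```

Building the command separately from running it keeps the argument list easy to inspect and test on machines where LibreOffice is not installed.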

At this moment I consider this route the main (or even the only possible) option.

joeflack4 commented 6 years ago

@tulvit I am in agreement. I would like to try this.

The only possible disadvantage of this route that I can see would arise if we later try to implement a special feature where we allow a user to edit the Word document, save it, and then run a special command to merge their changes back into the original XlsForm Excel file. I feel like this would be much easier to implement if the underlying data structure of the document were HTML.

However, as there are no plans presently to implement that feature, let's not worry about it. I give you permission to proceed with your strategy.