stencila convert not exiting (Linux)

de-code commented 3 years ago

Hi,

I am not sure whether I misunderstood how it is meant to work.

I downloaded the latest release, v0.34.3.

When I run stencila convert to convert a Word document to PDF, it does convert the document but doesn't seem to exit.

e.g.:

./stencila convert test.docx test.pdf

I also tried that with a minimal document.

nokome commented 3 years ago

Hi @de-code, thanks for trying this out and taking the time to report this issue. Sorry for the inconvenience.

I have moved this issue here because the stencila CLI (a) is currently getting a revamp, (b) has a convert subcommand that is mostly driven by encoda (this repo), and (c) is using an old version of encoda.

In this repo, on master, I tested docx to pdf conversion using one of our test fixtures and didn't have any problems (no hanging, temp.pdf was produced):

./encoda convert src/__tests__/issues/668-CfRadialDoc-v2.1-20190901.docx temp.pdf

I would be grateful if you could give it another try using encoda directly - either by cloning this repo, running npm install and then the above command, or by doing npm install --global encoda and just using encoda ... (i.e. without the ./ prefix).

de-code commented 3 years ago

Hi @nokome thank you for looking into the issue.

> I have moved this issue here because the stencila CLI (a) is currently getting a revamp, (b) has a convert subcommand that is mostly driven by encoda (this repo), and (c) is using an old version of encoda.

Okay, I wasn't sure where to raise it. I then went for the stencila repo because that is what I tested and I am not sure what's in the middle.

> In this repo, on master, I tested docx to pdf conversion using one of our test fixtures and didn't have any problems (no hanging, temp.pdf was produced):
>
> ./encoda convert src/__tests__/issues/668-CfRadialDoc-v2.1-20190901.docx temp.pdf
>
> I would be grateful if you could give it another try using encoda directly - either by cloning this repo, running npm install and then the above command, or by doing npm install --global encoda and just using encoda ... (i.e. without the ./ prefix).

I have now run that from master.

It does now exit as expected, but I am getting a different behaviour / errors:

rendering different from original

For the minimal example, it does actually render it differently. e.g. it made the first line bigger and bold. (That didn't happen with the stencila convert command).

expected

![image](https://user-images.githubusercontent.com/1016473/104736081-57577080-573a-11eb-9367-785993e79736.png) [minimal-office-open.docx.gz](https://github.com/stencila/encoda/files/5820920/minimal-office-open.docx.gz)

actual

![image](https://user-images.githubusercontent.com/1016473/104736176-7ce47a00-573a-11eb-9fc9-968fd2d9cdf6.png)

documents fail to convert

Other real submission documents fail to convert (tested two).

⚠ WARN  encoda No codec could be found for source "/path/to/document.docx". Falling back to plain text codec.
🚨 ERROR encoda dom.setAttribute is not a function

(unfortunately I can't share those documents)

EDIT: I just tried stencila convert again, and it is failing for those documents too. (pandoc via the pandoc/latex:2.11.3.2 Docker image works fine for those documents, albeit with some Missing character warnings)

command is slow to run

Running ./encoda convert takes 18+ seconds. I am not sure whether the command is still compiling something or where the overhead is coming from.

Using stencila convert the document would be created after around 5+ seconds (but then didn't exit).

Using pandoc directly took a similar amount of time. (Libre Office seems to be significantly faster)

I have a question that I should maybe raise separately... is there any advantage of using encoda for Word (docx etc) to PDF conversion, compared to using pandoc?

nokome commented 3 years ago

Hi @de-code, sorry for the slow response

> Okay, I wasn't sure where to raise it. I then went for the stencila repo because that is what I tested and I am not sure what's in the middle.

That's fine and what we encourage - it's easy enough for us to move issues to the relevant repo once we have been able to identify where the issue resides

rendering different from original

This is intentional. There are four functions involved in convert: decode (from docx to JS object) -> coerce (coerce the JS object if necessary so that it validates against https://schema.stenci.la) -> reshape (transform the JS object using "semantic inference", e.g. of bolded paragraphs as figure captions) -> encode (from JS object to PDF). It is the new(ish) reshape function that is probably causing the difference. In the example above, reshape is saying "there is no title defined for this article, so I will make the first one the title".

There are options to turn off both coerce and reshape but they are not yet enabled from the CLI: https://github.com/stencila/encoda/blob/master/src/codecs/types.ts#L97-L114
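To make the four stages concrete, here is a rough toy sketch in TypeScript. It is not Encoda's actual code: the `Article` and `ConvertOptions` types, the `shouldCoerce` / `shouldReshape` option names, and every function body are assumptions invented for illustration, and `encode` emits Markdown-like text as a stand-in for PDF.

```typescript
// Toy model of the decode -> coerce -> reshape -> encode pipeline.
// All names and behaviours are hypothetical, not Encoda's real API.

interface Paragraph {
  type: 'Paragraph'
  text: string
}

interface Article {
  type: 'Article'
  title?: string
  content: Paragraph[]
}

interface ConvertOptions {
  shouldCoerce?: boolean // assumed option name
  shouldReshape?: boolean // assumed option name
}

// decode: source text -> loose in-memory object (one paragraph per non-empty line)
function decode(source: string): Partial<Article> {
  const content = source
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line): Paragraph => ({ type: 'Paragraph', text: line.trim() }))
  return { content }
}

// coerce: fill in missing pieces so the object is a well-formed Article
function coerce(node: Partial<Article>): Article {
  const article: Article = { type: 'Article', content: node.content ?? [] }
  if (node.title !== undefined) article.title = node.title
  return article
}

// reshape: if no title is defined, promote the first paragraph to the title
function reshape(article: Article): Article {
  const [first, ...rest] = article.content
  if (article.title !== undefined || first === undefined) return article
  return { type: 'Article', title: first.text, content: rest }
}

// encode: in-memory object -> output text (stand-in for PDF rendering)
function encode(article: Article): string {
  const heading = article.title !== undefined ? [`# ${article.title}`] : []
  return [...heading, ...article.content.map((p) => p.text)].join('\n\n')
}

function convert(source: string, options: ConvertOptions = {}): string {
  const { shouldCoerce = true, shouldReshape = true } = options
  const decoded = decode(source)
  // Skipping coerce means trusting the decoder's output as-is
  const coerced = shouldCoerce ? coerce(decoded) : (decoded as Article)
  const reshaped = shouldReshape ? reshape(coerced) : coerced
  return encode(reshaped)
}
```

In this toy, `convert('Hello world\nBody text')` promotes the first line to a title, which mirrors the bigger, bold first line in the screenshot above, while passing `{ shouldReshape: false }` leaves the paragraphs untouched.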

documents fail to convert

It would be great to try and isolate what is failing there but without the documents that's obviously hard. Perhaps you could try adding the --debug flag?

command is running slow

Yes, the ./encoda bash script is running ts-node (i.e. compiling TypeScript on the fly). For a much faster CLI experience you can build the JavaScript using npm run build and then use node dist/cli.js convert ....

advantages?

> is there any advantage of using encoda for Word (docx etc) to PDF conversion, compared to using pandoc?

There are three "features" of Encoda which may be useful as compared to using Pandoc alone:

Reshaping

As mentioned earlier, Encoda can do some semantic inference of the document to "reshape" it into something that is closer to a structured scholarly article (à la JATS). See https://github.com/stencila/encoda/blob/master/src/util/reshape.ts for details.

Theming

There are several themes available including those which closely match journal themes e.g. --theme elife. (Unfortunately there seems to have been a change to the eLife theme that means it may not work locally like that :()

Reproducibility

Encoda uses Puppeteer to generate the PDF (with the CSS based styles) but then encodes an XML representation of the original document (actually the coerced and reshaped document) into the PDF's metadata. This allows you to go back to the original format, or to a different format, from the PDF e.g.

node dist/cli.js convert src/codecs/ipynb/__fixtures__/well-switching.ipynb well-switching.docx
node dist/cli.js convert well-switching.docx well-switching.pdf --theme nature
node dist/cli.js convert well-switching.pdf well-switching.md
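The idea behind that round trip can be illustrated with a toy model. This is only a sketch under loud assumptions: the "PDF" is a plain object, the embedded payload is JSON rather than the XML described above, and `ToyPdf`, `Doc`, `encodePdf` and `decodePdf` are hypothetical names, not Encoda's API.

```typescript
// Toy model only: shows why embedding the source document makes a
// lossy rendering reversible. Not how Encoda reads or writes real PDFs.

interface Doc {
  title: string
  paragraphs: string[]
}

interface ToyPdf {
  renderedPages: string // stand-in for the themed, visual output
  metadata: { embeddedDocument?: string }
}

// Render the document (lossily) and also embed a serialized copy of it
function encodePdf(doc: Doc): ToyPdf {
  return {
    renderedPages: [doc.title.toUpperCase(), ...doc.paragraphs].join('\n'),
    metadata: { embeddedDocument: JSON.stringify(doc) },
  }
}

// Recover the document from the embedded copy, not from the rendering
function decodePdf(pdf: ToyPdf): Doc {
  const embedded = pdf.metadata.embeddedDocument
  if (embedded === undefined) {
    throw new Error('No embedded document found in PDF metadata')
  }
  return JSON.parse(embedded) as Doc
}
```

The rendered pages lose information (here, the title's original casing), but decoding from the embedded copy returns the document exactly as it was encoded.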

Hope that helps. If you think that Encoda might be useful for your use case but there are particular blockers like the above let us know and we'll try to remedy them.

de-code commented 3 years ago

Hi @nokome thank you for your very informative response (and apologies for the delay).

command is running slow

I can confirm that after using node dist/cli.js convert the time went down to what it was using the stencila command.

documents fail to convert

It looks like this turned out to be a "user error", coupled with a perhaps less informative error message. For whatever reason I used ~ in the path for those documents (as a placeholder for the home directory), which isn't getting expanded. But rather than complaining about the file not being found, it mentions the codec issue (from my earlier error message). So that might be a quick thing that could be improved.

For the stencila command, the ~ got expanded.
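A possible shape for that improvement, sketched in TypeScript: expand a leading ~ and check that the file exists before any codec lookup, so that a missing file is reported as such. The helpers `expandTilde` and `resolveInput` are hypothetical and not part of Encoda or the stencila CLI; the home directory and the existence check are passed in as parameters to keep the sketch self-contained.

```typescript
// Hypothetical helpers; not part of Encoda or the stencila CLI.

// Expand a leading "~" or "~/..." to the given home directory.
// Other uses of "~" (e.g. "~user/...") are left unchanged.
function expandTilde(path: string, homeDir: string): string {
  if (path === '~') return homeDir
  if (path.startsWith('~/')) return homeDir + path.slice(1)
  return path
}

// Resolve an input path, failing early with a "file not found" message
// instead of letting a later stage report a confusing codec error.
function resolveInput(
  path: string,
  homeDir: string,
  exists: (p: string) => boolean
): string {
  const expanded = expandTilde(path, homeDir)
  if (!exists(expanded)) {
    throw new Error(`Input file not found: ${expanded}`)
  }
  return expanded
}
```

In a Node.js program, `os.homedir()` and `fs.existsSync` would be natural arguments for `homeDir` and `exists`.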

I will close the issue. Let me know if you would like me to raise separate issues for the error message and ~ expansion.

For now I have decided to stick with LibreOffice for my narrow use-case of converting Word files to PDF, mainly because it a) is faster (as it is just rendering), b) better preserves the original document and c) supports more Word-like document formats. I understand that isn't the main purpose of encoda anyway. But your explanations helped me better understand the tool and it will likely become useful in the future.

nokome commented 3 years ago

Thanks @de-code

> coupled with a perhaps less informative error message

I agree, that needs to be improved. If you could raise another issue for that it would be appreciated.
