de-code closed this issue 3 years ago
Hi @de-code, thanks for trying this out and taking the time to report this issue. Sorry for the inconvenience.
I have moved this here because the `stencila` CLI (a) is currently getting a revamp and (b) its `convert` subcommand is mostly driven by `encoda` (this repo), and (c) is using an old version of `encoda`.
In this repo, on `master`, I tested `docx` to `pdf` conversion using one of our test fixtures and didn't have any problems (no hanging, `temp.pdf` was produced):
./encoda convert src/__tests__/issues/668-CfRadialDoc-v2.1-20190901.docx temp.pdf
I would be grateful if you could give it another try using `encoda` directly, either by cloning this repo, running `npm install` and then the above command, or by doing `npm install --global encoda` and just using `encoda ...` (i.e. without the `./` prefix).
Hi @nokome thank you for looking into the issue.
> I have moved this here because the `stencila` CLI (a) is currently getting a revamp and (b) its `convert` subcommand is mostly driven by `encoda` (this repo), and (c) is using an old version of `encoda`.
Okay, I wasn't sure where to raise it. I then went for the `stencila` repo because that is what I tested, and I am not sure what's in the middle.
> In this repo, on `master`, I tested `docx` to `pdf` conversion using one of our test fixtures and didn't have any problems (no hanging, `temp.pdf` was produced):
>
> ./encoda convert src/__tests__/issues/668-CfRadialDoc-v2.1-20190901.docx temp.pdf
>
> I would be grateful if you could give it another try using `encoda` directly, either by cloning this repo, running `npm install` and then the above command, or by doing `npm install --global encoda` and just using `encoda ...` (i.e. without the `./` prefix).
I have now run that from `master`.
It does now exit as expected, but I am getting a different behaviour / errors:
For the minimal example, it does actually render it differently, e.g. it made the first line bigger and bold. (That didn't happen with the `stencila convert` command.)
Other real submission documents fail to convert (tested two).
⚠ WARN encoda No codec could be found for source "/path/to/document.docx". Falling back to plain text codec.
🚨 ERROR encoda dom.setAttribute is not a function
(unfortunately I can't share those documents)
EDIT: I just tried `stencila convert` again, and it is failing for those documents too. (`pandoc` via the `pandoc/latex:2.11.3.2` Docker image works fine for those documents, albeit with some `Missing character` warnings.)
Running `./encoda convert` takes 18+ seconds. Not sure if the command is still compiling something or where the overhead is coming from.
Using `stencila convert`, the document would be created after 5+ seconds (but then the command didn't exit).
Using `pandoc` directly took a similar amount of time. (LibreOffice seems to be significantly faster.)
I have a question that I should maybe raise separately: is there any advantage to using `encoda` for Word (docx etc.) to PDF conversion, compared to using `pandoc`?
Hi @de-code, sorry for the slow response
> Okay, I wasn't sure where to raise it. I then went for the stencila repo because that is what I tested and I am not sure what's in the middle.

That's fine and what we encourage. It's easy enough for us to move issues to the relevant repo once we have been able to identify where the issue resides.
The different rendering is intentional. There are four functions involved in `convert`: `decode` (from docx to a JS object) -> `coerce` (coerce the JS object if necessary so that it validates against https://schema.stenci.la) -> `reshape` (transform the JS object using "semantic inference", e.g. of bolded paragraphs as figure captions) -> `encode` (from JS object to PDF). It is the new(ish) `reshape` function that is probably causing the difference. In the example above, `reshape` is saying "there is no title defined for this article, so I will make the first one the title".
There are options to turn off both `coerce` and `reshape`, but they are not yet enabled from the CLI: https://github.com/stencila/encoda/blob/master/src/codecs/types.ts#L97-L114
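To make the four stages concrete, here is a minimal toy sketch of such a pipeline. It is not Encoda's actual code: the `Doc` type, the stage bodies, and the `coerce` / `reshape` option names are all assumptions made up for illustration, and the real `reshape` does far more than promote a title.

```typescript
// Toy document model (hypothetical; Encoda uses Stencila Schema nodes).
interface Doc {
  type?: string
  title?: string
  content: string[]
}

// Hypothetical option names; the real options live in src/codecs/types.ts.
interface ConvertOptions {
  coerce?: boolean
  reshape?: boolean
}

// decode: parse the source into a JS object (here, one paragraph per non-empty line).
function decode(source: string): Doc {
  return { content: source.split('\n').filter((line) => line.trim() !== '') }
}

// coerce: fill in whatever is needed for the object to validate (here, just a type).
function coerce(doc: Doc): Doc {
  return { ...doc, type: doc.type ?? 'Article' }
}

// reshape: infer structure; this toy version only promotes the first
// paragraph to the title when no title is defined.
function reshape(doc: Doc): Doc {
  if (doc.title !== undefined || doc.content.length === 0) return doc
  const [first, ...rest] = doc.content
  return { ...doc, title: first, content: rest }
}

// encode: serialise to the target format (here, plain text with a heading line).
function encode(doc: Doc): string {
  const heading = doc.title !== undefined ? [`# ${doc.title}`] : []
  return [...heading, ...doc.content].join('\n')
}

// convert: run the stages in order; coerce and reshape can be switched off.
function convert(source: string, options: ConvertOptions = {}): string {
  let doc = decode(source)
  if (options.coerce ?? true) doc = coerce(doc)
  if (options.reshape ?? true) doc = reshape(doc)
  return encode(doc)
}
```

With the defaults, `convert('Intro\nBody text')` promotes the first line to a heading, which mirrors the "first line bigger and bold" effect reported above; passing `{ reshape: false }` leaves the text as decoded.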
It would be great to try and isolate what is failing there, but without the documents that's obviously hard. Perhaps you could try adding the `--debug` flag?
Yes, the `./encoda` bash script is running `ts-node` (i.e. compiling TypeScript on the fly). For a much faster CLI experience you can build the JavaScript using `npm run build` and then use `node dist/cli.js convert ...`.
> is there any advantage of using encoda for Word (docx etc) to PDF conversion, compared to using pandoc?
There are three "features" of Encoda which may be useful as compared to using Pandoc alone:
1. As mentioned earlier, Encoda can do some semantic inference of the document to "reshape" it into something that is closer to a structured scholarly article (à la JATS). See https://github.com/stencila/encoda/blob/master/src/util/reshape.ts for details.
2. There are several themes available, including those which closely match journal themes, e.g. `--theme elife`. (Unfortunately there seems to have been a change to the eLife theme that means it may not work locally like that :()
3. Encoda uses Puppeteer to generate the PDF (with the CSS-based styles) but then encodes an XML representation of the original document (actually the coerced and reshaped document) into the PDF's metadata. This allows you to go back from the PDF to the original format, or to a different format, e.g.
node dist/cli.js convert src/codecs/ipynb/__fixtures__/well-switching.ipynb well-switching.docx
node dist/cli.js convert well-switching.docx well-switching.pdf --theme nature
node dist/cli.js convert well-switching.pdf well-switching.md
Hope that helps. If you think that Encoda might be useful for your use case but there are particular blockers like the above let us know and we'll try to remedy them.
Hi @nokome thank you for your very informative response (and apologies for the delay).
I can confirm that after using `node dist/cli.js convert` the time went down to what it was using the `stencila` command.
Looks like this turns out to have been a "user error", coupled with a perhaps less informative error message.
For whatever reason I used `~` in the path for those documents (as a placeholder for the home directory), which isn't getting expanded.
But rather than complaining about the file not being found, it mentions the codec issue (from my earlier error message).
So that might be a quick thing that could be improved.
For the `stencila` command, the `~` got expanded.
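As a rough illustration of the improvement suggested here, the sketch below expands a leading `~` and reports a missing input file before any codec lookup would happen. The helper names `expandTilde` and `resolveInput` are hypothetical and not part of Encoda; the home directory and the existence check are passed in as parameters so the logic stays self-contained.

```typescript
// Hypothetical helper: expand a leading "~" or "~/" to the given home directory.
// Forms such as "~otheruser/..." are deliberately left untouched.
function expandTilde(filePath: string, homeDir: string): string {
  if (filePath === '~') return homeDir
  if (filePath.startsWith('~/')) return homeDir + filePath.slice(1)
  return filePath
}

// Hypothetical pre-flight check: fail with a clear "not found" message
// instead of falling through to a misleading codec warning.
function resolveInput(
  filePath: string,
  homeDir: string,
  exists: (candidate: string) => boolean
): string {
  const resolved = expandTilde(filePath, homeDir)
  if (!exists(resolved)) {
    throw new Error(`Input file not found: ${resolved}`)
  }
  return resolved
}
```

In a Node.js CLI this could be called as `resolveInput(arg, os.homedir(), fs.existsSync)`, using the standard `os` and `fs` modules.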
I will close the issue. Let me know if you would like me to raise separate issues for the error message and `~` expansion.
For now I have decided to stick with LibreOffice for my narrow use case of converting Word files to PDF, mainly because it (a) is faster (as it is just rendering), (b) better preserves the original document, and (c) supports more Word-like document formats. I understand that isn't the main purpose of encoda anyway. But your explanations helped me better understand the tool, and it will likely become useful in the future.
Thanks @de-code
> coupled with a perhaps less informative error message
I agree, that needs to be improved. If you could raise another issue for that it would be appreciated.
Hi,
I am not sure whether I misunderstood how it is meant to work.
I downloaded the current latest release, v0.34.3.
When I run `stencila convert` in order to convert a Word document to PDF, it does convert the document but doesn't seem to exit. e.g.:
I also tried that with a minimal document.