Open gusbrs opened 6 years ago
Thanks, that is quite a massive report :o
I will need some time to process it, some issues may be quite hard to fix.
My first findings:
issue with \url inside \thanks: it is due to the fact that tex4ht processes the \title and \author commands using \edef, and \url isn't expandable. I am afraid that we cannot fix it, but it is possible to use \noexpand\url{https://my.site.com/} as a workaround
accented characters as labels - it is best not to use them; it is safe to use only ASCII characters. It is possible to support them using a Unicode engine. For example, make4ht -ul -f odt geopoltest1.tex seems to work and fixes the next issue:
\protect in https:\protect/my.site.com/ - this seems to be inserted by the Brazilian definitions for Babel; LuaTeX fixes that and the URL is correct.
regarding ampersands in the XML file, this was a bug in make4ht: the filter that converts XML entities back to Unicode didn't take into account forbidden characters that break XML validity. I've updated make4ht and it should work now.
there is an issue in the bibliography: one record uses a URL in the form https://doi.org/10.1002/(SICI)1096-987X(199803)19:4<377::AID-JCC1>3.0.CO;2-P, so it contains < and > characters, which break XML. I think that it is a bug in the bib file; all URLs should be in safe form, like https://doi.org/10.1002%2F%28SICI%291096-987X%28199803%2919%3A4%3C377%3A%3AAID-JCC1%3E3.0.CO%3B2-P
there are a lot of issues with footnotes, all of them are doubled. This seems like an issue with KOMA-Script; it is OK with standard classes or in HTML. So I will need to investigate it more.
the eaten spaces are weird, this seems like a bug in the DVI processing; the space in the title is correct if I remove the \vspace command.
I will try to fix these and other issues later.
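A minimal sketch of the first two workarounds mentioned above, for illustration only. The document class, title text, author name and section heading are placeholders and are not taken from the test files; only \noexpand\url{https://my.site.com/} and the ASCII-only label follow what is described in the findings. It has not been checked against every tex4ht version.

```latex
\documentclass{article}
\usepackage{hyperref}

% Workaround for \url inside \thanks: \noexpand keeps \url from being
% expanded when tex4ht processes \title with \edef.
% (Title text and author name are placeholders.)
\title{A placeholder title\thanks{See \noexpand\url{https://my.site.com/}}}
\author{A. Author}

\begin{document}
\maketitle

% ASCII-only label; the heading itself can keep its accented characters.
\section{Introdução}\label{sec:Introducao}

See section~\ref{sec:Introducao}.
\end{document}
```

Such a file could then be compiled the same way as the test file, for example with make4ht -ul -f odt followed by the file name.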
I've also found two entries in biblatex-examples.bib which cause invalid XML - knuth:ct:related and knuth:ct:a. The ODT file can be opened after I removed them. This is definitely a bug in tex4ht.
@michal-h21 Nice to see things going that fast. Thank you very much! I'll be following attentively your comments here and, if need be, will comment back (So far, I have nothing to add to your observations). And, if you reach a point where you want me to test things again, just let me know.
Today I've fixed some issues in the tex4ht sources, in a quest to make the resulting ODT file valid in the ODF validator. I've removed some DTD definitions that didn't really work. There are still some validation issues with math, but I think I am on a good path.
One huge success is that Word can now open the ODT file and display math, which it didn't support up until now. The issue was only a wrong MIME type in the file directory. It is really good that it is no longer necessary to fix the ODT file in LibreOffice.
On the negative side, pandoc cannot convert the ODT file, even though it is perfectly valid. It reports only:
Couldn't parse odt file.
This needs further investigation.
The bad thing is that with every fix I find more bugs, so there are still a lot of things to do.
As you asked (or as was my misunderstanding of your request :), I did some testing for ODT output with make4ht. My approach here was to start from an actual working document of mine, with all the elements I usually employ, and to reduce it to a smaller testing document which retained its complexity and elements. I’ve removed, though, nested tabular/makecell elements, for I wanted to test things with make4ht "vanilla".
Indeed, all testing was done with:
without any additional config or make files. And biber filename as appropriate, of course.
As for environment, tests were done with a full and up-to-date TeX Live 2018, with the current dev version of make4ht on a Linux Mint 18.3, also up-to-date.
The test files are available at: https://gist.github.com/gusbrs/36ea400945e7031096464a8f98e001b4 (Please download them and let me know when you’ve done so. As they were derived from a working document of mine, I don’t want to leave this publicly available.)
There are three files. The first one was built with the above intention in mind, and compiled and tested with pdflatex. Now, this file, as it is, is not really amenable to be built with make4ht. So I had to strip down some things to reach the second file which, as the first, is based on the scrartcl class. The third test file, in turn, is a version of the second one with the standard article class.
What had to be removed from the full document to get results with make4ht
\thanks: \url for \texttt
\label{sec:Introdução} and the corresponding reference lead to errors in compilation, so it was substituted with \label{sec:Introducao}
\nocite{*} leads to problems with ampersands in other parts of the document (and in the bibliography as well). (I have used biblatex-examples.bib for the test files).
With \nocite{*} uncommented, duly escaped \& in TeX input elsewhere ends up in content.xml as raw &, thus breaking ODT output.
The test file has \nocite{*} commented, but you can reproduce the error by uncommenting it. You’ll see that LibreOffice will report an error in some ampersands in a quote environment earlier in the document.
Added hyperref=false to biblatex’s options.
With these changes, we have the second test file, which is compilable and produces reasonable (though improvable) output.
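The ampersand problem described above could probably be reproduced with a reduced file along the following lines. This is only a sketch under assumptions: the quote text is a made-up placeholder, and the preamble is cut down to the pieces named in this report (scrartcl, biblatex with hyperref=false, biblatex-examples.bib, \nocite{*}). It is not one of the actual test files.

```latex
\documentclass{scrartcl}
% hyperref=false is the biblatex option mentioned in the list above.
\usepackage[backend=biber,hyperref=false]{biblatex}
\addbibresource{biblatex-examples.bib}

\begin{document}

% Placeholder text with a duly escaped ampersand.
\begin{quote}
Research \& Development, as an example of an escaped ampersand.
\end{quote}

% Uncomment to pull in every entry of biblatex-examples.bib; in the report
% above this is what made raw "&" characters show up in content.xml.
% \nocite{*}

\printbibliography
\end{document}
```

With \nocite{*} left commented, the bibliography is simply empty; uncommenting it and running biber on the file is the step that corresponds to the failure described in the list.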
Log files (full piped terminal output) for both the second and third test files are available at: https://gist.github.com/gusbrs/f822630ffd09029871401fe54c3746a2
Comments on the second (scrartcl) ODT output
make4ht gobbles space between the lines ("toa" instead of "to a")
\thanks is not appropriate.
abstract environment doesn’t seem to be recognized
\clearpage is not respected (I haven’t forgotten https://tex.stackexchange.com/q/435235/105447, of course. But, as you mentioned there that that solution breaks other things, I report it here as a standing issue)
\nameref is placed after the content (and introduces spurious space in the process) ("Introdução_,")
quoting environment simply vanishes from output (following paragraph is gobbled in the process)
displayquote environments and variants are recognized as regular paragraphs (true, csquotes is configured to use quoting environments), in the process paragraph breaks (empty lines) are gobbled
quotation environment different from quote environment
\ after an abbreviation point to avoid extra "end of sentence spacing" with frenchspacing is turned into a non-breaking space in ODT
floatnotes. The environment appears at the end of the figure environment, but line breaks within floatnotes and between it and the caption are gobbled. The entire floatnotes environment vanishes on table floats.
multicolumn )
In description environments a paragraph break is introduced between label and text
quotation environment and hanging indent in bibliography are very large
quotation environment is not justified
quotation and quote seem to be rendered in a frame/box (I don’t know what it is, nor if it is desirable. But I can’t seem to be able to delete it in the resulting ODT.)
Comments on the third (article) ODT output
Here some things seem to work better:
abstract environment is recognized
But pretty much everything else stands on the same ground.
Comments on the third (article) resulting content.xml
text:span environments (I won’t say this is an "issue", but it would be nice to have a cleaner content.xml. If it is possible for regular paragraphs, why not for the rest?)
content.xml as "invalid". LibreOffice seems to be OK with it (well, it opens the file but, as the confusion with the current language shows, probably not everything is OK) and I don’t know if Emacs would be an authoritative source on the matter, but some consistency check on content.xml might be welcome.
content.xml (including the quoting environment and the missing floatnotes environments), which suggests this is a consistency problem in content.xml. My guess though is that gobbled line and paragraph breaks are gone for good (but those are, of course, much less important).
Well, I hope this testing is useful. Thank you for the great work! And, as usual, I remain at your disposal for discussion and further testing.