Closed jechterhoff closed 1 year ago
Thanks @jechterhoff for reporting this. On first look I do not know why this is happening.
I tried these two snippets locally and I get syntax highlighting for both, in the highlight pattern of the second image.
By any chance can you share the source for reproduction? Thanks!
@ronaldtse The source is actually available in the following GitHub repository: https://github.com/Geonovum/uml2json
Our code prettifier to date has been https://github.com/googlearchive/code-prettify
We constrained the languages we recognised to a small number that we knew (at the time) that Google code-prettify supported; JSON was not in the list. (In fact, it still isn't.)
Because JSON is not in the list, whatever code-prettify is doing, it is doing without any guidance from us: it is inferring the span CSS classes to apply on its own. It is applying spans in the second instance that it doesn't in the first, but I'm not convinced that it actually knows what it's doing in the second instance. I don't know why Ronald is getting different results from me, but I do know that whatever code-prettify is doing, I don't trust it.
As it turns out, the code-prettify repo was archived a month ago, and the URI we are invoking was supposed to have been turned off by Google 3 years ago. It hasn't, and this is clearly still a heavily used script, but it makes sense for us to move on from it, to an actively maintained syntax highlighter.
After looking at the 4 solutions Asciidoctor natively supports, highlight.js, rouge, pygments, and coderay, I'm going to go with rouge. I would rather go with highlight.js frankly, because that's a lot less work for me, but we do want our HTML to be self-contained as much as possible, and server-side also allows us to extend highlighting to DOC. (PDF in principle too, but that might prove more fiddly.)
I'm avoiding pygments because we'd rather not drag python into the compilation chain if we can avoid it, and coderay seems a lot more restricted than rouge, in both language support, and pre-fab skins (which is going to be an issue for HTML styling specific to SDOs.)
All of rouge, pygments, and coderay support line numbering, even though highlight.js refuses to (and there is a very popular plugin that does implement line numbering). I've been reluctant to add line numbering as an attribute to source code, but I see that it is a recurring enough requirement to support, and that Asciidoctor does support it in its integrations (though its rouge support is lagging).
So, I will be switching syntax highlighting in HTML (and Word, and potentially PDF) to Rouge. This will take a few days, but unless some of the issues stuck in the queue become more urgent, I should have this done in time for the next Metanorma release next week.
@opoudjis Sounds good to me. That may also solve #476.
It won't, @jechterhoff, because the quotes are being made smart at an earlier stage of processing. I have not been able to replicate this behaviour in a duplicate issue, and I'll see if I can replicate it in your case.
@Intelligent2013 I am about to move Syntax Highlighting for code snippets from client-side, and HTML-only, to server-side, on compilation of Metanorma.
The syntax highlighting generates HTML with CSS styling in spans. If I do the styling in Presentation XML instead of my HTML output, and convert the output from HTML back to XML (so removing the <br/>), can you process the CSS-styled spans into something sensible in PDF?
You would be getting something like:
<sourcecode>require <span class="s2">"uri"</span> <span class="k">if</span> /^2<span class="se">\.</span>/.match?<span class="o">(</span>....</sourcecode>
or, if we enable line numbers,
<table type="sourcecode">
<tbody>
<tr><td>1<br/>2<br/>3...</td></tr>
<tr><td><sourcecode>...</sourcecode></td></tr>
</tbody>
</table>
I'd end up putting the CSS definitions in the misc-container element for you to access:
".highlight table td { padding: 5px; }\n.highlight table pre { margin: 0; }\n.highlight, .highlight .w {\n color: #303030;\n}\n.highlight .err {\n color: #151515;\n background-color: #ac4142;\n}\n.highlight .c, .highlight .ch, .highlight .cd, .highlight .cm, .highlight .cpf, .highlight .c1, .highlight .cs {\n color: #505050;\n}\n.highlight .cp {\n color: #f4bf75;\n}\n.highlight .nt {\n color: #f4bf75;\n}\n.highlight .o, .highlight .ow {\n color: #d0d0d0;\n}\n.highlight .p, .highlight .pi {\n color: #d0d0d0;\n}\n.highlight .gi {\n color: #90a959;\n}\n.highlight .gd {\n color: #ac4142;\n}\n.highlight .gh {\n color: #6a9fb5;\n background-color: #151515;\n font-weight: bold;\n}\n.highlight .k, .highlight .kn, .highlight .kp, .highlight .kr, .highlight .kv {\n color: #aa759f;\n}\n.highlight .kc {\n color: #d28445;\n}\n.highlight .kt {\n color: #d28445;\n}\n.highlight .kd {\n color: #d28445;\n}\n.highlight .s, .highlight .sb, .highlight .sc, .highlight .dl, .highlight .sd, .highlight .s2, .highlight .sh, .highlight .sx, .highlight .s1 {\n color: #90a959;\n}\n.highlight .sa {\n color: #aa759f;\n}\n.highlight .sr {\n color: #75b5aa;\n}\n.highlight .si {\n color: #8f5536;\n}\n.highlight .se {\n color: #8f5536;\n}\n.highlight .nn {\n color: #f4bf75;\n}\n.highlight .nc {\n color: #f4bf75;\n}\n.highlight .no {\n color: #f4bf75;\n}\n.highlight .na {\n color: #6a9fb5;\n}\n.highlight .m, .highlight .mb, .highlight .mf, .highlight .mh, .highlight .mi, .highlight .il, .highlight .mo, .highlight .mx {\n color: #90a959;\n}\n.highlight .ss {\n color: #90a959;\n}"
If this is doable, let me know, and I'll create a ticket. If this is not doable, let me know, and I'll limit myself to HTML and DOC.
You are already doing syntax highlighting, and while I do want to eliminate redundant processing from the PDF, I can let you continue doing so if you're comfortable to. There is no clear efficiency dividend in this case, the way there is in biblio-tag.
@opoudjis I agree with your proposal. It would be better to use a common approach and common styles in the syntax highlighting. And yes, I can apply CSS to spans in PDF. But there is one restriction: the CSS should be simple, without browser-specific properties like moz-..., etc.
if we enable line numbers,
ok for me too.
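As an illustration of what "simple" CSS makes possible on the consuming side, here is a minimal Ruby sketch that turns a flat stylesheet like the one quoted above into a lookup from token class to declarations. The function name parse_highlight_css is hypothetical, and the sketch assumes there are no @media blocks, nested braces, or comments in the stylesheet; it is not how mn2pdf or isodoc actually process the styles.

```ruby
# Map each Rouge token class (e.g. "k", "s2") to its CSS declarations.
# Assumes a flat stylesheet: no @media blocks, nested braces, or comments.
def parse_highlight_css(css)
  styles = {}
  css.scan(/([^{}]+)\{([^{}]*)\}/) do |selectors, body|
    declarations = body.split(";").map(&:strip).reject(&:empty?).to_h do |decl|
      decl.split(":", 2).map(&:strip)
    end
    selectors.split(",").each do |selector|
      # Only selectors ending in a class name are kept;
      # ".highlight table td" is skipped.
      token_class = selector.strip[/\.([\w-]+)\z/, 1]
      (styles[token_class] ||= {}).merge!(declarations) if token_class
    end
  end
  styles
end

css = ".highlight .err {\n color: #151515;\n background-color: #ac4142;\n}\n" \
      ".highlight .k, .highlight .kn {\n color: #aa759f;\n}"
styles = parse_highlight_css(css)
puts styles["kn"]["color"]             # prints #aa759f
puts styles["err"]["background-color"] # prints #ac4142
```

A lookup of this shape is enough to resolve a span's class attribute to a colour, which is why vendor-prefixed or layout-dependent properties would be awkward to carry over.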
Having to modify the Metanorma default assumption that all tables must have borders: the tables of Rouge line-number display code must not.
The introduction of callouts into code snippets is disruptive, since the code snippets are lexed, and the callouts need to be ignored by the lexer and then restored as real XML markup. Asciidoctor skips callouts in processing snippets, and I need to follow suit.
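The strip, lex, restore sequence described above can be sketched roughly as follows. This is not the Metanorma implementation: it handles only a single trailing marker of the form <1> per line, it assumes the highlighter emits exactly one output line per input line, and both the helper name and the <callout> element it emits are illustrative. A toy lambda stands in for the real lexer.

```ruby
# Trailing Asciidoctor-style callout marker, e.g. `puts "hi" <1>`.
CALLOUT = /\s*<(\d+)>\s*\z/

# `highlighter` is any callable that turns plain code into highlighted markup
# and keeps one output line per input line (an assumption of this sketch).
def highlight_with_callouts(source, highlighter)
  callouts = {}
  bare_lines = source.lines(chomp: true).each_with_index.map do |line, index|
    match = line.match(CALLOUT)
    callouts[index] = match[1] if match
    match ? match.pre_match : line
  end

  highlighted = highlighter.call(bare_lines.join("\n")).split("\n", -1)
  highlighted.each_with_index.map do |line, index|
    # The <callout> element is illustrative, not the exact Metanorma markup.
    callouts[index] ? "#{line} <callout>#{callouts[index]}</callout>" : line
  end.join("\n")
end

# Stand-in for a real lexer: wraps double-quoted strings in a span.
toy_highlighter = lambda do |code|
  code.gsub(/"[^"]*"/) { |s| %(<span class="s2">#{s}</span>) }
end

puts highlight_with_callouts(%(require "uri" <1>\nputs "done"), toy_highlighter)
```

The point of the design is that the lexer never sees the callout marker, so it cannot mistake it for an operator or a tag.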
P.S. I'm using the Igor Pro theme of Rouge for CSS, which IMO is the best of a bad bunch—the Rouge styles are either dark background, which is a non-starter for print-like publication, or irritating pastels. https://spsarolkar.github.io/rouge-theme-preview/
We don't want to overdo colouring, but I don't think Igor Pro colours enough: it doesn't differentiate XML tags (class "nt") from content, for example. (See https://github.com/rouge-ruby/rouge/wiki/List-of-tokens for the classes defined by Rouge.)
We can make our own stylesheet, I'm just not convinced it's worthwhile.
We can import a stylesheet from Pygments, which Rouge has ported its CSS from, and Pygments has more themes: https://pygments.org/styles/ . But I'm not convinced that's worth it either.
See how you go with Igor Pro for now, at any rate.
@opoudjis Can you give a hint as to when the recent changes (in this issue, as well as on the line numbering), are going to be available? Just checking to see if I can test today (which I'd like to do, but no rush) - two weeks of holidays ahead of me.
Attempting to upgrade metanorma via Chocolatey just now tells me I already got the latest version available there (1.6.7).
metanorma --version
results:
Metanorma 1.5.3 Metanorma::Cli 1.6.7 Metanorma::Standoc 2.2.8/IsoDoc 2.3.6 Metanorma::ISO 2.2.4 Metanorma::Iec 2.1.13 Metanorma::IEEE 0.1.3 Metanorma::Ietf 3.0.14 Metanorma::Generic 2.2.5 Metanorma::BIPM 2.1.13 Metanorma::CC 2.1.13 Metanorma::Csa 2.1.13 Metanorma::IHO 0.6.13 Metanorma::M3AAWG 2.1.13 Metanorma::UN 0.9.13 Metanorma::Ogc 2.2.7 Metanorma::ITU 2.1.13
Tuesday. I do releases Monday night Australian time every fortnight, and next release is in 3 days.
You can test ahead of time if you set up your Gemfile to pull all of metanorma, metanorma-standoc and isodoc from GitHub, but bad things might happen: I hold off the integration testing until release.
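For readers who do want to try unreleased code, a Gemfile along the following lines is one way to follow that suggestion. It is only a sketch: the repository URLs and the main branch names are assumptions based on the gem names mentioned above, and metanorma-cli is included on the assumption that it is the gem providing the metanorma command.

```ruby
# Gemfile (sketch): pull the three gems named above from GitHub.
# Repository URLs and branch names are assumptions, not confirmed in this thread.
source "https://rubygems.org"

gem "metanorma-cli" # assumed to provide the `metanorma` executable

gem "metanorma", git: "https://github.com/metanorma/metanorma", branch: "main"
gem "metanorma-standoc", git: "https://github.com/metanorma/metanorma-standoc", branch: "main"
gem "isodoc", git: "https://github.com/metanorma/isodoc", branch: "main"
```

After bundle install, running the tool through bundle exec would pick up these versions rather than the Chocolatey-installed ones, with the caveat already given that unreleased code has not been through integration testing.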
Your last comment is scary enough for me to wait until January. :-) Thanks!
@opoudjis do you have a real example of the source code with line numbers?
This example has the wrong structure: the first tr contains the line numbers, and the second one contains the source code:
<table type="sourcecode">
<tbody>
<tr><td>1<br/>2<br/>3...</td></tr>
<tr><td><sourcecode>...</sourcecode></td></tr>
</tbody>
</table>
The source code sometimes doesn't fit within the page width and automatically wraps onto the next line. In PDF there is no horizontal scrollbar for long lines.
Therefore we need to split each source code line into an individual tr with its line number, i.e.:
<table type="sourcecode">
<tbody>
...
<tr><td>3</td><td><sourcecode><title language="en" format="text/plain">Testbed-12 OWS Context User Guide</title></sourcecode></td></tr>
...
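A row-per-line table of this shape could be generated along the following lines. This is a hedged sketch rather than the isodoc code: the helper name sourcecode_line_table is hypothetical, and it assumes each input line is already highlighted and self-contained, with no span left open across a line break.

```ruby
# Wrap already-highlighted code lines in a table with one row per line,
# so that a wrapped long line cannot push later line numbers out of step.
def sourcecode_line_table(highlighted_lines, lang: nil)
  lang_attr = lang ? %( lang="#{lang}") : ""
  rows = highlighted_lines.each_with_index.map do |line, index|
    "<tr><td>#{index + 1}</td>" \
      "<td><sourcecode#{lang_attr}>#{line}</sourcecode></td></tr>"
  end
  ['<table type="sourcecode">', "<tbody>", *rows, "</tbody>", "</table>"].join("\n")
end

lines = [
  '<span class="k">if</span> ready',
  '  <span class="nb">puts</span> <span class="s2">"ok"</span>',
  '<span class="k">end</span>',
]
puts sourcecode_line_table(lines, lang: "ruby")
```

Because each number shares a row with its own line of code, a long line that wraps only makes that one row taller.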
Rendered PDF example with wrong line numbers:
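As I've mentioned, there are two options for Rouge to do line numbers: a single table, with all the line numbers in one cell and all the code in another, using <br/> to separate lines; or a table row per code line. We have used the former currently.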
The disadvantage of the former is that lines overflow and throw alignment off; Rouge actually warns about this in their readme.
The disadvantage of the latter is that you cannot cut and paste the snippet in isolation from the line numbers, something @jechterhoff explicitly said in #460 was desirable about the current Asciidoc pygments solution (as illustrated in https://docs.ogc.org/per/20-012.html#jsonschema_schemaconversionrules_types_classname_typeidentification ):
<tbody><tr><td class="linenos"><div class="linenodiv"><pre> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19</pre></div></td><td class="code"><pre><span></span><span class="tok-p">{</span>
<span class="tok-nt">"$schema"</span><span class="tok-p">:</span> <span class="tok-s2">"http://json-schema.org/draft-07/schema#"</span><span class="tok-p">,</span>
<span class="tok-nt">"definitions"</span><span class="tok-p">:</span> <span class="tok-p">{</span>
<span class="tok-nt">"Type"</span><span class="tok-p">:</span> <span class="tok-p">{</span>
<span class="tok-nt">"properties"</span><span class="tok-p">:</span> <span class="tok-p">{</span>
<span class="tok-nt">"entityType"</span><span class="tok-p">:</span> <span class="tok-p">{</span>
<span class="tok-nt">"type"</span><span class="tok-p">:</span> <span class="tok-s2">"string"</span>
<span class="tok-p">},</span>
<span class="tok-nt">"property"</span><span class="tok-p">:</span> <span class="tok-p">{</span>
<span class="tok-nt">"type"</span><span class="tok-p">:</span> <span class="tok-s2">"string"</span>
<span class="tok-p">}</span>
<span class="tok-p">},</span>
<span class="tok-nt">"required"</span><span class="tok-p">:</span> <span class="tok-p">[</span>
<span class="tok-s2">"entityType"</span><span class="tok-p">,</span> <span class="tok-s2">"property"</span>
<span class="tok-p">]</span>
<span class="tok-p">}</span>
<span class="tok-p">},</span>
<span class="tok-nt">"$ref"</span><span class="tok-p">:</span> <span class="tok-s2">"#/definitions/Type"</span>
<span class="tok-p">}</span>
</pre></td></tr></tbody>
If we switch to the (clearly more correct) approach of a table row per code line, there are workarounds in HTML that would allow users to select only the code snippet, as shown in https://stackoverflow.com/questions/35738049/select-only-a-single-column-in-a-html-table : user-select: none vs user-select: all.
The second solution https://stackoverflow.com/a/35738248 will indeed work, and is easy for us to insert; but it will only work in HTML, not in DOC. (And @Intelligent2013, I suspect there is no equivalent in PDF; I've certainly found mouse-select behaviour with tables in PDF to be so capricious, I never bother to do it.)
There are Javascript solutions that do this too, e.g. https://stackoverflow.com/a/6619995, but the CSS option is simpler, and we're avoiding JS when we can.
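To make the CSS option concrete, here is a rough Ruby sketch that emits an HTML table whose line-number cells are excluded from text selection. The class names (sourcecode, linenum, code), the constant, and the helper are all hypothetical and are not the markup isodoc actually produces; the only thing taken from the discussion above is the use of user-select: none.

```ruby
# CSS that keeps line numbers out of mouse selection (and so out of copy-paste).
# The -webkit- prefixed form is included for browsers that still need it.
LINE_NUMBER_CSS = <<~CSS
  table.sourcecode td.linenum {
    -webkit-user-select: none;
    user-select: none;
  }
CSS

# Build one HTML row per code line; `lines` must already be HTML-escaped
# or highlighted, since this sketch does no escaping of its own.
def html_code_table(lines)
  rows = lines.each_with_index.map do |line, index|
    %(<tr><td class="linenum">#{index + 1}</td>) +
      %(<td class="code"><pre>#{line}</pre></td></tr>)
  end
  %(<table class="sourcecode"><tbody>#{rows.join}</tbody></table>)
end

puts "<style>#{LINE_NUMBER_CSS}</style>"
puts html_code_table(['{', '  "type": "string"', '}'])
```

Since the rule lives in the stylesheet, nothing in the table content changes, which is also why it cannot carry over to DOC or PDF.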
@jechterhoff Are you ok to proceed with a solution that works only for HTML in making copy-paste easy, but still keeps line numbers aligned?
I've tried manually splitting each source code line into its own tr:
<table type="sourcecode">
<tbody>
<tr>
<td>1</td>
<td><sourcecode lang="ruby"><span class="nb">require</span> <span class="s2">"isodoc/ogc/html_convert"</span></sourcecode></td>
</tr>
<tr>
<td>2</td>
<td><sourcecode lang="ruby"><span class="nb">require</span> <span class="s2">"isodoc/ogc/pdf_convert"</span></sourcecode></td>
</tr>
<tr>
<td>3</td>
<td><sourcecode lang="ruby"><span class="nb">require</span> <span class="s2">"isodoc/ogc/word_convert"</span></sourcecode></td>
</tr>
...
and tried to set fox:header="true" for the 1st column with line numbers (found here https://xmlgraphics.apache.org/fop/2.0/accessibility.html#fox:header):
<fo:table-column column-width="8%" fox:header="true"/>
<fo:table-column column-width="92%"/>
but it does not affect the selection order: both columns are still selected together.
I don't have any ideas on how to solve it. Maybe, again, Apache FOP Intermediate Format post-processing, but if this task is actually a problem...
Let's not make life unnecessarily difficult for you.
People truly do expect weird stuff to happen if they copy paste a table from PDF. If there isn't a quick and easy solution for PDF, we'll ignore it. It's a nice to have, given that expectation. There is at least a workable solution for HTML, which does not involve too much effort.
Needing to change the behaviour of converting carriage returns to <br/> in sourcecode; we don't do so if the sourcecode has already been converted into a table with a row per line.
Will finalise and upload tomorrow; there was more to fix here than I'd expected, both with tabular representation of code lines, and with callouts processing, and especially with CSS manipulation to smooth over the introduction of the table.
Callouts and annotations should be styled in comment style.
Word code tables need styling and debugging.
I'm really not in love with Igor Pro, and will export it as an import in isodoc, which can be overruled locally.
Localised IgorPro to add .nt: XML tags
https://coolors.co/444444-cc00a3-ff0000-c34e00-0000ff-007575-009c00 for list of colours, https://github.com/rouge-ruby/rouge/wiki/List-of-tokens for list of classes
Token name | Token shortname | Description | Current stylesheet |
---|---|---|---|
Text | | Any type of text data | Onyx |
Text.Whitespace | w | Specially highlighted whitespace | Onyx |
Error | err | Lexer errors | |
Other | x | Token for data not matched by a parser (e.g. HTML markup in PHP code) | |
Keyword | k | Any keyword | Blue |
Keyword.Constant | kc | Keywords that are constants | Burnt Orange |
Keyword.Declaration | kd | Keywords used for variable declaration (e.g. var in javascript) | Blue |
Keyword.Namespace | kn | Keywords used for namespace declarations | Blue |
Keyword.Pseudo | kp | Keywords that aren't really keywords | Blue |
Keyword.Reserved | kr | Keywords which are reserved (such as end in Ruby) | Skobeloff |
Keyword.Type | kt | Keywords which refer to a type id (such as int in C) | Blue |
Name | n | Variable/function names | |
Name.Attribute | na | Attributes (in HTML for instance) | |
Name.Builtin | nb | Builtin names which are available in the global namespace | Burnt Orange |
Name.Builtin.Pseudo | bp | Builtin names that are implicit (such as self in Ruby) | Burnt Orange |
Name.Class | nc | For class declaration | |
Name.Constant | no | For constants | |
Name.Decorator | nd | For decorators in languages such as Python or Java | |
Name.Entity | ni | Token for entities such as in HTML | |
Name.Exception | ne | Exceptions and errors (e.g. ArgumentError in Ruby) | |
Name.Function | nf | Function names | |
Name.Property | py | Token for properties | |
Name.Label | nl | For label names | |
Name.Namespace | nn | Token for namespaces | |
Name.Other | nx | For other names | |
Name.Tag | nt | Tag mainly for markup such as XML or HTML | Blue added |
Name.Variable | nv | Token for variables | |
Name.Variable.Class | vc | Token for class variables (e.g. @@var in Ruby) | |
Name.Variable.Global | vg | For global variables (such as $LOAD_PATH in Ruby) | |
Name.Variable.Instance | vi | Token for instance variables (such as @var in Ruby) | |
Literal | l | Any literal (if not further defined) | |
Literal.Date | ld | Date literals | Slimy Green added |
Literal.String | s | String literals | Slimy Green |
Literal.String.Backtick | sb | String enclosed in backticks | Slimy Green |
Literal.String.Char | sc | Token type for single characters | Slimy Green |
Literal.String.Doc | sd | Documentation strings (such as in Python) | Slimy Green |
Literal.String.Double | s2 | Double quoted strings | Slimy Green |
Literal.String.Escape | se | Escaped sequences in strings | Slimy Green |
Literal.String.Heredoc | sh | For "heredoc" strings (e.g. in Ruby) | Slimy Green |
Literal.String.Interpol | si | For interpolated parts in strings (e.g. in Ruby) | Slimy Green |
Literal.String.Other | sx | Token type for any other strings (for example %q{foo} string constructs in Ruby) | Slimy Green |
Literal.String.Regex | sr | Regular expressions literals | Slimy Green |
Literal.String.Single | s1 | Single quoted strings | Slimy Green |
Literal.String.Symbol | ss | Symbols (such as :foo in Ruby) | Slimy Green |
Literal.Number | m | Any number literal (if not further defined) | |
Literal.Number.Float | mf | Float numbers | |
Literal.Number.Hex | mh | Hexadecimal numbers | |
Literal.Number.Integer | mi | Integer literals | |
Literal.Number.Integer.Long | il | Long integer literals | |
Literal.Number.Oct | mo | Octal literals | |
Literal.Number.Hex | mx | Hexadecimal literals | |
Literal.Number.Bin | mb | Binary literals | |
Operator | o | Operators (commonly +, -, /, *) | |
Operator.Word | ow | Word operators (e.g. and) | |
Punctuation | p | Punctuation which is not an operator | |
Comment | c | Single line comments | Red |
Comment.Multiline | cm | Multiline comments | Red |
Comment.Preproc | cp | Preprocessor comments such as <% %> in ERb | Byzantine |
Comment.Single | c1 | Comments that end at the end of the line | Red |
Comment.Special | cs | Special data in comments such as @license in Javadoc | Byzantine |
Generic | g | Unstyled token | |
Generic.Deleted | gd | Token value as deleted | |
Generic.Emph | ge | Token value as emphasized | |
Generic.Error | gr | Token value as an error message | |
Generic.Heading | gh | Token value as a headline | |
Generic.Inserted | gi | Token value as inserted | |
Generic.Output | go | Marked as a program output | |
Generic.Prompt | gp | Marked as a command prompt | |
Generic.Strong | gs | Mark the token value as bold (for rst lexer) | |
Generic.Subheading | gu | Marked as a subheadline | |
Generic.Traceback | gt | Mark the token as a part of an error traceback | |
Generic.Lineno | gl | Line numbers | |
@Intelligent2013 just to ensure that we're aligned, please find enclosed updated OGC Presentation XML.
There is a problem with using a <table> for source code: when people try to copy and paste it, the result on the clipboard will include those line numbers.
If you have used GitHub's file compare feature, you will notice that a copy and paste does not include the line numbers, which is the correct behavior. This desired behavior applies to all output formats, including HTML and PDF.
If the line numbers do exist, they should be purely presentational -- they are NOT part of the content, and we should not pollute source code content with tables.
but it will only work on HTML, not in DOC
OGC only cares about HTML and PDF. Technically, OGC should not care about DOC output.
We've discussed this offline. The line numbers are indeed limited to the Presentation XML, as is the syntax colouring. The Semantic XML contains just the source code. The HTML is preventing line numbers being copied, without resorting to Javascript. I am not optimistic there is any solution that will work with PDF.
@Intelligent2013 just to ensure that we're aligned, please find enclosed updated OGC Presentation XML.
@opoudjis thanks, it's ok for me:
I have an idea how to omit line numbers from copy-pasting from PDF. I'll try to render the line numbers as mathml in SVG (or just SVG).
Line numbers are converted on the fly to MathML in XSLT:
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mtext>1</mtext>
</math>
and text selection is working for source code only now:
MathML text renders a bit bolder than the main text, so I will investigate it...
Example: a2.presentation.pdf
... I am in complete awe of you @Intelligent2013
Releasing tomorrow, am going on holiday for a week.
MathML text renders a bit bolder than the main text, so I will investigate it...
I think the issue is in jEuclid and Batik. I'll try to investigate it in a few hours, but if I can't, then we keep the SVG source code line numbers a bit bold.
I think the issue is in jEuclid and Batik. I'll try to investigate it in a few hours, but if I can't, then we keep the SVG source code line numbers a bit bold.
I'll solve this issue in https://github.com/metanorma/mn2pdf/issues/117, need more time. Current result is https://github.com/metanorma/metanorma-ogc/issues/465#issuecomment-1366775899.
I'm doing release now; those results are acceptable for the time being.
@opoudjis You asked:
@jechterhoff Are you ok to proceed with a solution that works only for HTML in making copy-paste easy, but still keeps line numbers aligned?
Yes, that would be ok for me. If I understand the conversation correctly, a solution for PDF format may have been found in the meantime. :-) Sorry for my late reply. I've been on holiday for two weeks.
@opoudjis May I ask when / if this enhancement is released for Windows users that installed metanorma using Chocolatey?
Executing choco upgrade metanorma -y did not update any package.
Result of metanorma --version is:
Metanorma 1.5.3 Metanorma::Cli 1.6.7 Metanorma::Standoc 2.2.8/IsoDoc 2.3.6 Metanorma::ISO 2.2.4 Metanorma::Iec 2.1.13 Metanorma::IEEE 0.1.3 Metanorma::Ietf 3.0.14 Metanorma::Generic 2.2.5 Metanorma::BIPM 2.1.13 Metanorma::CC 2.1.13 Metanorma::Csa 2.1.13 Metanorma::IHO 0.6.13 Metanorma::M3AAWG 2.1.13 Metanorma::UN 0.9.13 Metanorma::Ogc 2.2.7 Metanorma::ITU 2.1.13
isodoc is currently on 2.4.2, and metanorma-standoc on 2.3.6. The version you've just updated to dates from November.
@ronaldtse Is choco still being kept up to date?
@opoudjis the Windows package is failing to build because of the recent addition of Rouge, which is broken on Windows.
@CAMOBAP is fixing it at:
FYI missing 1.6.11 release for windows is in progress now https://github.com/metanorma/chocolatey-metanorma/actions/runs/3862089946
Thanks. The upgrade to 1.6.11 was successful.
The weird behavior, that syntax highlighting for JSON did not always work, no longer occurs in my document.
When converting an OGC draft Best Practice document with JSON source blocks, some of these blocks end up in all green text, while others get what looks like proper syntax highlighting. When trying the CodeRay syntax highlighting online, both cases work as expected.
@ronaldtse: Do you have any idea why this could be the case?
Here are some screenshots:
Example 1: definitions schema
Defined in adoc as:
HTML compiled by metanorma looks like this:
The CodeRay online example shows proper syntax highlighting: http://coderay.rubychan.de/rays/9369
Example 2: definitions schema
Defined in adoc as:
HTML compiled by metanorma looks like this:
CodeRay online example: http://coderay.rubychan.de/rays/9370
metanorma --version:
Metanorma 1.5.3 Metanorma::Cli 1.6.7 Metanorma::Standoc 2.2.8/IsoDoc 2.3.6 Metanorma::ISO 2.2.4 Metanorma::Iec 2.1.13 Metanorma::IEEE 0.1.3 Metanorma::Ietf 3.0.14 Metanorma::Generic 2.2.5 Metanorma::BIPM 2.1.13 Metanorma::CC 2.1.13 Metanorma::Csa 2.1.13 Metanorma::IHO 0.6.13 Metanorma::M3AAWG 2.1.13 Metanorma::UN 0.9.13 Metanorma::Ogc 2.2.7 Metanorma::ITU 2.1.13
coderay --version: CodeRay 1.1.3